JP2014206828A

JP2014206828A - Data serialization device, data serialization method and data serialization program

Info

Publication number: JP2014206828A
Application number: JP2013083289A
Authority: JP
Inventors: 誠中山; Makoto Nakayama; 田中　聡; Satoshi Tanaka; 聡田中
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2013-04-11
Filing date: 2013-04-11
Publication date: 2014-10-30

Abstract

PROBLEM TO BE SOLVED: To accelerate serialization processing of tree structure data.SOLUTION: A data serialization device 1 comprises: a weight determination section 104 for determining a weight based on a size of a partial tree for each partial tree with each node as a root node; a division section 105 for dividing tree structure data into a plurality of node strings each formed from one or more nodes in a continuous serialization order on the basis of the weight of each partial tree determined by the weight determination section 104; a thread allocation section 106 for allocating one thread to each of the plurality of node strings generated by the division by the division section 105; a serialization section 107 for serializing data corresponding to the nodes included in the node string into byte strings in the serialization order in parallel for each thread; a coupling section 108 for coupling serialization results for each of threads obtained by the serialization section 107 in the serialization order; and an output section 109 for outputting a serialization result coupled by the coupling section 108.

Description

本発明は、特に木構造データを直列化するデータ直列化装置、データ直列化方法、及びデータ直列化プログラムに関する。 The present invention particularly relates to a data serialization device, a data serialization method, and a data serialization program for serializing tree structure data.

近年、ビッグデータと呼ばれる膨大な量のデータを扱うための大規模データ処理基盤として、例えばオープンソースソフトウェアのHadoop等が広く利用されている。ビッグデータの処理においては、少しでもデータサイズを縮小するために、データをテキスト形式ではなくバイナリ形式で保存する方法が主流である。そのために、Hadoopでは、SequecneFile（下記非特許文献１参照）及びProtocolBuffers（下記非特許文献２参照）等のシリアライザ／デシリアライザ（Serializer/Deserializer）が用いられる。シリアライザは、HDD（Hard Disk Drive）やSSD(Solid State Drive)といったストレージへの保存及びネットワーク上での送受信等のために、計算機内のメモリ上のデータをバイト列に変換（直列化）する技術である。デシリアライザは、シリアライザにより直列化されたバイト列から元のデータを復元する技術である。SequenceFile及びProtocolBuffers等は、木構造データとして表現した個々のレコード（葉ノード、及び、根ノードを含む中間ノード）を、TLV（Type-Length-Value）形式によるバイナリ形式のバイト列に直列化する。 In recent years, for example, open source software Hadoop has been widely used as a large-scale data processing platform for handling a huge amount of data called big data. In the processing of big data, in order to reduce the data size as much as possible, a method of storing data in binary format instead of text format is the mainstream. Therefore, in Hadoop, serializers / deserializers (Serializer / Deserializer) such as SequenceFile (see Non-Patent Document 1 below) and ProtocolBuffers (see Non-Patent Document 2 below) are used. A serializer is a technology that converts (serializes) data in the memory of a computer into a byte sequence for storage in storage such as HDD (Hard Disk Drive) and SSD (Solid State Drive) and transmission / reception on the network. It is. The deserializer is a technique for restoring original data from a byte sequence serialized by the serializer. SequenceFile, ProtocolBuffers, and the like serialize individual records (leaf nodes and intermediate nodes including root nodes) expressed as tree structure data into a binary byte string in TLV (Type-Length-Value) format.

Tom White著、“Hadoop: The Definitive Guide”、O'Reilly Media、２００９年５月発行、ｐ.１１１Tom White, “Hadoop: The Definitive Guide”, O'Reilly Media, May 2009, p.111 Google、“ProtocolBuffers”、［ｏｎｌｉｎｅ］、インターネット＜ＵＲＬ：http://code.google.com/p/protobuf/＞Google, “ProtocolBuffers”, [online], Internet <URL: http://code.google.com/p/protobuf/>

ところで、直列化すべき元データのサイズが大きい場合、直列化に要する処理時間も当然長くなる。近年、一台の計算機に搭載されるＣＰＵコアの数が増えてきており、その計算資源を有効活用するために複数スレッドを使用したプログラムが開発されている。そこで、バイナリ形式への直列化処理を複数スレッドで並列処理することで処理を高速化できると考えられる。しかしながら、非特許文献１及び２に示されるSequenceFile及びProtocolBuffers等は、単一スレッドで動作するように設計及び実装されており、直列化処理を複数スレッドで並列処理することはできないという問題がある。 By the way, when the size of the original data to be serialized is large, the processing time required for serialization naturally becomes longer. In recent years, the number of CPU cores mounted on one computer has increased, and a program using a plurality of threads has been developed in order to effectively use the calculation resources. Therefore, it is considered that the processing can be speeded up by parallelizing the serialization processing to the binary format with a plurality of threads. However, SequenceFile, ProtocolBuffers, and the like shown in Non-Patent Documents 1 and 2 are designed and implemented so as to operate with a single thread, and there is a problem that serialization processing cannot be performed in parallel with a plurality of threads.

そこで、本発明は、木構造データの直列化処理を高速化することができるデータ直列化装置、データ直列化方法、及びデータ直列化プログラムを提供することを目的とする。 Accordingly, an object of the present invention is to provide a data serialization device, a data serialization method, and a data serialization program that can speed up the serialization processing of tree structure data.

本発明に係るデータ直列化装置は、値を保持する葉ノードと、葉ノード又は他の中間ノードを子ノードとして保持する中間ノードと、親ノードを持たない唯一の中間ノードである根ノードと、を有する木構造データを、予め定めた直列化順序で直列化するデータ直列化装置であって、各ノードを根ノードとする部分木毎に、当該部分木の大きさに基づく重みを決定する重み決定手段と、重み決定手段により決定された各部分木の重みに基づいて、木構造データを、それぞれ直列化順序が連続する一以上のノードからなる複数のノード列に分割する分割手段と、分割手段により分割されて生成された複数のノード列のそれぞれに対して１つのスレッドを割り当てるスレッド割当手段と、スレッド毎に並列して、ノード列に含まれる各ノードに対応するデータを直列化順序でバイト列に直列化する直列化手段と、直列化手段によって得られたスレッド毎の直列化結果を直列化順序で結合する結合手段と、結合手段により結合された直列化結果を出力する出力手段と、を備える。 A data serialization apparatus according to the present invention includes a leaf node that holds a value, an intermediate node that holds a leaf node or another intermediate node as a child node, and a root node that is the only intermediate node that does not have a parent node; Is a data serialization device that serializes tree-structured data having a predetermined serialization order, and determines a weight based on the size of the subtree for each subtree having each node as a root node A dividing unit that divides the tree-structured data into a plurality of node sequences each including one or more nodes that are serialized in sequence based on the weight of each subtree determined by the weight determining unit; Thread allocation means for allocating one thread to each of a plurality of node sequences generated by the means, and each node included in the node sequence in parallel for each thread. Serializing means for serializing data to be byte-sequenced in serialization order, coupling means for combining serialized results for each thread obtained by the serializing means in serialization order, and serialization coupled by the coupling means Output means for outputting a result.

本発明に係るデータ直列化方法は、値を保持する葉ノードと、葉ノード又は他の中間ノードを子ノードとして保持する中間ノードと、親ノードを持たない唯一の中間ノードである根ノードと、を有する木構造データを、予め定めた直列化順序で直列化するデータ直列化装置により実行されるデータ直列化方法であって、各ノードを根ノードとする部分木毎に、当該部分木の大きさに基づく重みを決定する重み決定ステップと、重み決定ステップにおいて決定された各部分木の重みに基づいて、木構造データを、それぞれ直列化順序が連続する一以上のノードからなる複数のノード列に分割する分割ステップと、分割ステップにおいて分割されて生成された複数のノード列のそれぞれに対して１つのスレッドを割り当てるスレッド割当ステップと、スレッド毎に並列して、ノード列に含まれる各ノードに対応するデータを直列化順序でバイト列に直列化する直列化ステップと、直列化ステップにおいて得られたスレッド毎の直列化結果を直列化順序で結合する結合ステップと、結合ステップにおいて結合された直列化結果を出力する出力ステップと、を含む。 The data serialization method according to the present invention includes a leaf node that holds a value, an intermediate node that holds a leaf node or another intermediate node as a child node, and a root node that is the only intermediate node that does not have a parent node; Is a data serialization method executed by a data serialization device that serializes tree-structured data in a predetermined serialization order, for each subtree having each node as a root node, the size of the subtree A weight determination step for determining a weight based on the length, and a plurality of node sequences each including one or more nodes in which the serialization order is continuous based on the weight of each subtree determined in the weight determination step A dividing step for dividing into two, and a thread assigning step for assigning one thread to each of a plurality of node sequences generated by dividing in the dividing step In parallel for each thread, a serialization step for serializing data corresponding to each node included in the node sequence into a byte sequence in the serialization order, and serialization results for each thread obtained in the serialization step are serialized A combining step for combining in a normalization order, and an output step for outputting the serialized result combined in the combining step.

本発明に係るデータ直列化プログラムは、値を保持する葉ノードと、葉ノード又は他の中間ノードを子ノードとして保持する中間ノードと、親ノードを持たない唯一の中間ノードである根ノードと、を有する木構造データを、予め定めた直列化順序で直列化するデータ直列化装置に設けられたコンピュータを、各ノードを根ノードとする部分木毎に、当該部分木の大きさに基づく重みを決定する重み決定手段と、重み決定手段により決定された各部分木の重みに基づいて、木構造データを、それぞれ直列化順序が連続する一以上のノードからなる複数のノード列に分割する分割手段と、分割手段により分割されて生成された複数のノード列のそれぞれに対して１つのスレッドを割り当てるスレッド割当手段と、スレッド毎に並列して、ノード列に含まれる各ノードに対応するデータを直列化順序でバイト列に直列化する直列化手段と、直列化手段によって得られたスレッド毎の直列化結果を直列化順序で結合する結合手段と、結合手段により結合された直列化結果を出力する出力手段、として機能させる。 A data serialization program according to the present invention includes a leaf node that holds a value, an intermediate node that holds a leaf node or another intermediate node as a child node, and a root node that is the only intermediate node that does not have a parent node; A computer provided in a data serialization device for serializing tree-structured data having a predetermined serialization order is assigned a weight based on the size of the subtree for each subtree having each node as a root node. Weight determining means for determining, and dividing means for dividing the tree-structured data into a plurality of node sequences each consisting of one or more nodes in which the serialization order is continuous based on the weight of each subtree determined by the weight determining means A thread allocating unit that allocates one thread to each of a plurality of node sequences generated by the dividing unit, and a node in parallel for each thread. Serializing means for serializing data corresponding to each node included in the byte sequence in serialization order, combining means for combining serialized results for each thread obtained by the serializing means in serialization order, and combining It functions as an output means for outputting the serialized result combined by the means.

このような形態によれば、木構造データの各部分木の大きさに基づく重みに基づいて分割境界が決定され、当該分割境界で分割されたノード列毎に、それぞれ独立したスレッドが割り当てられる。これにより、ノード列毎の直列化処理が各スレッドにおいて並列実行され、最終的に、スレッド毎に得られた直列化結果が結合されて出力される。このように木構造データの直列化処理を複数スレッドで並列的に処理することにより、木構造データの直列化処理を高速化することができる。 According to such a form, the division boundary is determined based on the weight based on the size of each partial tree of the tree structure data, and an independent thread is assigned to each node string divided at the division boundary. As a result, serialization processing for each node sequence is executed in parallel in each thread, and finally the serialization results obtained for each thread are combined and output. In this way, the tree structure data serialization process is processed in parallel by a plurality of threads, so that the tree structure data serialization process can be speeded up.

上記データ直列化装置では、直列化手段が、木構造データにおいて複数回出現する等価な部分木を示す重複部分木単位に所定の圧縮処理を実行した上で直列化を実行し、分割手段が、重複部分木に含まれるノードと、直列化順序における当該ノードの直前のノードとを分割しないように、木構造データを分割する構成としてもよい。 In the data serialization apparatus, the serialization means performs serialization after executing a predetermined compression process on overlapping subtree units indicating equivalent subtrees that appear multiple times in the tree-structured data, and the dividing means includes: The tree structure data may be divided so that the node included in the overlapping subtree and the node immediately before the node in the serialization order are not divided.

重複部分木単位の圧縮処理としては、例えば、辞書式圧縮及び連長圧縮（ＲＬＥ：Ｒｕｎ−Ｌｅｎｇｔｈ−Ｅｎｃｏｄｉｎｇ）等がある。直列化処理をこのような圧縮処理と併せて実行する場合において、連長圧縮対象となる複数の重複部分木が別々のスレッドに割り当てられると、各スレッドは独立して処理を行うため、連長圧縮が適切に実行されないおそれがある。また、辞書式圧縮対象となる同一の重複部分木に含まれる複数のノードが別々のスレッドに割り当てられると、一方のスレッドと他方のスレッドとで矛盾する処理がされてデータに矛盾が生じる問題を惹き起こすおそれがある。上記データ直列化装置では、分割手段が、重複部分木に含まれるノードと、直列化順序における当該ノードの直前のノードとを分割しないように木構造データを分割することにより、上述の問題を回避することができる。 Examples of the compression processing in units of overlapping subtrees include lexicographic compression and run-length-encoding (RLE). When serialization processing is executed in combination with such compression processing, if multiple overlapping subtrees that are subject to continuous length compression are assigned to different threads, each thread performs processing independently. Compression may not be performed properly. In addition, when multiple nodes included in the same overlapping subtree to be lexicographically compressed are assigned to different threads, there is a problem that data inconsistency occurs due to inconsistent processing between one thread and the other thread. There is a risk of being aroused. In the above data serialization device, the dividing means avoids the above-mentioned problem by dividing the tree structure data so as not to divide the node included in the overlapping subtree and the node immediately before the node in the serialization order. can do.

上記データ直列化装置では、直列化手段が、所定の圧縮処理のうちに含まれる圧縮処理の一つとして、重複部分木に対応するデータと符号とを予め対応付けて記憶している辞書テーブルに基づいて重複部分木に対応するデータを符号に置き換える辞書式圧縮を実行し、分割手段が、直列化手段により辞書式圧縮が実行される対象となる部分木の根ノードと、直列化順序における当該根ノードの直前のノードとを分割することを許容して、木構造データを分割する構成としてもよい。 In the data serialization apparatus, the serialization means stores a dictionary table in which data and codes corresponding to overlapping subtrees are stored in association with each other as one of the compression processes included in the predetermined compression process. Lexical compression that replaces the data corresponding to the overlapping subtree with a code based on the root node of the subtree for which the lexicographic compression is executed by the serializing unit, and the root node in the serialization order The tree structure data may be divided by allowing the node immediately before the node to be divided.

木構造データが、辞書式圧縮が実行される対象となる部分木の根ノードの直前位置で分割されたとしても、上述のような圧縮処理における問題は生じない。したがって、上記データ直列化装置によれば、このような位置での分割を許容することで、木構造データの分割位置をより柔軟に決定することができる。 Even if the tree-structured data is divided at a position immediately before the root node of the subtree to be subjected to lexicographic compression, the above-described problem in the compression processing does not occur. Therefore, according to the data serialization apparatus, by allowing the division at such a position, the division position of the tree structure data can be determined more flexibly.

上記データ直列化装置では、重み決定手段が、所定の圧縮処理が実行される対象となる重複部分木の重みを、所定の圧縮処理が実行されない場合よりも小さくする構成としてもよい。 In the data serialization apparatus, the weight determining unit may be configured to make the weight of the overlapping subtree for which the predetermined compression process is executed smaller than when the predetermined compression process is not executed.

直列化処理において併せて重複部分木に対する圧縮処理がされる場合には、重複部分木単位で所定の符号及び記号等に置き換えられる。すなわち、圧縮処理が実行される対象となる重複部分木に対する直列化処理は、実質的に当該重複部分木の根ノードに対してのみ行われることなるため、圧縮処理がされない場合と比較して、処理量が減少することが期待される。したがって、上記データ直列化装置によれば、所定の圧縮処理が実行される対象となる重複部分木に対して、直列化における実際の処理量に応じたより適切な重み付けを決定することができる。 When compression processing is performed on overlapping subtrees in the serialization processing, the code is replaced with a predetermined code, symbol, or the like in units of overlapping subtrees. That is, since the serialization processing for the overlapping subtree to be subjected to compression processing is substantially performed only on the root node of the overlapping subtree, the processing amount is compared with the case where the compression processing is not performed. Is expected to decrease. Therefore, according to the data serialization apparatus, it is possible to determine a more appropriate weight according to the actual processing amount in the serialization for the overlapping subtree to be subjected to the predetermined compression process.

上記データ直列化装置では、分割手段が、重み決定手段により決定された各部分木の重みに基づいて定まるノード列毎の重みがノード列間で均等化されるように、木構造データを分割する構成としてもよい。 In the data serialization apparatus, the dividing unit divides the tree structure data so that the weight for each node sequence determined based on the weight of each subtree determined by the weight determining unit is equalized between the node sequences. It is good also as a structure.

上記データ直列化装置によれば、分割手段によって各スレッドに割り当てられるノード列毎の重みを均等化するように木構造データを分割することで、各スレッドに割り当てられる処理量を均等化できる。これにより、特定のスレッドに処理負荷が偏ってしまうことを回避し、全スレッドの処理が完了するまでの時間を短縮することができる。 According to the above-described data serialization apparatus, the processing amount allocated to each thread can be equalized by dividing the tree structure data so as to equalize the weight for each node sequence allocated to each thread by the dividing unit. As a result, it is possible to avoid the processing load being biased toward a specific thread, and to shorten the time until the processing of all threads is completed.

上記データ直列化装置では、分割手段が、重み決定手段により決定された各部分木の重みに基づいて定まるノード列毎の重みが直列化順序の若いノード列ほど小さくなるように、木構造データを分割する構成としてもよい。 In the above data serialization apparatus, the dividing means reduces the tree structure data so that the weight for each node string determined based on the weight of each subtree determined by the weight determining means becomes smaller as the node string has a smaller serialization order. It is good also as a structure to divide | segment.

例えば、直列化結果同士の結合方法として、スレッド毎に得られた直列化結果を入力し、入力された直列化結果に対して所定の処理（例えば圧縮処理）を順次実行した上で、直列化結果を結合することが考えられる。このような場合には、例えば、木構造データの一部分の直列化結果に対する所定の処理が完了する前に、当該木構造データの一部分に結合される次の部分についての直列化処理が完了してしまうと、当該次の部分に対する所定の処理を直ちに開始することができないため処理待ち時間が発生する。上記データ直列化装置によれば、木構造データの直列化順序の若いノード列ほど処理量（重み）が小さくなるように木構造データを分割することで、木構造データの前方に位置するノード列に対する直列化処理が、後方に位置するノード列よりも先に完了するようになる。これにより、全体の処理において、直列化処理と所定の処理とを一部並列的に実行させることができ、全体の処理効率を向上させることができる。 For example, as a method of combining serialization results, serialization results obtained for each thread are input, serialized after sequentially executing predetermined processing (for example, compression processing) on the input serialization results. It is conceivable to combine the results. In such a case, for example, before the predetermined process for the serialization result of a part of the tree structure data is completed, the serialization process for the next part combined with the part of the tree structure data is completed. If this happens, a predetermined processing time for the next portion cannot be started immediately, and a processing waiting time occurs. According to the above data serialization device, the node sequence located in front of the tree structure data is divided by dividing the tree structure data so that the processing amount (weight) becomes smaller as the node sequence in the serialization order of the tree structure data becomes smaller. The serialization process for is completed before the node sequence located behind. Thereby, in the whole process, a serialization process and a predetermined process can be partially executed in parallel, and the whole process efficiency can be improved.

上記データ直列化装置では、結合手段が、スレッド毎の直列化結果を直列化順序で一つずつ入力して所定のデータ圧縮処理を実行した上で、スレッド毎の直列化結果を結合する構成としてもよい。 In the data serialization apparatus, the combining means inputs the serialization results for each thread one by one in the serialization order, executes a predetermined data compression process, and then combines the serialization results for each thread. Also good.

スレッド毎に得られた直列化結果同士を結合する方法としては、単純にこれらを結合する以外に、例えば、前方部分の直列化結果から順次ＺＩＰ圧縮等の従来のデータ圧縮処理を実行した上で結合する方法等がある。上記データ直列化装置によれば、木構造データの直列化順序の若い部分ほど処理量（重み）が小さくなるように木構造データを分割することで、木構造データの前方部分から先に直列化処理を完了させることができる。これにより、全体の処理において、直列化処理と所定のデータ圧縮処理とを一部並列的に実行させることができ、全体の処理効率を向上させることができる。 As a method of combining serialized results obtained for each thread, in addition to simply combining them, for example, after executing conventional data compression processing such as ZIP compression sequentially from the serialized result of the front part There are ways to combine them. According to the above data serialization apparatus, the tree structure data is divided so that the processing amount (weight) becomes smaller as the part of the tree structure data in the serialization order becomes smaller, so that the tree structure data is serialized first from the front part. Processing can be completed. Thereby, in the whole process, a serialization process and a predetermined data compression process can be partially executed in parallel, and the whole processing efficiency can be improved.

本発明によれば、木構造データの直列化処理を高速化することができる。 According to the present invention, it is possible to speed up the serialization processing of tree structure data.

本発明の一実施形態に係るデータ直列化装置の機能構成を示すブロック図である。It is a block diagram which shows the function structure of the data serialization apparatus which concerns on one Embodiment of this invention. データ直列化装置のハードウェア構成を示す図である。It is a figure which shows the hardware constitutions of a data serialization apparatus. 木構造データの例を示す図である。It is a figure which shows the example of tree structure data. 図３に示す木構造データを単一スレッドで直列化した場合における直列化結果を示す図である。It is a figure which shows the serialization result when the tree structure data shown in FIG. 3 are serialized by a single thread. 図３に示す木構造データを３つのノード列に分割する場合の分割境界の例を示す図である。It is a figure which shows the example of the division | segmentation boundary in the case of dividing | segmenting the tree-structure data shown in FIG. 3 into three node rows. 図３に示す木構造データを図５に示す分割境界で分割したスレッド毎に直列化した場合における直列化結果を示す図である。FIG. 6 is a diagram showing a serialization result when the tree structure data shown in FIG. 3 is serialized for each thread divided at the division boundary shown in FIG. 5. 図３に示す木構造データを単一スレッドで辞書式圧縮及び連長圧縮した上で直列化した場合における直列化結果を示す図である。FIG. 4 is a diagram showing a serialization result when the tree structure data shown in FIG. 3 is serialized after lexicographic compression and continuous length compression with a single thread. 図３に示す木構造データを図５に示す分割境界で分割したスレッド毎に辞書式圧縮及び連長圧縮した上で直列化した場合における直列化結果を示す図である。FIG. 6 is a diagram showing a serialization result when the tree structure data shown in FIG. 3 is serialized after lexicographic compression and continuous length compression for each of the threads divided at the division boundary shown in FIG. 5. 図３に示す木構造データを辞書式圧縮及び連長圧縮を考慮して分割する場合の分割境界の例を示す図である。It is a figure which shows the example of the division | segmentation boundary in the case of dividing | segmenting the tree structure data shown in FIG. 3 considering lexicographic compression and continuous length compression. 図３に示す木構造データを図９に示す分割境界で分割したスレッド毎に辞書式圧縮及び連長圧縮した上で直列化した場合における直列化結果を示す図である。FIG. 10 is a diagram showing a serialization result when the tree structure data shown in FIG. 3 is serialized after lexicographic compression and continuous length compression for each thread divided at the division boundary shown in FIG. 9. 各スレッド間で負荷が均等になるように決定した分割境界の例を示す図である。It is a figure which shows the example of the division | segmentation boundary determined so that load might be equal between each thread | sled. 図３に示す木構造データを図１１に示す分割境界で分割したスレッド毎に辞書式圧縮及び連長圧縮した上で直列化した場合における直列化結果を示す図である。FIG. 12 is a diagram illustrating a serialization result when the tree structure data illustrated in FIG. 3 is serialized after being lexicographically compressed and run-length-compressed for each thread divided at the division boundary illustrated in FIG. 11. スレッド間での負荷に傾斜をつけることによる効果を説明するための図である。It is a figure for demonstrating the effect by giving the inclination to the load between threads. データ直列化装置の動作を示すフロー図である。It is a flowchart which shows operation | movement of a data serialization apparatus. 図９に示す分割境界を決定する処理を含む分割処理を示すフロー図である。It is a flowchart which shows the division | segmentation process including the process which determines the division | segmentation boundary shown in FIG. 実施形態に係るデータ直列化プログラムの機能構成を示す図である。It is a figure which shows the function structure of the data serialization program which concerns on embodiment.

以下、添付図面を参照しながら本発明の実施形態を詳細に説明する。なお、図面の説明において同一又は同等の要素には同一の符号を付し、重複する説明を省略する。 Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the description of the drawings, the same or equivalent elements are denoted by the same reference numerals, and redundant description is omitted.

図１は、本発明の一実施形態に係るデータ直列化装置の機能構成を示すブロック図である。図１に示すように、本実施形態に係るデータ直列化装置１は、木構造構築部１０１、重複部分木検出部１０２、辞書テーブル生成部１０３、重み決定部１０４、分割部１０５、スレッド割当部１０６、直列化部１０７、結合部１０８、及び出力部１０９を備える。 FIG. 1 is a block diagram showing a functional configuration of a data serialization apparatus according to an embodiment of the present invention. As shown in FIG. 1, the data serialization apparatus 1 according to the present embodiment includes a tree structure construction unit 101, an overlapping subtree detection unit 102, a dictionary table generation unit 103, a weight determination unit 104, a division unit 105, and a thread allocation unit. 106, a serialization unit 107, a coupling unit 108, and an output unit 109.

図２は、データ直列化装置１のハードウェア構成を示す図である。図２に示すように、データ直列化装置１は、オペレーティングシステム及びアプリケーションプログラム等を実行するＣＰＵ１１と、ＲＯＭ及びＲＡＭで構成される主記憶部１２と、ハードディスクメモリ等で構成される補助記憶部１３と、データ通信を行う通信制御部１４と、液晶モニタ等で構成される出力部１５と、入力デバイスであるキーボード及びマウス等で構成される操作部１６と、ＣＤ−ＲＯＭ及びＤＶＤ等の記録媒体１８を読み取る記録媒体読取部１７とを備える。 FIG. 2 is a diagram illustrating a hardware configuration of the data serialization device 1. As shown in FIG. 2, the data serialization apparatus 1 includes a CPU 11 that executes an operating system, application programs, and the like, a main storage unit 12 that includes a ROM and a RAM, and an auxiliary storage unit 13 that includes a hard disk memory and the like. A communication control unit 14 that performs data communication, an output unit 15 including a liquid crystal monitor, an operation unit 16 including a keyboard and a mouse as input devices, and a recording medium such as a CD-ROM and a DVD And a recording medium reading unit 17 that reads 18.

図１に示すデータ直列化装置１の各機能は、ＣＰＵ１１の制御の下で、主記憶部１２に所定のソフトウェアプログラムを読み込ませて実行することにより実現される。その際、ＣＰＵ１１は、ソフトウェアプログラムの処理手順に従い、主記憶部１２及び補助記憶部１３におけるデータの読み出し及び書き込み動作を制御し、操作部１６、出力部１５及び通信制御部１４の動作を制御する。 Each function of the data serialization apparatus 1 shown in FIG. 1 is realized by reading and executing a predetermined software program in the main storage unit 12 under the control of the CPU 11. At that time, the CPU 11 controls the operations of the operation unit 16, the output unit 15, and the communication control unit 14 by controlling the data reading and writing operations in the main storage unit 12 and the auxiliary storage unit 13 according to the processing procedure of the software program. .

図１に戻り、データ直列化装置１を構成する各機能要素について説明する。 Returning to FIG. 1, each functional element constituting the data serialization device 1 will be described.

木構造構築部１０１は、データ直列化装置１による直列化処理を実行する対象となる木構造データを構築する機能要素である。木構造構築部１０１は、例えばオブジェクト構造を有するデータについて、各オブジェクトを木構造における各ノードに対応させるようにして、図３に示すような木構造データを構築する。 The tree structure construction unit 101 is a functional element that constructs tree structure data that is a target for executing serialization processing by the data serialization device 1. For example, for data having an object structure, the tree structure construction unit 101 constructs the tree structure data as shown in FIG. 3 so that each object corresponds to each node in the tree structure.

図３に示すように、木構造データは、値を保持する葉ノード（「１２３」、「４５６」、「９．８７」、「６．５４」、「ｈｏｇｅ」、「ｆｏｏ」）と、葉ノード又は他の中間ノードを子ノードとして保持する中間ノード（「中間」）と、親ノードを持たない唯一の中間ノードである根ノード（「根」）と、各ノード間を結ぶエッジとからなるデータ構造である。木構造データの実装方法としては、各ノードに子ノードへのポインタを保持させる方法、及び各ノードに親ノードへのポインタを保持させる方法等のいくつかの方法があるが、実装方法はこれらの方法のいずれかに限定されない。 As shown in FIG. 3, the tree structure data includes leaf nodes (“123”, “456”, “9.87”, “6.54”, “hoge”, “foo”) that hold values, It consists of an intermediate node that holds a node or other intermediate node as a child node (“intermediate”), a root node that is the only intermediate node that does not have a parent node (“root”), and an edge that connects each node It is a data structure. There are several methods for implementing tree-structured data, such as a method in which each node holds a pointer to a child node and a method in which each node holds a pointer to a parent node. It is not limited to any of the methods.

本明細書中において、「部分木」とは、木構造データにおける任意のノード以下の要素（ノード、エッジ）のみによって構成される木を意味する。すなわち、部分木とは、当該任意のノードを根ノードとする木を意味する。また、同一の木構造データにおいて複数回出現し、部分木の根ノード配下のノードの位置、型、及び値が互いに完全一致する部分木（等価な部分木）を「重複部分木」（「共通部分木」ともいう）という。 In the present specification, the “subtree” means a tree composed only of elements (nodes and edges) below an arbitrary node in the tree structure data. That is, the partial tree means a tree having the arbitrary node as a root node. In addition, a subtree (equivalent subtree) that appears multiple times in the same tree structure data and whose positions, types, and values of nodes under the root node of the subtree completely match each other is referred to as an “overlapping subtree” Is also called).

図３において、各ノードの左上に記載した数字は、親ノード優先かつ前方ノード優先かつ深さ優先といった条件に則って決定される各ノードの直列化順序の一例を示す。図３に示すように、この直列化順序は、任意の部分木について、親ノードが一番先となり、子ノード同士では前方（左側）に位置するノードの方が先となるような順序である。また、各ノードの右上に記載した数字は、当該ノードを根ノードとする部分木に含まれるノード数を示す。 In FIG. 3, the numbers described at the upper left of each node show an example of the serialization order of each node determined in accordance with conditions such as parent node priority, forward node priority, and depth priority. As shown in FIG. 3, this serialization order is an order in which a parent node is first in an arbitrary subtree, and a node positioned forward (left side) is first among child nodes. . Further, the number described at the upper right of each node indicates the number of nodes included in the subtree having the node as a root node.

重複部分木検出部１０２は、木構造データに含まれる重複部分木を検出する機能要素である。より具体的には、重複部分木検出部１０２は、例えば以下のようにして重複部分木を検出する。すなわち、重複部分木検出部１０２は、木構造構築部１０１により構築された木構造データを、例えば深さ優先探索等の予め定めた巡回順で巡回し、巡回先の各中間ノードにおいて、当該中間ノードを根ノードとする部分木の構成情報（当該部分木に含まれるノードの位置、型、及び値を示す情報）を取得する。重複部分木検出部１０２は、各中間ノードにおいて取得した部分木の構成情報同士を比較することにより、互いに構成情報が一致する部分木同士を重複部分木として検出する。ただし、上記方法は一例であり、重複部分木検出部１０２による重複部分木の検出方法は、特に限定されない。例えば、重複部分木検出部１０２は、特願２０１２−１６４０００号に記載されている方法等を用いて重複部分木を検出してもよい。 The overlapping subtree detection unit 102 is a functional element that detects an overlapping subtree included in the tree structure data. More specifically, the overlapping subtree detection unit 102 detects an overlapping subtree as follows, for example. In other words, the overlapping subtree detection unit 102 circulates the tree structure data constructed by the tree structure construction unit 101 in a predetermined circulation order such as a depth-first search, and the intermediate node at the circulation destination intermediate node Obtain subtree configuration information (information indicating the position, type, and value of a node included in the subtree) with the node as a root node. The overlapping subtree detection unit 102 detects subtrees having the same configuration information as overlapping subtrees by comparing the subtree configuration information acquired at each intermediate node. However, the above method is an example, and the method of detecting the overlapping subtree by the overlapping subtree detection unit 102 is not particularly limited. For example, the overlapping subtree detection unit 102 may detect an overlapping subtree using a method described in Japanese Patent Application No. 2012-164000.

図３の例では、直列化順序が２のノード（以下、「直列化順序がＸのノード」を単に「ノードＸ」と表記する）を根ノードとする部分木とノード１８を根ノードとする部分木とは、根ノード配下のノードの位置、型、及び値が互いに完全一致しており、互いに等価な重複部分木を成している。同様に、ノード８を根ノードとする部分木とノード１１を根ノードとする部分木とは、互いに等価な重複部分木を成している。重複部分木検出部１０２は、木構造データに含まれる互いに等価な重複部分木の組を検出する。 In the example of FIG. 3, a subtree whose root node is a node whose serialization order is 2 (hereinafter, “node whose serialization order is X” is simply referred to as “node X”) and node 18 is a root node. In the subtree, the positions, types, and values of the nodes under the root node are completely coincident with each other, forming an equivalent subtree. Similarly, the subtree having the node 8 as the root node and the subtree having the node 11 as the root node form an overlapping subtree equivalent to each other. The overlapping subtree detection unit 102 detects a pair of overlapping subtrees equivalent to each other included in the tree structure data.

辞書テーブル生成部１０３は、重複部分木検出部１０２により検出された重複部分木の各組について、当該組に関連付けられる重複部分木に対応するデータに対して、当該データよりもデータ長が短い符号が対応付けられた辞書テーブルを生成する機能要素である。図３の例において、辞書テーブル生成部１０３は、例えば、ノード２を根ノードとする部分木に対応するデータに対して符号Ｔ１が対応付けられ、ノード８を根ノードとする部分木に対応するデータに対して符号Ｔ２が対応付けられた辞書テーブルを生成する。 For each set of overlapping subtrees detected by the overlapping subtree detection unit 102, the dictionary table generation unit 103 is a code having a data length shorter than the data for the data corresponding to the overlapping subtree associated with the set. Is a functional element that generates a dictionary table associated with. In the example of FIG. 3, for example, the dictionary table generation unit 103 associates the code T1 with data corresponding to the subtree having the node 2 as the root node, and corresponds to the subtree having the node 8 as the root node. A dictionary table in which the code T2 is associated with the data is generated.

重複部分木検出部１０２による重複部分木の検出結果は、直列化部１０７が、木構造データを直列化する際に重複部分木単位の冗長性を利用した連長圧縮（ＲＬＥ：Ｒｕｎ−Ｌｅｎｇｈｔ−Ｅｎｃｏｄｉｎｇ）を行う場合に、分割部１０５及び直列化部１０７により参照される。また、辞書テーブル生成部１０３により生成された辞書テーブルは、直列化部１０７が、木構造データを直列化する際に重複部分木単位の冗長性を利用した辞書式圧縮を行う場合に、分割部１０５及び直列化部１０７により参照される。辞書式圧縮とは、重複部分木に対応するデータを辞書テーブルにおいて当該データに対応付けられた符号に置換することでデータサイズを縮小する圧縮処理である。連長圧縮とは、同一の根ノードを有し且つ連続して出現する互いに等価な重複部分木に対応するデータを、その連続長（繰り返し回数）により表現することでデータサイズを縮小する圧縮処理である。 The detection result of the overlapping subtree by the overlapping subtree detection unit 102 is obtained as a result of continuous length compression (RLE: Run-Length-) using redundancy in units of overlapping subtrees when the serializing unit 107 serializes the tree structure data. When performing (Encoding), it is referred to by the dividing unit 105 and the serializing unit 107. The dictionary table generated by the dictionary table generation unit 103 is divided into parts when the serialization unit 107 performs lexicographic compression using redundancy in units of overlapping subtrees when serializing tree structure data. 105 and serializer 107. The lexicographic compression is a compression process for reducing the data size by replacing data corresponding to the overlapping subtree with a code associated with the data in the dictionary table. Continuous length compression is a compression process that reduces the data size by expressing the data corresponding to overlapping subtrees that have the same root node and appear in succession by their continuous length (number of repetitions). It is.

重み決定部１０４は、木構造データにおける各ノードを根ノードとする部分木毎に、当該部分木の大きさに基づく重みを決定する重み決定手段である。重み決定部１０４は、各ノードを根ノードとする部分木に対して決定した重みを、当該ノードについての付加情報として当該ノードに関連付ける。各ノードに関連付けられた重みは、分割部１０５が木構造データを分割する際に参照される。 The weight determination unit 104 is a weight determination unit that determines a weight based on the size of the subtree for each subtree whose root node is each node in the tree structure data. The weight determination unit 104 associates the weight determined for the subtree having each node as the root node with the node as additional information about the node. The weight associated with each node is referred to when the dividing unit 105 divides the tree structure data.

重み決定部１０４による重みの決定方法は特に限定されないが、重み決定部１０４は、例えば部分木に含まれるノード数を当該部分木の重みとして決定する。この際、重み決定部１０４は、ノードの型に関わらずノード一つ分を「１」として部分木の重みを決定してもよいし、例えばノードの型に応じてノード毎の重み付けを変えて部分木の重みを決定してもよい。例えば、重み決定部１０４は、４バイト長の整数型の葉ノードには「４」、８バイト長の実数型の葉ノードには「８」、文字列型の葉ノードには文字列長に比例する大きさ、上記の型のようなデータ（整数値、実数値、文字列）を格納しない中間ノードには「１」といった重み付けを行った上で、部分木の重みを決定してもよい。 Although the weight determination method by the weight determination unit 104 is not particularly limited, the weight determination unit 104 determines, for example, the number of nodes included in the subtree as the weight of the subtree. At this time, the weight determining unit 104 may determine the weight of the subtree by setting one node to “1” regardless of the type of the node. For example, the weight for each node may be changed according to the type of the node. The weight of the subtree may be determined. For example, the weight determination unit 104 sets “4” for an integer type leaf node having a length of 4 bytes, “8” for a real type leaf node having an length of 8 bytes, and a string length for a leaf node of a character string type. The weight of the sub-tree may be determined after performing weighting such as “1” for intermediate nodes that do not store data of a proportional size or the above type (integer value, real value, character string). .

例えば図３のノード４を根ノードとする部分木について、重み決定部１０４が前者の方法で重みを決定する場合には、当該部分木の重みは、「３」となる。一方、重み決定部１０４が後者の方法で当該部分木の重みを決定する場合には、当該部分木の重みは、当該部分木に含まれる各ノード（「中間」、「ｆｏｏ」、「１２３」）の重み付け（１，３，４）を足し合わせた「８」となる。本実施形態では、前者の方法で部分木の重みを決定するものとして説明する。したがって、図３の各ノードの右上の数字は、前者の方法で決定された重みを示している。 For example, for the subtree having the node 4 in FIG. 3 as the root node, when the weight determination unit 104 determines the weight by the former method, the weight of the subtree is “3”. On the other hand, when the weight determination unit 104 determines the weight of the subtree by the latter method, the weight of the subtree is determined by each node (“intermediate”, “foo”, “123”) included in the subtree. ) Weighting (1, 3, 4) is added to “8”. In the present embodiment, description will be made assuming that the weight of the subtree is determined by the former method. Therefore, the numbers in the upper right of each node in FIG. 3 indicate the weights determined by the former method.

分割部１０５は、重み決定部１０４により決定された各部分木の重みに基づいて、木構造データを、それぞれ直列化順序が連続する一以上のノードからなる複数のノード列に分割する分割手段である。分割部１０５には、例えば分割数や各ノード列の重み（各ノード列に含まれる部分木の重み）の配分方法等を規定した分割戦略を示す情報が予め読み込まれている。 The dividing unit 105 is a dividing unit that divides the tree structure data into a plurality of node sequences each including one or more nodes in which the serialization order is continuous based on the weight of each subtree determined by the weight determining unit 104. is there. For example, information indicating a division strategy that prescribes, for example, the number of divisions, the distribution method of the weight of each node sequence (the weight of the subtree included in each node sequence), and the like is read in advance.

分割部１０５は、予め読み込まれている分割戦略を示す情報に基づいて木構造データを分割する。分割戦略を示す情報が分割部１０５に読み込まれる方法は何でもよい。例えば、分割戦略を示す情報は、分割部１０５による分割処理を実行するプログラムにおいてハードコーディングされてもよいし、分割処理実行時にオペレータ等に分割部１０５に入力させてもよいし、予め分割戦略を示す情報が指定された設定ファイルを分割部１０５に読み込ませてもよい。 The dividing unit 105 divides the tree structure data based on information indicating a division strategy read in advance. Any method may be used in which information indicating the division strategy is read by the division unit 105. For example, the information indicating the division strategy may be hard-coded in a program that executes the division process by the division unit 105, or may be input to the division unit 105 by an operator or the like when the division process is executed, A setting file in which information to be indicated is designated may be read by the dividing unit 105.

分割部１０５に読み込まれる分割戦略は特に限定されないが、例えば、分割部１０５は、分割数を「３」とし、各ノード列の重みがノード列間で均等化されるように分割するといった分割戦略に基づいて、木構造データを分割することができる。図３の例では、木構造データの総重み（根ノード１に関連付けられた重み）が２８であるため、分割部１０５は、例えば各ノード列の重みが「１０」、「９」、「９」となるように木構造データを分割する。すなわち、分割部１０５は、図５に示す分割境界Ｄ１，Ｄ２で木構造データを分割することにより、木構造データをノード１〜１０からなる重みが１０のノード列（以下「ノード列１」と表記）と、ノード１１〜１９からなる重みが９のノード列（以下「ノード列２」と表記）と、ノード２０〜２８からなる重みが９のノード列（以下「ノード列３」と表記）とに分割する。 The division strategy read by the dividing unit 105 is not particularly limited. For example, the dividing unit 105 sets the number of divisions to “3” and divides the node sequences so that the weights of the node sequences are equalized between the node sequences. Based on the above, the tree structure data can be divided. In the example of FIG. 3, since the total weight of the tree structure data (weight associated with the root node 1) is 28, the dividing unit 105 has, for example, a weight of “10”, “9”, “9” for each node sequence. The tree structure data is divided so that That is, the dividing unit 105 divides the tree structure data at the dividing boundaries D1 and D2 shown in FIG. Notation), a node string having a weight of 9 consisting of nodes 11 to 19 (hereinafter referred to as “node string 2”), and a node string having a weight of 9 consisting of nodes 20 to 28 (hereinafter referred to as “node string 3”). And split into

スレッド割当部１０６は、分割部１０５により分割されて生成された複数のノード列のそれぞれに対して１つのスレッドを割り当てるスレッド割当手段である。図５に示す分割境界Ｄ１，Ｄ２で分割部１０５によって木構造データが分割される場合には、スレッド割当部１０６は、ノード列１に対してスレッド１、ノード列２に対してスレッド２、ノード列３に対してスレッド３といったように各ノード列に対して異なるスレッドを割り当てる。スレッドとは、データ直列化装置１のＣＰＵ１１の制御の下で、それぞれ独立して並列的に処理を実行することが可能な実行単位をいう。各スレッドは、スレッド毎の処理結果を一時的に記憶するための記憶領域であるバッファを備える。 The thread allocation unit 106 is a thread allocation unit that allocates one thread to each of a plurality of node sequences generated by the division unit 105. When the tree structure data is divided by the dividing unit 105 at the dividing boundaries D1 and D2 illustrated in FIG. 5, the thread allocating unit 106 has the thread 1 for the node string 1, the thread 2 for the node string 2, and the node A different thread is assigned to each node column, such as thread 3 for column 3. A thread refers to an execution unit that can execute processing independently and in parallel under the control of the CPU 11 of the data serialization device 1. Each thread includes a buffer that is a storage area for temporarily storing a processing result for each thread.

直列化部１０７は、スレッド割当部１０６により複数のノード列のそれぞれに対して割り当てられたスレッド毎に並列して、ノード列に含まれる各ノードに対応するデータを直列化順序でバイト列に直列化する直列化手段である。直列化部１０７は、各スレッドにおける直列化処理において、例えばノード列に含まれる各ノードに対応するデータを型（Ｔｙｐｅ）、長さ（Ｌｅｎｇｔｈ）、値（Ｖａｌｕｅ）の３つのフィールドからなるＴＬＶ形式で直列化する。 The serialization unit 107 serializes the data corresponding to each node included in the node sequence in a serialized order into the byte sequence in parallel for each thread allocated to each of the plurality of node sequences by the thread allocation unit 106. It is the serialization means to make. In the serialization process in each thread, for example, the serialization unit 107 converts the data corresponding to each node included in the node string into a TLV format composed of three fields of type (Type), length (Length), and value (Value). Serialize with.

まず、重複部分木単位の辞書式圧縮及び連長圧縮を実行しない場合における、直列化部１０７による直列化について説明する。図４は、図３に示す木構造データを単一スレッドによりＴＬＶ形式で直列化した直列化結果の例を示す図である。図４の各ボックスの上に記載されている数字は、当該数字で示される直列化順序のノードに対応するデータの直列化結果の先頭位置を示している。また、各ボックス内の値は、ＴＬＶ形式における各フィールド値（型、長さ、値）を示している。 First, serialization by the serialization unit 107 when lexicographic compression and continuous length compression in units of overlapping subtrees are not executed will be described. FIG. 4 is a diagram illustrating an example of a serialization result obtained by serializing the tree structure data illustrated in FIG. 3 in a TLV format using a single thread. The numbers described above the boxes in FIG. 4 indicate the start positions of the serialization results of the data corresponding to the nodes in the serialization order indicated by the numbers. In addition, the value in each box indicates each field value (type, length, value) in the TLV format.

図４に示す直列化結果の例では、各ノードの型として、中間ノードを示す「Ｃ」、整数型を示す「Ｉ」、実数型を示す「Ｄ」、文字列型を示す「Ｓ」のいずれかが格納される。また、各ノードの長さとして、中間ノードでは「直接の子ノードの数」、文字列型では「文字列長」が格納され、４バイト固定長の整数型及び８バイト固定長の実数型ではノードの型とノードの長さとが一意に対応付けられるため省略される。また、各ノードの値としては、中間ノード以外の型では、それぞれの型に対応する値（整数値、実数値、文字列）が格納され、中間ノードでは省略される。 In the example of the serialization result shown in FIG. 4, the type of each node is “C” indicating an intermediate node, “I” indicating an integer type, “D” indicating a real number type, and “S” indicating a character string type. Either one is stored. In addition, as the length of each node, “the number of direct child nodes” is stored in the intermediate node, “character string length” is stored in the character string type, and the integer type of 4 bytes fixed length and the real number type of 8 bytes fixed length are stored. Since the node type and the node length are uniquely associated, they are omitted. In addition, as values of each node, values (integer value, real value, character string) corresponding to each type are stored in types other than the intermediate node, and are omitted in the intermediate node.

例えば、ノード１（「根」）は、中間ノードであり、直接の子ノードの数は４であるため、当該ノードに対応する直列化結果は「Ｃ，４」となる。また、ノード３（「ｈｏｇｅ」）は、文字列型であり、文字列長は４であるため、当該ノードに対応する直列化結果は「Ｓ，４，ｈｏｇｅ」となる。なお、直列化部１０７がノードの型、長さ、値等を示す情報を上記のように並べてＴＬＶ形式に直列化する処理自体は、従来周知の任意の手法により実行される。 For example, since node 1 (“root”) is an intermediate node and the number of direct child nodes is 4, the serialization result corresponding to the node is “C, 4”. Further, since the node 3 (“hoge”) is of a character string type and has a character string length of 4, the serialization result corresponding to the node is “S, 4, hoge”. Note that the processing itself in which the serialization unit 107 arranges information indicating the node type, length, value, and the like as described above and serializes the information in the TLV format is performed by any conventionally known method.

直列化部１０７がスレッド割当部１０６によって割り当てられたスレッド１〜３による並列処理によって、スレッド毎に取得した直列化結果を、図６に示す。このようなスレッド毎の直列化結果は、各スレッドが備えるバッファに一時的に記憶される。 The serialization result acquired for each thread by the parallel processing by the threads 1 to 3 assigned by the thread assignment unit 106 by the serialization unit 107 is shown in FIG. Such a serialization result for each thread is temporarily stored in a buffer included in each thread.

結合部１０８は、直列化部１０７によって得られたスレッド毎の直列化結果を直列化順序で結合する結合手段である。結合部１０８は、図６に示すようなスレッド毎の直列化結果を直列化順序（この場合には、スレッド１の直列化結果→スレッド２の直列化結果→スレッド３の直列化結果の順序）で結合することにより、図４に示すような分割前の木構造データに対応する一つの直列化結果を得る。また、結合部１０８は、用途に応じて様々な種類の結合処理を実行することができる。例えば、結合部１０８は、直列化結果を単に結合するのではなく、スレッド毎の直列化結果を直列化順序で一つずつ入力してＺＩＰ圧縮等のデータ圧縮処理（所定のデータ圧縮処理）を実行すると共に結合してもよい。このように具体的に実行する結合処理を示す情報は、結合部１０８による結合処理を実行するプログラムにおいてハードコーディングされてもよいし、結合処理実行時にオペレータ等によって結合部１０８に入力させてもよいし、予め結合処理を示す情報が指定された設定ファイルを結合部１０８に読み込ませてもよい。結合部１０８がＺＩＰ圧縮等のデータ圧縮処理を実行しつつ直列化結果を結合する場合の詳細については後述する。 The combining unit 108 is a combining unit that combines the serialized results for each thread obtained by the serializing unit 107 in the serialization order. The coupling unit 108 serializes the serialized results for each thread as shown in FIG. 6 (in this case, the serialized result of the thread 1 → the serialized result of the thread 2 → the order of the serialized result of the thread 3). In this way, one serialization result corresponding to the tree structure data before division as shown in FIG. 4 is obtained. Further, the combining unit 108 can execute various types of combining processes depending on the application. For example, the combining unit 108 does not simply combine the serialization results, but inputs the serialization results for each thread one by one in the serialization order, and performs data compression processing (predetermined data compression processing) such as ZIP compression. It may be combined with execution. Information specifically indicating the combining process to be executed in this way may be hard-coded in a program that executes the combining process by the combining unit 108, or may be input to the combining unit 108 by an operator or the like when the combining process is executed. Then, a setting file in which information indicating the combining process is designated in advance may be read by the combining unit 108. Details of the case where the combining unit 108 combines the serialized results while executing data compression processing such as ZIP compression will be described later.

出力部１０９は、結合部１０８により結合された直列化結果を出力する出力手段である。出力部１０９は、例えば、通信ネットワークを介して直列化結果を利用する所定のサーバ装置（不図示）等に出力したり、直列化結果を一定期間保存しておくためのストレージ（不図示）に出力したりといった出力処理を行う。 The output unit 109 is an output unit that outputs the serialization result combined by the combining unit 108. The output unit 109 outputs, for example, a predetermined server device (not shown) that uses the serialization result via a communication network, or a storage (not shown) for storing the serialization result for a certain period. Output processing such as output.

以上のように、データ直列化装置１によれば、重み決定部１０４によって決定された木構造データの各部分木の大きさ（本実施形態では「部分木に含まれるノード数」）に基づく重みに基づいて、分割部１０５によって分割境界Ｄ１，Ｄ２が決定される。続いて、当該分割境界Ｄ１，Ｄ２で分割されたノード列１〜３に対して、スレッド割当部１０６によってそれぞれ異なるスレッド１〜３が割り当てられる。これにより、直列化部１０７によって、ノード列毎の直列化処理が、各スレッドにおいて並列的に実行される。最終的に、スレッド毎に得られた直列化結果が結合部１０８によって結合され、結合された直列化結果が出力部１０９によって出力される。このように木構造データの直列化処理を複数スレッドで並列的に処理することにより、木構造データの直列化処理を高速化することができる。 As described above, according to the data serialization apparatus 1, the weight based on the size of each subtree of the tree structure data determined by the weight determination unit 104 (“the number of nodes included in the subtree” in the present embodiment). Based on the above, the dividing unit 105 determines the dividing boundaries D1 and D2. Subsequently, different threads 1 to 3 are assigned by the thread assignment unit 106 to the node rows 1 to 3 divided at the division boundaries D1 and D2. Thereby, the serialization unit 107 executes serialization processing for each node sequence in parallel in each thread. Finally, the serialization results obtained for each thread are combined by the combining unit 108, and the combined serialization result is output by the output unit 109. In this way, the tree structure data serialization process is processed in parallel by a plurality of threads, so that the tree structure data serialization process can be speeded up.

また、分割部１０５によって各スレッドに割り当てられるノード列毎の重みを均等化するように木構造データを分割することで、各スレッドに割り当てられる処理量（ノード数）を均等化できる。これにより、特定のスレッドに処理負荷が偏ってしまうことを回避し、全スレッドの処理が完了するまでの時間を短縮することができる。 Further, by dividing the tree structure data so that the weights for each node sequence assigned to each thread by the dividing unit 105 are equalized, the processing amount (number of nodes) assigned to each thread can be equalized. As a result, it is possible to avoid the processing load being biased toward a specific thread, and to shorten the time until the processing of all threads is completed.

次に、直列化部１０７が重複部分木単位に辞書式圧縮及び連長圧縮を実行した上で直列化を実行する場合について説明する。 Next, the case where the serialization unit 107 executes serialization after performing lexicographic compression and continuous length compression for each overlapping subtree will be described.

この場合、直列化部１０７は、直列化対象の各ノードを根ノードとする部分木について、例えば辞書テーブル生成部１０３によって生成された辞書テーブルを参照することで、当該部分木に対して辞書式圧縮を実行することができるか否かを判定する。また、直列化部１０７は、直列化対象の各ノードを根ノードとする部分木について、例えば当該部分木の構成を当該部分木と共通の親ノードを有し且つ当該部分木の直前に位置する部分木の構成と比較することによって、当該部分木に対して連長圧縮を実行することができるか否かを判定する。そして、直列化部１０７は、これらの圧縮処理を実行することが可能と判定した場合には、該当する部分木に対して当該圧縮処理を実行した上で直列化を実行する。 In this case, the serialization unit 107 refers to, for example, a dictionary table generated by the dictionary table generation unit 103 for a subtree having each node to be serialized as a root node, and lexicographically with respect to the subtree. It is determined whether or not compression can be performed. Also, the serialization unit 107 has, for example, a subtree whose root node is each node to be serialized, and the configuration of the subtree has a common parent node with the subtree and is positioned immediately before the subtree. By comparing with the configuration of the subtree, it is determined whether continuous length compression can be performed on the subtree. If the serialization unit 107 determines that these compression processes can be performed, the serialization unit 107 performs the serialization after executing the compression process on the corresponding subtree.

より具体的には、直列化部１０７は、ＴＬＶ形式への直列化において、辞書式圧縮により符号に置き換える対象となるデータ内容を示す辞書項目、辞書式圧縮対象の部分木に対応するデータ、及び連長圧縮対象の部分木に対応するデータに対して、特別な型を用いることにより圧縮処理を実現する。 More specifically, in the serialization to the TLV format, the serialization unit 107 includes a dictionary item indicating data contents to be replaced with codes by lexicographic compression, data corresponding to a subtree to be lexicographically compressed, and Compression processing is realized by using a special type for the data corresponding to the subtree to be subjected to continuous length compression.

また、本実施形態では、辞書式圧縮を用いた直列化の一例として、互いに等価な重複部分木の組において最初に出現する部分木に対応するデータ（図３の例では、ノード２又はノード８を根ノードとする部分木に対応するデータ）については、符号に置換せずに、辞書項目に対応する符号の符号長を示す符号長マーカを付与した上で辞書項目として直列化結果に含める方式を採用する。この方式では、復号側で符号長マーカと当該符号長マーカに続く辞書項目に対応するデータを抽出することで辞書テーブルを作成（復元）することができる。これにより、直列化結果を格納するデータのヘッダ部分等に辞書テーブルを含める必要がなくなり、その分直列化結果のデータサイズを縮小することができる（特願２０１２−１６２７００号参照）。 In this embodiment, as an example of serialization using lexicographic compression, data corresponding to a subtree that first appears in a set of overlapping subtrees equivalent to each other (in the example of FIG. 3, node 2 or node 8). For data that corresponds to a subtree having a root node as a root node), a code length marker indicating the code length of the code corresponding to the dictionary item is added to the serialization result without being replaced by the code and included in the serialization result Is adopted. In this system, a dictionary table can be created (restored) by extracting data corresponding to a code length marker and a dictionary item following the code length marker on the decoding side. Thereby, it is not necessary to include a dictionary table in the header portion of the data storing the serialization result, and the data size of the serialization result can be reduced accordingly (see Japanese Patent Application No. 2012-162700).

したがって、本実施形態では、辞書項目を示す型として符号長マーカ（「Ｍ」と表記）を用いる。また、辞書項目については、長さは省略し、当該重複部分木に含まれるノードをＴＬＶ形式で直列化したものを値とする。また、互いに等価な重複部分木の組において２回目以降に出現する重複部分木であって辞書式圧縮対象となる部分木に対応するデータについては、型として当該重複部分木に対応する符号を用い、長さと値は省略する。また、連長圧縮対象の部分木については、連長圧縮対象であることを示す「Ｒ」を型とし、直前の兄弟部分木の繰り返し回数を長さとし、値は省略する。 Therefore, in this embodiment, a code length marker (denoted as “M”) is used as a type indicating a dictionary item. For the dictionary item, the length is omitted, and a value obtained by serializing the nodes included in the overlapping subtree in the TLV format is used. In addition, for data corresponding to a subtree that is a lexicographic compression target that appears after the second time in a pair of overlapping subtrees equivalent to each other, a code corresponding to the overlapping subtree is used as a type. The length and value are omitted. For the subtree to be subjected to continuous length compression, “R” indicating that it is to be subjected to continuous length compression is used as a model, the number of repetitions of the immediately preceding sibling subtree is the length, and the value is omitted.

図７は、直列化部１０７が図３に示す木構造データを単一スレッドで辞書式圧縮及び連長圧縮を実行した上で直列化した場合における直列化結果を示す図である。図７に示すように、例えばノード２を根ノードとする部分木に対応する辞書項目については、型として「Ｍ」が用いられ、長さは省略され、当該部分木に含まれるノードを直列化したもの（Ｃ，２，Ｓ，４，ｈｏｇｅ，Ｃ，２，Ｓ，３，ｆｏｏ，Ｉ，１２３）が値となる。 FIG. 7 is a diagram showing a serialization result when the serialization unit 107 serializes the tree structure data shown in FIG. 3 after executing lexicographic compression and continuous length compression with a single thread. As shown in FIG. 7, for example, “M” is used as a type for a dictionary item corresponding to a subtree having node 2 as a root node, the length is omitted, and nodes included in the subtree are serialized. The value (C, 2, S, 4, hoge, C, 2, S, 3, foo, I, 123) is the value.

また、辞書式圧縮対象となる部分木（ノード１８を根ノードとする部分木及びノード２３を根ノードとする部分木）に含まれる全てのノードが、それぞれ対応する符号「Ｔ１」及び「Ｔ２」に置き換えられる。また、連長圧縮対象となる部分木（ノード１１を根ノードとする部分木）に含まれる全てのノードが、「Ｒ，１」に置き換えられる。以上のように、直列化部１０７は、辞書式圧縮及び連長圧縮を実行した上で木構造データを直列化することにより、辞書式圧縮及び連長圧縮を実行せずに木構造データを直列化した場合（図４参照）と比較して、直列化結果のサイズを短くすることができる。 Further, all the nodes included in the subtrees to be lexicographically compressed (the subtree having the node 18 as a root node and the subtree having the node 23 as a root node) are associated with the codes “T1” and “T2”, respectively. Is replaced by In addition, all nodes included in the subtree (subtree having the node 11 as a root node) to be subjected to continuous length compression are replaced with “R, 1”. As described above, the serialization unit 107 serializes tree structure data without executing lexicographic compression and continuous length compression by serializing the tree structure data after performing lexicographic compression and continuous length compression. Compared to the case (see FIG. 4), the size of the serialization result can be shortened.

ここで、辞書式圧縮及び連長圧縮を実行した上で直列化を実行する場合の失敗例として、分割部１０５が図５に示す分割境界Ｄ１，Ｄ２を分割境界として木構造データを分割し、スレッド割当部１０６が各ノード列に対してそれぞれ１つのスレッドを割り当て、直列化部１０７が当該スレッド毎に並列して、辞書式圧縮及び連長圧縮を実行した上で直列化を実行する場合について説明する。この場合におけるスレッド毎の直列化結果を図８に示す。図８に示すように、分割境界Ｄ１，Ｄ２を分割境界として分割された各ノード列に対する各スレッド（スレッド１〜３）における直列化処理が適切に実行されておらず、データに矛盾が生じていることがわかる。具体的には以下の２つの問題点が発生している。 Here, as an example of failure when serialization is performed after executing lexicographic compression and run length compression, the dividing unit 105 divides the tree structure data using the dividing boundaries D1 and D2 shown in FIG. A case where the thread allocation unit 106 allocates one thread to each node sequence, and the serialization unit 107 executes serialization after executing lexicographic compression and continuous length compression in parallel for each thread. explain. FIG. 8 shows the serialization result for each thread in this case. As shown in FIG. 8, the serialization processing in each thread (threads 1 to 3) for each node sequence divided using the division boundaries D1 and D2 as the division boundaries is not properly executed, resulting in data inconsistency. I understand that. Specifically, the following two problems have occurred.

１つ目の問題点は、連長圧縮対象となるノード１１を根ノードとする部分木について連長圧縮されていない点である。これは、同一の根ノードを有し且つ連続して出現する互いに等価な重複部分木同士（ノード８を根ノードとする部分木及びノード１１を根ノードとする部分木）が、互いに別のスレッド（スレッド１及びスレッド２）に割り当てられたためである。すなわち、スレッド２は、ノード１１を根ノードとする部分木がスレッド１に割り当てられたノード８を根ノードとする部分木と同一の構成であることを認識することができないため、連長圧縮が適切に実行されていない。 The first problem is that the partial length compression is not performed on the subtree whose root node is the node 11 to be subjected to the continuous length compression. This is because the mutually equivalent overlapping subtrees having the same root node and appearing in succession (the subtree having the node 8 as the root node and the subtree having the node 11 as the root node) are mutually different threads. This is because they are assigned to (thread 1 and thread 2). That is, the thread 2 cannot recognize that the subtree having the node 11 as the root node has the same configuration as the subtree having the node 8 assigned to the thread 1 as the root node. It is not running properly.

２つ目の問題点は、辞書式圧縮対象となるノード１８を根ノードとする部分木に含まれるノード１８〜２２について、ノード１８，１９がスレッド２に割り当てられ、ノード２０〜２２がスレッド３に割り当てられたことによりデータの矛盾が生じている点である。これについて、以下詳細に説明する。 The second problem is that nodes 18 and 19 are assigned to thread 2 and nodes 20 to 22 are assigned to thread 3 for nodes 18 to 22 included in a subtree whose root node is node 18 that is a lexicographic compression target. Data inconsistency is caused by being assigned to. This will be described in detail below.

重複部分木検出部１０２によって、ノード１８を根ノードとする部分木がノード２を根ノードとする部分木に対応する重複部分木であることが検出されている。したがって、スレッド２は、当該検出結果及び辞書テーブル生成部１０３によって生成された辞書テーブルに基づいて、ノード１８を根ノードとする部分木に含まれるノード１８，１９を、ノード２を根ノードとする部分木に対応する符号Ｔ１に置換する辞書式圧縮を実行する。ここで、符号Ｔ１は、ノード２を根ノードとする部分木に対応する符号であり、ノード１８〜２２の全てのノードに対応するデータを示すものである。一方、スレッド３は、ノード２０〜２２について辞書式圧縮処理を実行せずにノード毎に直列化する。したがって、スレッド２の直列化結果及びスレッド３の直列化結果の両方にノード２０〜２２に対応するデータを示す部分が含まれ、元のデータとの矛盾が生じている。 The overlapping subtree detection unit 102 detects that the subtree having the node 18 as a root node is an overlapping subtree corresponding to the subtree having the node 2 as a root node. Therefore, based on the detection result and the dictionary table generated by the dictionary table generation unit 103, the thread 2 uses the nodes 18 and 19 included in the subtree having the node 18 as a root node as the root node. Perform lexicographic compression to replace the code T1 corresponding to the subtree. Here, the code T1 is a code corresponding to a subtree having the node 2 as a root node, and indicates data corresponding to all the nodes 18 to 22. On the other hand, the thread 3 serializes the nodes 20 to 22 for each node without executing the lexicographic compression processing. Therefore, both the serialization result of the thread 2 and the serialization result of the thread 3 include a portion indicating data corresponding to the nodes 20 to 22, and there is a contradiction with the original data.

なお、直列化部１０７がスレッド間を跨いだノードのチェック処理等を実行するようにすれば上記の問題点を解消できる可能性があるが、そのようにした場合にはスレッド毎に独立した並列処理によって全体の処理効率を高めるというメリットが損なわれてしまうという問題がある。 Note that the above problem may be solved if the serializing unit 107 executes node check processing between threads, in which case independent parallel processing is performed for each thread. There is a problem in that the merit of improving the overall processing efficiency is impaired by the processing.

そこで、直列化部１０７が辞書式圧縮及び連長圧縮等の圧縮処理を実行した上で直列化を実行する場合において上述の２つの問題点を回避するために、データ直列化装置１は、以下の処理を実行する。すなわち、分割部１０５は、重複部分木に含まれるノードと直列化順序における当該ノードの直前のノードとを分割しないように、木構造データを分割する。ただし、辞書式圧縮対象の部分木の根ノードと直列化順序における当該根ノードの直前のノードとを分割しても上述のような問題点は生じないため、分割部１０５は、このように分割することについては許容してもよい。 Therefore, in order to avoid the above-described two problems when the serialization unit 107 performs serialization after performing compression processing such as lexicographic compression and run length compression, the data serialization device 1 includes the following: Execute the process. That is, the dividing unit 105 divides the tree structure data so as not to divide the node included in the overlapping subtree and the node immediately before the node in the serialization order. However, since the above-mentioned problem does not occur even if the root node of the subtree of the lexicographic compression target and the node immediately before the root node in the serialization order are divided, the dividing unit 105 performs the division in this way. May be acceptable.

図９に、分割部１０５が上述のように木構造データを分割する場合における分割境界の例（分割境界Ｄ３，Ｄ４）を示す。分割部１０５は、図９に示す分割境界Ｄ３，Ｄ４で木構造データを分割することにより、木構造データをノード１〜１３からなるノード列（以下「ノード列４」と表記）と、ノード１４〜２２からなるノード列（以下「ノード列５」と表記）と、ノード２３〜２８からなるノード列（以下「ノード列６」と表記）とに分割する。なお、分割部１０５が分割境界Ｄ３，Ｄ４を決定するための具体的な処理については、後述するデータ直列化装置１のフローを説明する際に併せて説明する。 FIG. 9 shows examples of division boundaries (division boundaries D3 and D4) when the division unit 105 divides the tree structure data as described above. The dividing unit 105 divides the tree structure data at the division boundaries D3 and D4 shown in FIG. 9, thereby dividing the tree structure data into a node string composed of nodes 1 to 13 (hereinafter referred to as “node string 4”) and a node 14. To 22 (hereinafter referred to as “node string 5”) and a node string consisting of nodes 23 to 28 (hereinafter referred to as “node string 6”). Note that specific processing for the division unit 105 to determine the division boundaries D3 and D4 will be described together with the description of the flow of the data serialization device 1 described later.

図１０は、木構造データを図９に示す分割境界Ｄ３，Ｄ４で分割したスレッド毎に辞書式圧縮及び連長圧縮した上で直列化した場合における直列化結果を示す図である。図１０に示すように、ノード列４に割り当てられたスレッド１において、ノード１１を根ノードとする部分木について連長圧縮されている。また、ノード列５が割り当てられたスレッド２及びノード列６が割り当てられたスレッド３において、ノードの重複はなく、データに矛盾は生じていない。すなわち、上述した２つの問題点は発生していない。 FIG. 10 is a diagram illustrating a serialization result when the tree structure data is serialized after lexicographic compression and continuous length compression for each of the threads divided by the division boundaries D3 and D4 shown in FIG. As shown in FIG. 10, in the thread 1 assigned to the node string 4, the subtree having the node 11 as the root node is subjected to continuous length compression. Further, in the thread 2 to which the node string 5 is assigned and the thread 3 to which the node string 6 is assigned, there is no duplication of nodes and no contradiction occurs in the data. That is, the above two problems do not occur.

以上のように、データ直列化装置１によれば、分割部１０５が重複部分木に含まれるノードと直列化順序における当該ノードの直前のノードとを分割しないように木構造データを分割することにより、上述した２つの問題を回避することができる。また、分割部１０５は、辞書式圧縮が実行される対象となる部分木の根ノードの直前位置での分割については許容することで、木構造データの分割位置をより柔軟に決定することができる。 As described above, according to the data serialization apparatus 1, the dividing unit 105 divides the tree structure data so as not to divide the node included in the overlapping subtree and the node immediately before the node in the serialization order. The two problems described above can be avoided. In addition, the dividing unit 105 can determine the division position of the tree structure data more flexibly by allowing the division at the position immediately before the root node of the sub-tree to be subjected to lexicographic compression.

ただし、図１０においては、スレッド間において直列化の処理対象とされたノードの数が均等化されていない。すなわち、スレッド１では、１１個のノードに対する直列化処理がされているのに対し、スレッド２及び３では、それぞれ５個及び４個のノードに対する直列化処理しかされておらず、各スレッドが処理するノード数について不均衡が生じている。スレッド毎の並列処理によって全体の処理をより高速に完了するには、全てのスレッドにほぼ均等な数のノードが割り当てられることがより好ましい。 However, in FIG. 10, the number of nodes to be serialized is not equalized between threads. That is, while thread 1 is serialized for 11 nodes, threads 2 and 3 are only serialized for 5 and 4 nodes, respectively. There is an imbalance in the number of nodes In order to complete the entire processing at a higher speed by parallel processing for each thread, it is more preferable that an almost equal number of nodes are allocated to all the threads.

本実施形態において、直列化部１０７による直列化処理において併せて重複部分木に対する辞書式圧縮及び連長圧縮が実行される場合には、互いに等価な重複部分木のうち直列化順序において２回目以降に出現する重複部分木が、当該重複部分木単位で所定の符号及び記号等に置き換えられる。すなわち、本実施形態では、互いに等価な重複部分木のうち直列化順序において２回目以降に出現する重複部分木が、圧縮処理が実行される対象となる重複部分木（以下「圧縮対象重複部分木」と表記する）に相当する。ここで、圧縮対象重複部分木に対する直列化処理は、実質的に当該圧縮対象重複部分木の根ノードに対してのみ行われることなるため、圧縮処理がされない場合と比較して、処理量が減少することが期待される。 In the present embodiment, when lexicographic compression and continuous length compression are performed on overlapping subtrees in the serialization processing by the serialization unit 107, the second and subsequent times in the serialization order among mutually equivalent subtrees. The overlapping subtrees appearing in are replaced with predetermined codes and symbols or the like in units of the overlapping subtrees. That is, in the present embodiment, among the overlapping subtrees equivalent to each other, the overlapping subtree that appears in the second or later in the serialization order is an overlapping subtree to be subjected to compression processing (hereinafter referred to as “compression target overlapping subtree”). "). Here, since the serialization process for the compression target overlapping subtree is substantially performed only for the root node of the compression target overlapping subtree, the processing amount is reduced as compared with the case where the compression process is not performed. There is expected.

そこで、重み決定部１０４は、圧縮対象重複部分木の重みを、圧縮処理が実行されない場合よりも小さい値に決定してもよい。重み決定部１０４は、例えば重複部分木検出部１０２の検出結果を参照することにより、圧縮対象重複部分木を特定することができる。 Therefore, the weight determination unit 104 may determine the weight of the compression target overlapping subtree to a value smaller than that when the compression process is not executed. The weight determination unit 104 can specify the compression target overlapping subtree by referring to the detection result of the overlapping subtree detection unit 102, for example.

重み決定部１０４による圧縮対象重複部分木に対する重み決定方法の最も簡単な方法の一つは、重みを「１」に決定することである。図１１に、このようにして重み決定部１０４が決定した部分木毎の重みを示す。図１１に示すように、重み決定部１０４は、圧縮対象重複部分木の根ノード（ノード１１、ノード１８、ノード２３）の重みを「１」としている。また、重み決定部１０４は、圧縮対象重複部分木の根ノード以外のノードについては、重みの決定を省略している。これは、圧縮対象重複部分木の根ノード以外のノードについては実質的に直列化処理が実行されないため、処理量に相当する重みを関連付ける必要がないからである。 One of the simplest weight determination methods for the compression target overlapping subtree by the weight determination unit 104 is to determine the weight as “1”. FIG. 11 shows the weight for each subtree determined by the weight determination unit 104 in this way. As shown in FIG. 11, the weight determination unit 104 sets the weight of the root node (node 11, node 18, node 23) of the compression target overlapping subtree to “1”. In addition, the weight determination unit 104 omits the determination of the weight for nodes other than the root node of the compression target overlapping subtree. This is because the serialization processing is not substantially executed for nodes other than the root node of the compression target overlapping subtree, and it is not necessary to associate a weight corresponding to the processing amount.

このような重み決定部１０４による重み決定方法によれば、木構造データの総重みは２０となる。分割部１０５は、例えば、辞書式圧縮対象の部分木の根ノードと直列化順序における当該根ノードの直前のノードとを分割することを許容しつつ、重複部分木に含まれるノードと直列化順序における当該ノードの直前のノードとを分割しないようにすると共に、ノード列毎の重みがノード列間で均等化されるように木構造データを分割する。すなわち、分割部１０５は、図１１に示す分割境界Ｄ５，Ｄ６で木構造データを分割することにより、各ノード列の重みが「７」、「７」、「６」と均等になるように木構造データを分割する。 According to such a weight determination method by the weight determination unit 104, the total weight of the tree structure data is 20. The dividing unit 105 allows, for example, dividing the root node of the subtree of the lexicographic compression target and the node immediately before the root node in the serialization order, and the nodes included in the overlapping subtree and the serial node in the serialization order. The tree structure data is divided so that the node immediately before the node is not divided and the weight for each node row is equalized between the node rows. That is, the dividing unit 105 divides the tree structure data at the dividing boundaries D5 and D6 shown in FIG. 11 so that the weights of the respective node strings are equal to “7”, “7”, and “6”. Divide the structure data.

図１２は、木構造データを図１１に示す分割境界Ｄ５，Ｄ６で分割したスレッド毎に辞書式圧縮及び連長圧縮した上で直列化した場合における直列化結果を示す図である。図１２に示すように、重み決定部１０４が上述のように決定した重みに基づいて分割部１０５が分割境界Ｄ５，Ｄ６で木構造データを分割することによって、スレッド間における直列化の処理対象とされたノードの数が、図１０に示す場合と比較して均等化されている。 FIG. 12 is a diagram showing a serialization result when the tree structure data is serialized after lexicographic compression and continuous length compression for each thread divided at the division boundaries D5 and D6 shown in FIG. As shown in FIG. 12, the dividing unit 105 divides the tree-structured data at the dividing boundaries D5 and D6 based on the weights determined by the weight determining unit 104 as described above. The number of performed nodes is equalized compared to the case shown in FIG.

以上のように、データ直列化装置１によれば、圧縮対象重複部分木に対して、処理量に応じたより適切な重み付けを決定することができる。そして、分割部１０５がこのような重み付けの結果を利用して、ノード列毎の重みがノード列間で均等化されるように木構造データを分割することができる。これにより、スレッド間の処理量を均等化し、全体の処理時間をより短縮することができる。 As described above, according to the data serialization apparatus 1, it is possible to determine a more appropriate weight according to the processing amount for the compression target overlapping subtree. Then, the division unit 105 can divide the tree structure data so that the weight for each node sequence is equalized between the node sequences by using the result of such weighting. Thereby, the processing amount between threads can be equalized, and the overall processing time can be further shortened.

次に、結合部１０８が、スレッド毎の直列化結果を直列化順序で一つずつ入力して所定のデータ圧縮処理を実行した上で、スレッド毎の直列化結果を結合する場合を考える。より具体的には、結合部１０８が、直列化部１０７による各スレッドにおける直列化処理により得られた直列化結果について、前方部分の直列化結果から順次ＺＩＰ圧縮等のデータ圧縮処理を実行した上で、スレッド毎の直列化結果を結合する場合を考える。 Next, consider a case where the combining unit 108 inputs serialized results for each thread one by one in the serialization order, executes a predetermined data compression process, and then combines the serialized results for each thread. More specifically, the combining unit 108 executes data compression processing such as ZIP compression sequentially from the serialization result of the front portion of the serialization result obtained by the serialization processing in each thread by the serialization unit 107. Now consider the case of combining the serialized results for each thread.

例えば、部分バイト列を順番に入力し、入力したバイト列をＺＩＰ圧縮した上で結合した結果を出力する技術として、Ｊａｖａ（登録商標）言語におけるＧＺＩＰＯｕｔｐｕｔＳｔｒｅａｍオブジェクト等が存在する。結合部１０８は、例えば、ＧＺＩＰＯｕｔｐｕｔＳｔｒｅａｍオブジェクトにスレッド毎の直列化結果を一つずつ入力することにより、ＺＩＰ圧縮がされた上でスレッド毎の直列化結果が全て結合された一つの直列化結果を取得することができる。ここで、結合部１０８は、スレッド毎の直列化結果をいったん結合してからＧＺＩＰＯｕｔｐｕｔＳｔｒｅａｍオブジェクトに入力する必要がなく、スレッド毎の直列化結果をＧＺＩＰＯｕｔｐｕｔＳｔｒｅａｍオブジェクトに順次入力するだけでよい。 For example, there is a GZIPOutputStream object or the like in Java (registered trademark) as a technique for inputting partial byte sequences in order, outputting the result of combining the input byte sequences after ZIP compression. The combining unit 108, for example, inputs one serialized result for each thread to the GZIPOutputStream object, thereby obtaining one serialized result in which all the serialized results for each thread are combined after being ZIP compressed. can do. Here, the combining unit 108 does not need to combine the serialized results for each thread and then input the serialized results for each thread to the GZIPOutputStream object in order, instead of inputting them to the GZIPOutputStream object.

本実施形態では、結合部１０８が、上記のようなＧＺＩＰＯｕｔｐｕｔＳｔｒｅａｍオブジェクト等を用いて、スレッド毎の直列化結果をＺＩＰ圧縮した上で結合した一つの直列化結果を取得する場合について考える。 In the present embodiment, consider a case where the combining unit 108 obtains one serialized result obtained by combining the serialized results for each thread after ZIP compression using the GZIPOutputStream object as described above.

ＧＺＩＰＯｕｔｐｕｔＳｔｒｅａｍオブジェクトによるＺＩＰ圧縮処理は、先に入力されたバイト列から順次直列的に実行される。したがって、スレッド毎の処理負荷が均等化されていると、図１３（ａ）に示すように、スレッド１の直列化結果に対するＺＩＰ圧縮処理（所定のデータ圧縮処理）が完了する前に、スレッド２の直列化処理が完了してしまう場合が生じ得る。この場合、スレッド２の直列化結果に対するＺＩＰ圧縮処理を直ちに開始することができないため「１」の処理待ち時間が発生する。スレッド３についても同様に「２」の処理待ち時間が発生する。図１３（ａ）に示すように、スレッド間の処理負荷が均等化されてスレッド毎の直列化処理時間が「３」と等しくなる場合には、全スレッドの直列化結果に対するＺＩＰ圧縮処理が完了するまでに合計で「６」の処理時間がかかる。ここで、ＺＩＰ圧縮処理にかかる処理時間は「１」としている。 The ZIP compression processing by the GZIPOutputStream object is sequentially executed serially from the previously input byte string. Therefore, when the processing load for each thread is equalized, as shown in FIG. 13A, before the ZIP compression process (predetermined data compression process) for the serialized result of the thread 1 is completed, the thread 2 The serialization process may be completed. In this case, since the ZIP compression process for the serialized result of the thread 2 cannot be started immediately, a processing waiting time of “1” occurs. Similarly, a processing wait time of “2” occurs for the thread 3 as well. As shown in FIG. 13A, when the processing load between threads is equalized and the serialization processing time for each thread becomes equal to “3”, the ZIP compression processing for the serialization results of all threads is completed. It takes a total of “6” processing time. Here, the processing time required for the ZIP compression process is “1”.

そこで、分割部１０５は、重み決定部１０４により決定された各部分木の重みに基づいて定まるノード列毎の重みが直列化順序の若いノード列（すなわち番号が若いスレッドに割り当てられるノード列）ほど小さくなるように、木構造データを分割してもよい。この場合、図１３（ｂ）に示すように、スレッド番号が若いスレッドほど、直列化処理の処理負荷が少なくなり、直列化処理が早く完了することとなる。これにより、直列化部１０７による直列化処理と結合部１０８によるＺＩＰ圧縮処理とが部分的に並列実行される。図１３（ｂ）に示すように、スレッド間の処理負荷に傾斜がつけられてスレッド１〜３の直列化処理時間がそれぞれ「２」、「３」、「４」となる場合には、全スレッドの直列化結果に対するＺＩＰ圧縮処理が完了するまでの処理時間を「５」に短縮することができる。 Therefore, the dividing unit 105 determines that the weight of each node sequence determined based on the weight of each subtree determined by the weight determining unit 104 is a node sequence with a younger serialization order (that is, a node sequence assigned to a thread with a lower number). The tree structure data may be divided so as to be smaller. In this case, as shown in FIG. 13B, the processing load of the serialization process decreases as the thread number becomes smaller, and the serialization process is completed earlier. Thereby, the serialization processing by the serialization unit 107 and the ZIP compression processing by the combining unit 108 are partially executed in parallel. As shown in FIG. 13B, when the processing load between threads is inclined and the serialization processing times of the threads 1 to 3 are “2”, “3”, and “4”, respectively, The processing time until the ZIP compression processing for the serialized result of the thread is completed can be shortened to “5”.

以上のように、分割部１０５が、スレッド間の処理負荷にあえて傾斜をつけるように木構造データを分割することにより、全体の処理において、直列化処理とＺＩＰ圧縮処理とを一部並列的に実行させることができ、全体の処理効率を向上させることができる。 As described above, the dividing unit 105 divides the tree-structured data so as to incline the processing load between threads, so that serialization processing and ZIP compression processing are partially performed in parallel in the overall processing. It can be executed, and the overall processing efficiency can be improved.

次に、図１４及び図１５を用いて、データ直列化装置１の動作（本実施形態に係るデータ直列化方法を含む）について説明する。図１４は、データ直列化装置の動作を示すフロー図である。図１５は、分割部１０５が図９に示す分割境界Ｄ３，Ｄ４を決定する処理を含む分割処理を示すフロー図である。 Next, the operation of the data serialization apparatus 1 (including the data serialization method according to the present embodiment) will be described using FIG. 14 and FIG. FIG. 14 is a flowchart showing the operation of the data serializer. FIG. 15 is a flowchart showing a dividing process including a process in which the dividing unit 105 determines the dividing boundaries D3 and D4 shown in FIG.

まず、木構造構築部１０１によって木構造データが構築される（ステップＳ１０１）。続いて、重複部分木検出部１０２によって木構造データに含まれる重複部分木が検出され（ステップＳ１０２）、重複部分木検出部１０２により検出された重複部分木に対応するデータと符号とを対応付けた辞書テーブルが、辞書テーブル生成部１０３によって生成される（ステップＳ１０３）。 First, tree structure data is constructed by the tree structure construction unit 101 (step S101). Subsequently, an overlapping subtree included in the tree structure data is detected by the overlapping subtree detection unit 102 (step S102), and data corresponding to the overlapping subtree detected by the overlapping subtree detection unit 102 is associated with a code. The dictionary table is generated by the dictionary table generation unit 103 (step S103).

続いて、重み決定部１０４によって、各ノードを根ノードとする部分木毎に、当該部分木の大きさに基づく重みが決定される（ステップＳ１０４、重み決定ステップ）。直列化部１０７による直列化において辞書式圧縮及び連長圧縮が併せて実行される場合には、ステップＳ１０４において、重み決定部１０４は、これらの圧縮処理が実行される対象となる重複部分木の重みを、これらの圧縮処理が実行されない場合よりも小さくしてもよい。 Subsequently, the weight determination unit 104 determines a weight based on the size of the subtree for each subtree having each node as a root node (step S104, weight determination step). When serialization by the serialization unit 107 is performed together with lexicographic compression and run length compression, in step S104, the weight determination unit 104 determines that the overlapping subtree to be subjected to these compression processes is executed. The weight may be made smaller than when these compression processes are not executed.

続いて、重み決定部１０４により決定された各部分木の重みに基づいて、木構造データをそれぞれ直列化順序が連続する一以上のノードからなる複数のノード列に分割する分割処理が、分割部１０５によって実行される（ステップＳ１０５、分割ステップ）。 Subsequently, based on the weight of each subtree determined by the weight determination unit 104, a division process for dividing the tree structure data into a plurality of node sequences each including one or more nodes in which the serialization order is continuous is performed by the division unit. (Step S105, division step).

図１５を用いて、分割部１０５が図９に示す分割境界Ｄ３，Ｄ４を決定し、当該分割境界Ｄ３，Ｄ４で木構造データを分割する処理について説明する。 With reference to FIG. 15, a description will be given of a process in which the dividing unit 105 determines the division boundaries D3 and D4 shown in FIG. 9 and divides the tree structure data at the division boundaries D3 and D4.

まず、分割部１０５は、分割境界の候補を仮決定する（ステップＳ２０１）。ここでは一例として、図５に示す分割境界Ｄ１，Ｄ２が分割境界の候補として仮決定されるものとする。続いて、分割部１０５は、未確定の分割境界の候補（分割境界Ｄ１，Ｄ２）のうち最も前方の候補（分割境界Ｄ１）を選択する（ステップＳ２０２）。続いて、分割部１０５は、確定済みの分割境界があるか否かを判定する（ステップＳ２０３）。ここでは、確定済みの分割境界はないため（ステップＳ２０３：ＮＯ）、分割部１０５は、ステップＳ２０２で選択した分割境界Ｄ１の直後のノード１１（候補ノード）の先祖系列を抽出する（ステップＳ２０６）。 First, the dividing unit 105 provisionally determines a division boundary candidate (step S201). Here, as an example, the division boundaries D1 and D2 shown in FIG. 5 are provisionally determined as division boundary candidates. Subsequently, the dividing unit 105 selects the foremost candidate (division boundary D1) among the undetermined division boundary candidates (division boundaries D1 and D2) (step S202). Subsequently, the dividing unit 105 determines whether there is a determined division boundary (step S203). Here, since there is no confirmed division boundary (step S203: NO), the division unit 105 extracts the ancestor series of the node 11 (candidate node) immediately after the division boundary D1 selected in step S202 (step S206). .

先祖系列とは、木構造データの根ノードであるノード１から候補ノードであるノード１１までの接続関係を直列化順序の系列で表したものである。ただし、ノード１を根ノードとする木（木構造データそのもの）全体が辞書式圧縮及び連長圧縮等の対象となることはないため、先祖系列から除外する。したがって、例えば、ノード１１の先祖系列は、「ノード７、ノード１１」となり、ノード２０の先祖系列は、「ノード１７、ノード１８、ノード２０」となる。 An ancestor series represents a connection relationship from a node 1 that is a root node of tree-structured data to a node 11 that is a candidate node in a series of serialization order. However, the entire tree having the node 1 as the root node (the tree structure data itself) is not subject to lexicographic compression, run length compression, or the like, and is therefore excluded from the ancestor series. Therefore, for example, the ancestor series of the node 11 is “node 7, node 11”, and the ancestor series of the node 20 is “node 17, node 18, node 20”.

続いて、分割部１０５は、候補ノードであるノード１１の先祖系列から自ノード（ノード１１）を除いた先祖（ノード７）のうちに、辞書式圧縮又は連長圧縮の対象となる部分木の根ノードがあるか否かを判定する（ステップＳ２０７）。ここで、ノード７は、重複部分木の根ノードではなく、いずれの圧縮対象ともならないため（ステップＳ２０７：ＮＯ）、分割部１０５は、候補ノードであるノード１１が連長圧縮対象となる部分木の根ノードであるか否かを判定する（ステップＳ２０８）。ここで、ノード１１を根ノードとする部分木は、ノード８を根ノードとする部分木との連長圧縮対象となるため（ステップＳ２０８：ＹＥＳ）、分割部１０５は、分割境界の候補（分割境界Ｄ１）を連長圧縮対象の部分木の直後、すなわちノード１３とノード１４とを分割する位置に移動する（ステップＳ２０９）。なお、分割部１０５は、例えば辞書テーブルの参照及び木構造データの巡回走査等により、ステップＳ２０７及びステップＳ２０８の判定を実行する。 Subsequently, the dividing unit 105 includes a root node of a subtree to be subjected to lexicographic compression or run length compression in an ancestor (node 7) obtained by removing the own node (node 11) from an ancestor sequence of the node 11 that is a candidate node. It is determined whether or not there is (step S207). Here, since the node 7 is not the root node of the overlapping subtree and cannot be any compression target (step S207: NO), the dividing unit 105 is the root node of the subtree in which the node 11 that is the candidate node is the target of the continuous length compression. It is determined whether or not there is (step S208). Here, the subtree having the node 11 as the root node is subject to continuous length compression with the subtree having the node 8 as the root node (step S208: YES). The boundary D1) is moved to a position immediately after the subtree to be subjected to continuous length compression, that is, the position at which the node 13 and the node 14 are divided (step S209). Note that the dividing unit 105 executes the determinations in step S207 and step S208, for example, by referring to a dictionary table and cyclic scanning of tree structure data.

続いて、分割部１０５は、移動させた後の分割境界（図９の分割境界Ｄ３）に基づいて定まる候補ノード（ノード１４）について、ステップＳ２０６〜ステップＳ２０８を実行する。ノード１４の先祖系列はノード１４のみであり（ステップＳ２０７：ＮＯ）、ノード１４は連長圧縮対象となる部分木の根ノードではないため（ステップＳ２０８：ＮＯ）、分割部１０５は、移動させた後の分割境界Ｄ３を分割境界として確定させる（ステップＳ２１０）。 Subsequently, the dividing unit 105 executes Steps S206 to S208 for the candidate node (node 14) determined based on the moved dividing boundary (the dividing boundary D3 in FIG. 9). Since the ancestor series of the node 14 is only the node 14 (step S207: NO), and the node 14 is not the root node of the subtree to be subjected to the continuous length compression (step S208: NO), the dividing unit 105 The division boundary D3 is determined as the division boundary (step S210).

続いて、未確定の分割境界の候補Ｄ２が残っているため（ステップＳ２１１：ＹＥＳ）、分割部１０５は、分割境界Ｄ２を選択する（ステップＳ２０２）。続いて、分割部１０５は、確定済みの分割境界があるか否かを判定する（ステップＳ２０３）。ここでは、確定済みの分割境界Ｄ３があるため（ステップＳ２０３：ＹＥＳ）、分割部１０５は、候補ノードであるノード２０が、直前の確定済み分割境界Ｄ３よりも後方にあるか否かを判定する（ステップＳ２０４）。ここで、ノード２０は分割境界Ｄ３よりも後方にあるため（ステップＳ２０４：ＹＥＳ）、分割部１０５は、候補ノードであるノード２０の先祖系列を抽出する（ステップＳ２０６）。 Subsequently, since there are still undecided division boundary candidates D2 (step S211: YES), the division unit 105 selects the division boundary D2 (step S202). Subsequently, the dividing unit 105 determines whether there is a determined division boundary (step S203). Here, since there is a confirmed division boundary D3 (step S203: YES), the division unit 105 determines whether or not the node 20, which is a candidate node, is behind the previous decision division boundary D3. (Step S204). Here, since the node 20 is behind the division boundary D3 (step S204: YES), the division unit 105 extracts the ancestor series of the node 20 that is a candidate node (step S206).

なお、ステップＳ２０４の判定において仮に候補ノードが直前の確定済み分割境界よりも後方にないと判定された場合（ステップＳ２０４：ＮＯ）には、分割部１０５は、分割境界の候補を直前の確定済み分割境界よりも後方の任意の位置に移動させた上で（ステップＳ２０５）、ステップＳ２０６の処理を実行する。 If it is determined in step S204 that the candidate node is not behind the previous determined division boundary (step S204: NO), the division unit 105 determines the division boundary candidate immediately before. After moving to an arbitrary position behind the division boundary (step S205), the process of step S206 is executed.

続いて、分割部１０５は、候補ノードであるノード２０の先祖系列から自ノード（ノード２０）を除いた先祖（ノード１７、ノード１８）のうちに、辞書式圧縮又は連長圧縮の対象となる部分木の根ノードがあるか否かを判定する（ステップＳ２０７）。ここで、ノード１８を根ノードとする部分木は、ノード２を根ノードとする部分木と重複する重複部分木であり、辞書式圧縮の対象となっている（ステップＳ２０７：ＹＥＳ）。したがって、分割部１０５は、分割境界の候補（分割境界Ｄ２）を辞書式圧縮対象の部分木の直後、すなわちノード２２とノード２３とを分割する位置に移動する（ステップＳ２０９）。 Subsequently, the dividing unit 105 becomes a target of lexicographic compression or run length compression in the ancestors (nodes 17 and 18) obtained by excluding the own node (node 20) from the ancestor series of the node 20 that is a candidate node. It is determined whether there is a root node of the subtree (step S207). Here, the subtree having the node 18 as the root node is an overlapping subtree that overlaps the subtree having the node 2 as the root node, and is a target of lexicographic compression (step S207: YES). Therefore, the division unit 105 moves the division boundary candidate (division boundary D2) to a position immediately after the partial tree to be lexicographically compressed, that is, the position where the node 22 and the node 23 are divided (step S209).

続いて、分割部１０５は、移動させた後の分割境界（図９の分割境界Ｄ４）に基づいて定まる候補ノード（ノード２３）について、ステップＳ２０６〜ステップＳ２０８を実行する。ノード２３の先祖系列は（ノード１７、ノード２３）であるが、いずれのノードも辞書式圧縮又は連長圧縮の対象となる部分木の根ノードではない（ステップＳ２０７：ＮＯ）。また、ノード２３を根ノードとする部分木は、ノード８を根ノードとする部分木と重複する重複部分木であり、辞書式圧縮の対象となるものの、連長圧縮対象となる部分木の根ノードではない（ステップＳ２０８：ＮＯ）。したがって、分割部１０５は、移動させた後の分割境界Ｄ４を分割境界として確定させる（ステップＳ２１０）。 Subsequently, the dividing unit 105 executes Steps S206 to S208 for the candidate node (node 23) determined based on the moved dividing boundary (the dividing boundary D4 in FIG. 9). The ancestor series of the node 23 is (node 17, node 23), but none of the nodes is the root node of the subtree subject to lexicographic compression or run length compression (step S207: NO). Further, the subtree having the node 23 as a root node is an overlapping subtree that overlaps the subtree having the node 8 as a root node, and is a lexicographic compression target. No (step S208: NO). Therefore, the dividing unit 105 determines the divided boundary D4 after the movement as a divided boundary (step S210).

以上の処理により、未確定の分割境界の候補がなくなったため（ステップＳ２１１：ＮＯ）、分割部１０５は、確定した分割境界Ｄ３，Ｄ４で木構造データを分割する（ステップＳ２１２）。 As a result of the above processing, there are no more undetermined division boundary candidates (step S211: NO), and the division unit 105 divides the tree structure data at the decided division boundaries D3 and D4 (step S212).

続いて、スレッド割当部１０６によって、分割部１０５により分割されて生成された複数のノード列のそれぞれに対して１つのスレッドが割り当てられる（ステップＳ１０６、スレッド割当ステップ）。 Subsequently, one thread is allocated to each of the plurality of node sequences generated by the division unit 105 by the thread allocation unit 106 (step S106, thread allocation step).

続いて、直列化部１０７によって、スレッド毎に並列して、ノード列に含まれる各ノードに対応するデータが直列化順序でバイト列に直列化される（ステップＳ１０７、直列化ステップ）。ステップＳ１０５において分割部１０５がステップＳ２０１〜Ｓ２１２に示される分割処理を実行して木構造データを分割している場合には、ステップＳ１０７において、直列化部１０７は、重複部分木単位に辞書式圧縮及び連長圧縮（所定の圧縮処理）を実行した上で直列化を実行することができる。 Subsequently, the serialization unit 107 serializes the data corresponding to each node included in the node sequence into a byte sequence in the serialization order in parallel for each thread (step S107, serialization step). In step S105, when the dividing unit 105 executes the dividing process shown in steps S201 to S212 to divide the tree structure data, in step S107, the serializing unit 107 performs lexicographic compression in units of overlapping subtrees. In addition, serialization can be performed after executing continuous length compression (predetermined compression processing).

続いて、結合部１０８によって、直列化部１０７によって得られたスレッド毎の直列化結果が、直列化順序で結合される（ステップＳ１０８〜Ｓ１１０、結合ステップ）。例えばスレッドが３つ（処理対象のノード列の直列化順序が若い順にスレッド１、スレッド２、スレッド３）であり、スレッド毎の直列化結果が３つ得られた場合を考える。この場合、結合部１０８は、まず先頭の直列化結果（スレッド１の直列化結果）に対する未結合部分（スレッド２以降の直列化結果）のうち最も前方のスレッド２の直列化結果を選択し（ステップＳ１０８）、選択したスレッド２の直列化結果をスレッド１の直列化結果の後方に結合する（ステップＳ１０９）。続いて、結合部１０８は、未結合部分があるか否かを判定する（ステップＳ１１０）。ここで、未結合部分（スレッド３の直列化結果）が存在するため（ステップＳ１１０：ＹＥＳ）、結合部１０８は、既に結合済みの部分（スレッド１及び２の直列化結果を結合した部分）の後方に、スレッド３の直列化結果を結合する（ステップＳ１０８，Ｓ１０９）。これにより、全ての直列化結果が結合され、未結合部分がなくなるため（ステップＳ１１０：ＮＯ）、結合部１０８は、結合処理を完了する。 Subsequently, the serialization results for each thread obtained by the serialization unit 107 are combined in the serialization order by the combining unit 108 (steps S108 to S110, combining step). For example, consider a case where there are three threads (thread 1, thread 2, and thread 3 in ascending order of serialization order of processing target node strings) and three serialization results for each thread are obtained. In this case, the combining unit 108 first selects the serialization result of the frontmost thread 2 from the unconnected parts (serialization results after the thread 2) with respect to the first serialization result (serialization result of the thread 1) ( In step S108, the serialized result of the selected thread 2 is coupled to the rear of the serialized result of thread 1 (step S109). Subsequently, the combining unit 108 determines whether there is an unconnected part (step S110). Here, since there is an uncombined part (result of serialization of the thread 3) (step S110: YES), the joint unit 108 is a part of the part that has already been joined (part where the serialization results of the threads 1 and 2 are joined). The serialized results of the thread 3 are combined backward (steps S108 and S109). As a result, all serialization results are combined and there is no unbonded portion (step S110: NO), so the combining unit 108 completes the combining process.

ステップＳ１０９において、結合部１０８は、結合対象の直列化結果を入力してＺＩＰ圧縮等の所定のデータ圧縮処理を実行した上で結合することができる。 In step S109, the combining unit 108 can input the serialization result to be combined and execute a predetermined data compression process such as ZIP compression to combine them.

最後に、出力部１０９によって、結合部１０８により結合された直列化結果が、直列化結果を利用する所定のサーバ装置（不図示）及び直列化結果を一定期間保存しておくためのストレージ（不図示）等に出力される（ステップＳ１１１、出力ステップ）。 Finally, the serialization result combined by the combining unit 108 by the output unit 109 is a predetermined server device (not shown) that uses the serialization result, and storage for storing the serialization result for a certain period of time. (Step S111, output step).

次に、図１６を用いて、コンピュータをデータ直列化装置１として機能させるためのデータ直列化プログラムＰ１について説明する。図１６は、データ直列化プログラムＰ１のモジュール構成を示すブロック図である。図１６に示すように、データ直列化プログラムＰ１は、重み決定モジュールＰ１１、分割モジュールＰ１２、スレッド割当モジュールＰ１３、直列化モジュールＰ１４、結合モジュールＰ１５、及び出力モジュールＰ１６を備える。上記重み決定モジュールＰ１１、分割モジュールＰ１２、スレッド割当モジュールＰ１３、直列化モジュールＰ１４、結合モジュールＰ１５、及び出力モジュールＰ１６が実行されることにより実現される機能は、上述したデータ直列化装置１において対応する重み決定部１０４、分割部１０５、スレッド割当部１０６、直列化部１０７、結合部１０８、及び出力部１０９の機能と同様である。 Next, a data serialization program P1 for causing a computer to function as the data serialization device 1 will be described with reference to FIG. FIG. 16 is a block diagram showing a module configuration of the data serialization program P1. As shown in FIG. 16, the data serialization program P1 includes a weight determination module P11, a division module P12, a thread assignment module P13, a serialization module P14, a combination module P15, and an output module P16. The functions realized by executing the weight determination module P11, the division module P12, the thread allocation module P13, the serialization module P14, the combination module P15, and the output module P16 correspond to the data serialization apparatus 1 described above. The functions are the same as those of the weight determination unit 104, the division unit 105, the thread allocation unit 106, the serialization unit 107, the combination unit 108, and the output unit 109.

このように構成されたデータ直列化プログラムＰ１は、図２に示す記録媒体１８に記憶され、データ直列化装置１として用いられるコンピュータにより実行される。当該コンピュータは、記録媒体１８が記録媒体読取部１７に挿入されると、記録媒体読取部１７から記録媒体１８に格納されたデータ直列化プログラムＰ１にアクセス可能となり、当該データ直列化プログラムＰ１を実行することによって、本実施形態に係るデータ直列化装置１として動作することが可能となる。 The data serialization program P1 thus configured is stored in the recording medium 18 shown in FIG. 2 and is executed by a computer used as the data serialization device 1. When the recording medium 18 is inserted into the recording medium reading unit 17, the computer can access the data serialization program P1 stored in the recording medium 18 from the recording medium reading unit 17, and executes the data serialization program P1. By doing so, it becomes possible to operate as the data serialization apparatus 1 according to the present embodiment.

データ直列化プログラムＰ１は、搬送波に重畳されたコンピュータデータ信号としてネットワークを介して提供されるものであってもよい。この場合、データ直列化装置１として用いられるコンピュータは、通信制御部１４によって受信したデータ直列化プログラムＰ１を主記憶部１２に格納することにより、当該データ直列化プログラムＰ１を実行することができる。 The data serialization program P1 may be provided via a network as a computer data signal superimposed on a carrier wave. In this case, the computer used as the data serialization device 1 can execute the data serialization program P1 by storing the data serialization program P1 received by the communication control unit 14 in the main storage unit 12.

以上、本実施形態に係るデータ直列化装置１について詳細に説明した。データ直列化装置１によれば、木構造データのバイナリ形式への直列化処理を複数スレッドで並列的に処理することにより、木構造データの直列化処理を高速化することができる。データ直列化装置１は、木構造データをバイナリ形式で直列化するあらゆる用途に用いることができる。例えば、ビッグデータを保存する用途では、保存するデータサイズを削減するために圧縮処理（辞書式圧縮、連長圧縮、ＺＩＰ圧縮等）を併用した直列化を適切かつ高速に実行することができる。また、データをネットワーク上で伝送する用途では、元データ（木構造データ）からバイト列に高速に直列化することにより、伝送するバイナリデータ（直列化結果）を高速に生成することができる。 Heretofore, the data serialization apparatus 1 according to the present embodiment has been described in detail. According to the data serialization apparatus 1, the serialization processing of the tree structure data can be speeded up by processing the serialization processing of the tree structure data into the binary format in parallel by a plurality of threads. The data serialization device 1 can be used for any application that serializes tree structure data in binary format. For example, in an application for storing big data, serialization using compression processing (lexicographic compression, run length compression, ZIP compression, etc.) in combination to reduce the data size to be stored can be executed appropriately and at high speed. In applications where data is transmitted over a network, binary data (serialization result) to be transmitted can be generated at high speed by serializing the original data (tree structure data) into a byte string at high speed.

１…データ直列化装置、１０１…木構造構築部、１０２…重複部分木検出部、１０３…辞書テーブル生成部、１０４…重み決定部、１０５…分割部、１０６…スレッド割当部、１０７…直列化部、１０８…結合部、１０９…出力部、Ｄ１〜Ｄ６…分割境界、Ｐ１…データ直列化プログラム、Ｐ１１…重み決定モジュール、Ｐ１２…分割モジュール、Ｐ１３…スレッド割当モジュール、Ｐ１４…直列化モジュール、Ｐ１５…結合モジュ
ール、Ｐ１６…出力モジュール。 DESCRIPTION OF SYMBOLS 1 ... Data serialization apparatus, 101 ... Tree structure construction part, 102 ... Duplicate subtree detection part, 103 ... Dictionary table production | generation part, 104 ... Weight determination part, 105 ... Dividing part, 106 ... Thread allocation part, 107 ... Serialization 108, coupling unit, 109 ... output unit, D1 to D6 ... division boundary, P1 ... data serialization program, P11 ... weight determination module, P12 ... division module, P13 ... thread allocation module, P14 ... serialization module, P15 ... coupling module, P16 ... output module.

Claims

Predetermined tree structure data having leaf nodes that hold values, intermediate nodes that hold leaf nodes or other intermediate nodes as child nodes, and root nodes that are the only intermediate nodes that do not have a parent node A data serializer that serializes in a serialization order,
For each subtree having each node as a root node, weight determining means for determining a weight based on the size of the subtree;
A dividing unit configured to divide the tree structure data into a plurality of node sequences each including one or more nodes in which the serialization order is continuous based on the weight of each subtree determined by the weight determining unit;
Thread assignment means for assigning one thread to each of the plurality of node sequences generated by being divided by the dividing means;
Serializing means for serializing data corresponding to each node included in the node sequence into a byte sequence in the serialization order in parallel for each thread;
Coupling means for coupling the serialization results for each thread obtained by the serialization means in the serialization order;
Output means for outputting the serialized result combined by the combining means;
A data serialization device comprising:

The serialization means performs serialization after executing a predetermined compression process on overlapping subtree units indicating equivalent subtrees that appear multiple times in the tree structure data,
The dividing unit divides the tree structure data so as not to divide a node included in the overlapping subtree and a node immediately before the node in the serialization order;
The data serialization device according to claim 1.

As one of the compression processes included in the predetermined compression process, the duplication unit performs the duplication based on a dictionary table in which data and codes corresponding to the duplication subtree are stored in association with each other in advance. Performing lexicographic compression replacing the data corresponding to the subtree with the code;
Allowing the dividing means to divide a root node of a subtree for which the lexicographic compression is executed by the serializing means and a node immediately before the root node in the serialization order; Split structural data,
The data serialization apparatus according to claim 2.

The weight determination means makes the weight of the overlapping subtree to be subjected to the predetermined compression processing smaller than when the predetermined compression processing is not performed;
The data serialization device according to claim 2 or 3.

The dividing unit divides the tree structure data so that a weight for each node sequence determined based on a weight of each subtree determined by the weight determining unit is equalized between the node sequences;
The data serialization apparatus as described in any one of Claims 1-4.

The dividing unit divides the tree-structured data so that the weight of each node sequence determined based on the weight of each subtree determined by the weight determining unit is smaller for the node sequence having a lower serialization order. To
The data serialization apparatus as described in any one of Claims 1-4.

The combining means inputs serialized results for each thread one by one in the serialization order and executes a predetermined data compression process, and then combines the serialized results for each thread.
The data serialization apparatus according to claim 6.

Predetermined tree structure data having leaf nodes that hold values, intermediate nodes that hold leaf nodes or other intermediate nodes as child nodes, and root nodes that are the only intermediate nodes that do not have a parent node A data serialization method performed by a data serializer that serializes in a serialization order, comprising:
For each subtree having each node as a root node, a weight determination step for determining a weight based on the size of the subtree;
A division step of dividing the tree-structured data into a plurality of node sequences each including one or more nodes in which the serialization order is continuous based on the weight of each subtree determined in the weight determination step;
A thread allocation step of allocating one thread to each of the plurality of node sequences generated by the division in the division step;
A serialization step of serializing data corresponding to each node included in the node sequence into a byte sequence in the serialization order in parallel for each thread;
A combining step of combining the serialized results for each thread obtained in the serializing step in the serialization order;
An output step for outputting the serialized result combined in the combining step;
A data serialization method including:

Predetermined tree structure data having leaf nodes that hold values, intermediate nodes that hold leaf nodes or other intermediate nodes as child nodes, and root nodes that are the only intermediate nodes that do not have a parent node A computer provided in a data serialization device that serializes in a serialization order,
For each subtree having each node as a root node, weight determining means for determining a weight based on the size of the subtree;
A dividing unit configured to divide the tree structure data into a plurality of node sequences each including one or more nodes in which the serialization order is continuous based on the weight of each subtree determined by the weight determining unit;
Thread assignment means for assigning one thread to each of the plurality of node sequences generated by being divided by the dividing means;
Serializing means for serializing data corresponding to each node included in the node sequence into a byte sequence in the serialization order in parallel for each thread;
Coupling means for coupling the serialization results for each thread obtained by the serialization means in the serialization order;
Output means for outputting the serialized result combined by the combining means;
Data serialization program to function as