JP6898778B2

JP6898778B2 - Machine learning system and machine learning method

Info

Publication number: JP6898778B2
Application number: JP2017110108A
Authority: JP
Inventors: 高橋昌史; 横山徹
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2017-06-02
Filing date: 2017-06-02
Publication date: 2021-07-07
Anticipated expiration: 2037-06-02
Also published as: JP2018206016A

Description

The present invention relates to a machine learning system and a machine learning method.

Patent Document 1 states: "A machine learning device includes an interlayer accelerator. The interlayer accelerator includes a plurality of interlayer units that generate, based on an input vector of a first layer included in a neural network having three or more layers and a learning weight matrix of the first layer, an input vector of a second layer following the first layer. Each of the plurality of interlayer units includes a coupled oscillator array and an activation function applicator. The coupled oscillator array includes a plurality of oscillators that oscillate at frequencies corresponding to the differences between the elements of the input vector of the first layer and the elements of a row vector corresponding to one of the rows of the learning weight matrix, and obtains an operation signal by combining the oscillation signals generated by the plurality of oscillators. The activation function applicator generates one of the elements of the input vector of the second layer by applying an activation function to the operation signal."

Non-Patent Document 1 describes a technique for reducing the training time in deep learning by performing parallel distributed processing using a plurality of computing nodes equipped with high-performance hardware.

Patent Document 1: JP-A-2017-33385

Non-Patent Document 1: Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. "Large scale distributed deep networks." In NIPS, 2012.

In recent years, the research and development of deep learning has advanced rapidly, and applications are expected in a variety of fields such as image recognition, character recognition, speech recognition, smartphones, autonomous robots, and drones. In deep learning, training is performed using a model such as a convolutional neural network (CNN) or a feedforward neural network (FFNN).

Training with such a model takes a great deal of time because the number of weight coefficients to be handled is enormous. For this reason, a method has been proposed, for example as disclosed in Non-Patent Document 1, in which the training data and the updating of the weight coefficients are distributed among a plurality of computing nodes (hereinafter referred to as the parallel distributed method).

In the parallel distributed method, however, the weight coefficients and their update amounts must be shared among the computing nodes, so the data transfers performed between computing nodes during training generate a large amount of traffic. In particular, when the number of computing nodes is large, transfers frequently have to wait, which becomes a bottleneck in shortening the training time.

The present invention has been made in view of this background, and an object of the present invention is to provide a machine learning system and a machine learning method capable of performing machine learning efficiently.

One aspect of the present invention for achieving the above object is a machine learning system that includes a plurality of communicably connected information processing devices and performs machine learning by a parallel distributed method using the plurality of information processing devices. Each information processing device includes a transmission/reception unit that exchanges, with the other information processing devices, parameters used for training a model in the machine learning, a compression unit that compresses the parameters by differential encoding when transmitting them to another information processing device, and a decompression unit that decompresses the parameters received from another information processing device. The system further includes a control device communicably connected to each of the information processing devices, and the control device has a quantization parameter control unit that decreases, as the machine learning progresses, the value of the quantization parameter that each information processing device uses in the differential encoding.

Other problems disclosed in the present application and the means for solving them will become apparent from the description of the embodiments and from the drawings.

According to the present invention, machine learning can be performed efficiently.

A diagram showing the schematic configuration of the learning system.
A diagram showing an example of an information processing device used to implement a computing node.
A diagram illustrating the schematic operation of the workers and the parameter server.
A block diagram outlining the functions of the workers and the parameter server.
A diagram illustrating the spatial correlation of a machine learning model.
A diagram illustrating the temporal correlation of a machine learning model.
A diagram showing the relationship between consecutive update times and the weight coefficient W for each of two adjacent positions in the model space of machine learning.
An example of a variable-length code table.
A flowchart illustrating the learning process (parameter server side).
A flowchart illustrating the learning process (worker side).
A processing block diagram detailing the W compression unit of the parameter server.
A processing block diagram detailing the W decompression unit of the worker.
A processing block diagram detailing the ΔW compression unit of the worker.
A processing block diagram detailing the ΔW decompression unit of the parameter server.
(a) is a graph showing an example of the change in the weight coefficient W as the update count increases, (b) is a graph showing the change in the number of workers as the update count increases, and (c) is a graph showing the change in the quantization parameter q with respect to the update count.
Another configuration example of the learning system.
Another configuration example of the learning system.
Another configuration example of the learning system.

Embodiments will be described below with reference to the drawings. In the following description, identical or similar components are given the same reference numerals, and duplicate description may be omitted.

[First Embodiment]
FIG. 1 shows the schematic configuration of an information processing system that performs machine learning (hereinafter referred to as the learning system 1), described as the first embodiment. As shown in the figure, the learning system 1 includes one parameter server 200 and a plurality of workers 100. In the following description, each of the workers 100 and the parameter server 200 may be referred to as a computing node. The following description also assumes that the machine learning is deep learning.

The parameter server 200 and the plurality of workers 100 are communicably connected to one another via a communication line 50. The learning system 1 performs the machine learning processing by a method in which the processing is shared among the plurality of workers 100 and the parameter server 200 (hereinafter referred to as the parallel distributed method).

FIG. 2 shows an example of an information processing device 10 used to implement a computing node. As shown in the figure, the information processing device 10 includes a processor 11, a main storage device 12, an auxiliary storage device 13, an input device 14, an output device 15, and a communication device 16. These components are communicably connected to one another via a communication means such as a bus.

All or part of the information processing device 10 may be configured using virtual resources, such as a cloud server in a cloud system. Of the components shown in the figure, the auxiliary storage device 13, the input device 14, and the output device 15 are not essential.

The processor 11 is configured using, for example, a CPU (Central Processing Unit), an MPU (Micro Processing Unit), a GPU (Graphics Processing Unit), or a DSP (Digital Signal Processor). The processor 11 may be, for example, one of the cores of a multi-core processor, or one of the processors of a multiprocessor system.

The main storage device 12 is, for example, a ROM (Read Only Memory), a RAM (Random Access Memory), or a non-volatile semiconductor memory (NVRAM (Non Volatile RAM)), and stores programs and data. The main storage device 12 may be mounted in the same module or package as the processor 11.

The auxiliary storage device 13 is, for example, a hard disk drive, an SSD (Solid State Drive), an optical storage device (CD (Compact Disc), DVD (Digital Versatile Disc), etc.), a storage system, or a device for reading and writing data on a recording medium such as an IC card, an SD memory card, or an optical recording medium. Programs and data stored in the auxiliary storage device 13 are loaded into the main storage device 12 as needed. The auxiliary storage device 13 may be independent of the information processing device 10, as in the case of network storage.

The input device 14 is an interface that accepts data input from outside, for example a device for reading data from a recording medium (non-volatile memory, optical recording medium, magnetic recording medium, magneto-optical recording medium, etc.), a keyboard, a mouse, or a touch panel. The information processing device 10 may also be configured to accept data input from other devices via the communication device 16.

The output device 15 is a user interface that provides data and information such as processing progress and processing results to the outside, for example a screen display device (liquid crystal display, projector, graphics card, etc.), a printing device, or a device for recording data on a recording medium. The information processing device 10 may also be configured to provide data such as processing progress and processing results to other devices via the communication device 16.

The communication device 16 is a wired or wireless communication interface that enables communication with other devices and elements, for example a NIC (Network Interface Card) or a wireless communication module.

Returning to FIG. 1, the communication line 50 is, for example, a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, a public communication network, or a leased line. For example, it is effective to use InfiniBand, which offers low latency and throughput on the order of gigabits per second, as the communication line 50. When the learning system 1 is implemented using, for example, a multi-core system or a multiprocessor system, the communication line 50 may be configured using a bus. All or part of the communication line 50 may be implemented by either a wired or a wireless method.

For example, if M computing nodes each having one CPU and two GPUs are prepared, the CPU of one of the computing nodes is operated as the parameter server 200, and the GPUs of each computing node are operated as workers 100, a learning system 1 including up to 2 × M workers 100 can be configured. Similarly, if one computing node having one CPU and eight GPUs is prepared, the CPU is operated as the parameter server 200, and the GPUs of the computing node are operated as workers 100, a learning system 1 including up to eight workers 100 can be configured. In this case, data can be transferred over the fast internal bus of the computing node. Both CPUs and GPUs can be used as either a worker 100 or the parameter server 200.

FIG. 3 is a diagram illustrating the schematic operation of the parameter server 200 and the workers 100. As described above, the learning system 1 performs the machine learning processing by the parallel distributed method. Specifically, the training data used for machine learning is divided (for example, into as many parts as there are destination workers 100; this number is hereinafter denoted N), and the divided data is distributed to the workers 100. Each worker 100 obtains the update amount ΔW of the weight coefficients of the model by applying its share of the training data to the machine learning model, and transmits the obtained update amount ΔW to the parameter server 200 via the communication line 50. The parameter server 200, in turn, updates the weight coefficients W using the update amounts ΔW sent from the workers 100, and transmits the latest updated weight coefficients W to each worker 100 via the communication line 50. The learning system 1 advances the machine learning by repeating this processing.
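The exchange described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-weight model y = w·x, the squared-error gradient, the strided sharding, and the names `worker_delta`, `server_update`, and `train` are all hypothetical and do not come from the specification, which also does not say how the parameter server combines the update amounts from several workers (here they are simply applied one after another). Compression of the transferred data is omitted.

```python
def worker_delta(weights, shard):
    """Worker side: update amount (gradient of mean squared error for y = w*x) on one shard."""
    w = weights[0]
    grad = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
    return [grad]


def server_update(weights, deltas, eta):
    """Parameter server side: apply each worker's update amount to the shared weights."""
    for delta in deltas:
        weights = [w - eta * d for w, d in zip(weights, delta)]
    return weights


def train(data, num_workers, rounds, eta):
    # Divide the training data into N shards, one per worker.
    shards = [data[i::num_workers] for i in range(num_workers)]
    weights = [0.0]
    for _ in range(rounds):
        # Each worker computes its update amount from the current weights ...
        deltas = [worker_delta(weights, shard) for shard in shards]
        # ... and the server folds them in and redistributes the new weights.
        weights = server_update(weights, deltas, eta)
    return weights


# Toy data generated from y = 3x (made-up values for illustration only).
data = [(x / 8, 3.0 * x / 8) for x in range(-8, 9) if x != 0]
final = train(data, num_workers=4, rounds=200, eta=0.05)
```

In this toy setting the single weight settles close to 3.0, the slope used to generate the data. In the learning system 1 the two list-valued messages (`deltas` going up, `weights` coming down) are what travels over the communication line 50.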

The method of calculating the update amount ΔW of the weight coefficients is not particularly limited. For example, the worker 100 obtains the update amount ΔW by stochastic gradient descent (SGD) as described in Non-Patent Document 1, error backpropagation, a quasi-Newton method, a genetic algorithm, or the like. The training data 104 assigned to each worker 100 may be transmitted to that worker 100 in advance, or may be stored in a server device or the like shared among the computing nodes via the communication line 50 or the like.

When machine learning is performed by the parallel distributed method, the large amount of data (weight coefficients W and update amounts ΔW) transferred between the parameter server 200 and the workers 100 becomes a problem. For example, if the amount of data to be transferred exceeds what the line speed can carry, transfers have to wait, which lowers throughput. This tendency becomes more pronounced as the number of workers 100 increases, so the large volume of transferred data is a bottleneck when trying to shorten the processing time by increasing the number of workers 100 (the degree of parallelism).

The learning system 1 of the present embodiment therefore compresses the data transferred between the parameter server 200 and the workers 100 (the weight coefficients W and the update amounts ΔW) by differential encoding, thereby reducing the traffic between computing nodes and relieving the bottleneck. Because the efficiency (compression ratio) of differential encoding depends on the machine learning model, the learning system 1 selects a differential encoding method suited to the model.

FIG. 4 is a block diagram outlining the functions of the worker 100 and the parameter server 200. As shown in the figure, the worker 100 includes a W receiving unit 111, a W decompression unit 112, a ΔW calculation unit 113, a ΔW compression unit 114, and a ΔW transmitting unit 115. The parameter server 200 includes a ΔW receiving unit 211, a ΔW decompression unit 212, a W update unit 213, a W compression unit 214, and a W transmitting unit 215.

Each of these functions of the worker 100 and the parameter server 200 is implemented, for example, by the processor 11 reading and executing a program stored in the main storage device 12 or the auxiliary storage device 13. These functions may also be implemented by hardware of the information processing device 10 (an FPGA (Field-Programmable Gate Array), an ASIC (Application Specific Integrated Circuit), etc.).

The functions of the worker 100 are described in order. The W receiving unit 111 receives the weight coefficients W sent from the parameter server 200 via the communication line 50. The W decompression unit 112 decompresses the weight coefficients W received by the W receiving unit 111. The ΔW calculation unit 113 obtains the update amount ΔW of the weight coefficients using the weight coefficients W decompressed by the W decompression unit 112. The ΔW compression unit 114 generates compressed data (hereinafter also referred to as compressed data (ΔW)) by differentially encoding the update amount ΔW obtained by the ΔW calculation unit 113. The ΔW transmitting unit 115 transmits the compressed data (ΔW) generated by the ΔW compression unit 114 to the parameter server 200 via the communication line 50.

Next, the functions of the parameter server 200 are described in order. The ΔW receiving unit 211 receives the compressed data (ΔW) sent from each worker 100 via the communication line 50. The ΔW decompression unit 212 decompresses the compressed data (ΔW) received by the ΔW receiving unit 211 and restores the update amount ΔW of the weight coefficients. The W update unit 213 updates the weight coefficients W based on the update amount ΔW restored by the ΔW decompression unit 212. The W compression unit 214 generates data in which the weight coefficients W updated by the W update unit 213 are compressed (hereinafter also referred to as compressed data (W)). The W transmitting unit 215 transmits the compressed data (W) generated by the W compression unit 214 to each worker 100 via the communication line 50.

The method by which the W update unit 213 updates the weight coefficient W is not particularly limited. As one example, the W update unit 213 updates the weight coefficient W according to the following equation.
W_{x,n+1} = W_{x,n} - η ΔW_{x,n}   ... Equation 1
Here, x is information specifying the position in the deep learning model space to which the weight coefficient is assigned, n is the number of times the weight coefficient has been updated (that is, the update time), and η is the learning coefficient. The learning coefficient η may be a constant, or it may be varied, for example so that its value decreases as training progresses, in order to speed up the convergence of the weight coefficient W.
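As a minimal sketch of Equation 1 with a learning coefficient that shrinks as the update count grows, consider the following Python. The decay rule η = η0 / (1 + decay·n) and the names `learning_rate`, `update_weights`, `eta0`, and `decay` are assumptions chosen for illustration; the specification only says that η may be constant or may decrease as training progresses.

```python
def learning_rate(n, eta0=0.1, decay=0.01):
    """Hypothetical schedule: the learning coefficient shrinks as the update count n grows."""
    return eta0 / (1.0 + decay * n)


def update_weights(w, delta_w, n):
    """Equation 1 applied position by position: W[x] at n+1 = W[x] at n - eta * dW[x] at n."""
    eta = learning_rate(n)
    return [w_x - eta * d_x for w_x, d_x in zip(w, delta_w)]


# One update of two weights at update count n = 0 (values are made up).
w_next = update_weights([0.5, -0.2], [1.0, -2.0], n=0)
```

With a constant learning coefficient, `learning_rate` would simply return `eta0` regardless of `n`.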

Next, the methods by which the W decompression unit 112, the ΔW compression unit 114, the ΔW decompression unit 212, and the W compression unit 214 compress and decompress the weight coefficients W and the update amounts ΔW will be described. The learning system 1 compresses and decompresses the weight coefficients W and the update amounts ΔW efficiently by performing differential encoding that exploits the spatial correlation and temporal correlation of the weight coefficients W.

FIG. 5 is a diagram illustrating the spatial correlation of a convolutional neural network (hereinafter also referred to as a CNN), shown as one example of a deep learning model. FIG. 5(a) is a conceptual diagram of the model space of a CNN. FIG. 5(b) is a graph showing the relationship between the update count n (horizontal axis) and the weight coefficient W_{x-1,n} at position x-1 (vertical axis), and FIG. 5(c) is a graph showing the relationship between the update count n (horizontal axis) and the weight coefficient W_{x,n} at position x, which is adjacent to position x-1 (vertical axis). These graphs show that, when a CNN is used as the deep learning model, the weight coefficients W at adjacent positions are highly correlated (spatial correlation); that is, at the same time (update count n), the difference between the two tends to be concentrated around 0. Encoding this difference therefore raises the compression ratio.
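The effect of spatial correlation can be illustrated with a small Python sketch. The weight values and the names `spatial_residuals` and `restore_from_spatial` are made up for illustration, and quantization is deliberately left out here, so this is plain neighbour differencing rather than the full encoding scheme of the specification.

```python
def spatial_residuals(weights):
    """Difference between each weight and its left neighbour at the same update count.

    The first position has no neighbour, so its value is kept unchanged.
    """
    return [weights[0]] + [weights[i] - weights[i - 1] for i in range(1, len(weights))]


def restore_from_spatial(residuals):
    """Inverse operation: a running sum of the residuals recovers the weights."""
    restored = [residuals[0]]
    for r in residuals[1:]:
        restored.append(restored[-1] + r)
    return restored


# Made-up, smoothly varying weights standing in for adjacent positions.
row = [0.50, 0.52, 0.53, 0.55, 0.54, 0.56]
res = spatial_residuals(row)
back = restore_from_spatial(res)
```

Apart from the first entry, the residuals in `res` are much smaller in magnitude than the weights themselves, which is what lets a variable-length code spend fewer bits on them.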

FIG. 6, on the other hand, is a diagram illustrating the temporal correlation of a feedforward neural network (hereinafter referred to as an FFNN), shown as one example of a deep learning model. FIG. 6(a) is a conceptual diagram of the model space of an FFNN. FIG. 6(b) is a graph showing the relationship between the update count n (horizontal axis) and the weight coefficient W_{x-1,n} at position x-1 (vertical axis), and FIG. 6(c) is a graph showing the relationship between the update count n (horizontal axis) and the weight coefficient W_{x,n} at position x, which is adjacent to position x-1 (vertical axis). As shown in FIGS. 6(b) and 6(c), when an FFNN is used as the deep learning model, the spatial correlation of the weight coefficients W between adjacent positions is low, but the weight coefficient at each position generally varies gently over time, so the weight coefficients W are highly correlated between consecutive times (for example, n and n-1); that is, the difference between the coefficient values at consecutive times tends to be concentrated around 0. Encoding this difference therefore raises the compression ratio.

Other networks that, like an FFNN, tend to have high temporal correlation include recurrent neural networks (RNN), restricted Boltzmann machines (RBM), autoencoders, and fully connected networks. For these as well, the compression ratio can be raised by exploiting the temporal correlation in the same way as for an FFNN.

As described above, the spatial and temporal correlation of the weight coefficients W of a deep learning model differ depending on the type of model. The learning system 1 therefore improves the compression ratio by switching the differential encoding method according to the type of model (for example, when the model is a CNN, the difference is calculated with a method that focuses on spatial correlation, and when the model is not a CNN, with a method that focuses on temporal correlation).

To implement this mechanism, for example, information specifying the type of deep learning model (hereinafter also referred to as model specifying information) is shared between each worker 100 and the parameter server 200 (that is, between the compression side and the decompression side), and each worker 100 and the parameter server 200 refers to the model specifying information to switch the method of calculating the difference when compressing or decompressing the weight coefficients W and the update amounts ΔW.
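The switching rule can be sketched as follows in Python. The string-valued model specifying information, the set `SPATIAL_MODELS`, and the function names are hypothetical; the specification does not define a format for this information, and quantization is again omitted to keep the focus on how the prediction is chosen.

```python
# Assumption: the model specifying information is a plain string such as "CNN" or "FFNN".
SPATIAL_MODELS = {"CNN"}


def select_prediction(model_type):
    """Pick the difference calculation from the shared model specifying information."""
    return "spatial" if model_type in SPATIAL_MODELS else "temporal"


def residuals(current, previous, model_type):
    """Differences to be encoded.

    current:  weights at update count n, ordered by position x
    previous: weights at update count n-1 for the same positions
    """
    if select_prediction(model_type) == "spatial":
        # Predict each position from its neighbour x-1 at the same update count.
        return [current[0]] + [current[i] - current[i - 1] for i in range(1, len(current))]
    # Predict each position from its own value at the previous update count.
    return [c - p for c, p in zip(current, previous)]


cnn_res = residuals([0.50, 0.52, 0.53], [0.49, 0.50, 0.52], "CNN")
ffnn_res = residuals([0.50, -0.30, 0.80], [0.49, -0.31, 0.79], "FFNN")
```

Because both sides read the same model specifying information, the decompression side can apply the matching inverse without any extra signalling per message.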

Next, the compression and decompression of the weight coefficients W and the update amounts ΔW by differential encoding will be described in detail with reference to FIGS. 7 to 9.

図７は、ディープラーニングのモデル空間の隣接する位置（ｘ−１とｘ）の夫々について連続する更新時刻（ｎ−１，ｎ）と重み係数Ｗとの関係を示す図である。図７（ａ）は、Wx-1,n-1とWx-1,nとの関係を示すグラフ、図７（ｂ）はWx,n-1とWx,nとの関係を示すグラフである。 FIG. 7 is a diagram showing the relationship between the continuous update time (n-1, n) and the weighting coefficient W for each of the adjacent positions (x-1 and x) in the deep learning model space. FIG. 7A is a graph showing the relationship between Wx-1, n-1 and Wx-1, n, and FIG. 7B is a graph showing the relationship between Wx, n-1 and Wx, n. ..

For example, when the deep learning model is a CNN, the learning system 1 exploits spatial correlation and obtains the difference for the weighting coefficient W by subtracting from W_{x,n} the value <W_{x-1,n}> obtained by compressing W_{x-1,n} once and then decompressing it, as in the following equation (hereinafter, a value that has been compressed once and then decompressed is written enclosed in <>).
Difference of weighting coefficient W = W_{x,n} - <W_{x-1,n}> ... (Equation 2)

The learning system 1 obtains the difference for the update amount ΔW of the weighting coefficient in the same way, as in the following equation.
Difference of update amount ΔW = ΔW_{x,n} - <ΔW_{x-1,n}> ... (Equation 3)

Note that the compressed-and-decompressed values (<W_{x-1,n}> and <ΔW_{x-1,n}>), rather than W_{x-1,n} and ΔW_{x-1,n} themselves, are subtracted so that the compression side and the decompression side use the same values once quantization error is taken into account. That is, because the decompression side receives values that contain quantization error, the compression side must also generate values that have been compressed and decompressed (values that reflect the quantization error) in order to match the decompression side. Hereinafter, a value that the compression side has compressed and decompressed so that it reflects the quantization error is also referred to as a "reference value".

When the model is not a CNN, the learning system 1 exploits temporal correlation and obtains the difference for the weighting coefficient W by subtracting from W_{x,n} the value <W_{x,n-1}> obtained by compressing W_{x,n-1} once and then decompressing it, as in the following equation.
Difference of weighting coefficient W = W_{x,n} - <W_{x,n-1}> ... (Equation 4)

The learning system 1 obtains the difference for the update amount ΔW of the weighting coefficient in the same way, as in the following equation.
Difference of update amount ΔW = ΔW_{x,n} - <ΔW_{x,n-1}> ... (Equation 5)
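
As an illustration only, the choice of reference in Equations 2 to 5 can be sketched in Python as follows. The function names, the boolean flag standing in for the model specific information, and the sample values are hypothetical and are not taken from the embodiment.

```python
def reference_key(uses_spatial_correlation, x, n):
    """Index (position, update) of the value used as the reference."""
    if uses_spatial_correlation:
        return (x - 1, n)   # Equations 2 and 3: adjacent position, same update
    return (x, n - 1)       # Equations 4 and 5: same position, previous update


def difference(current, reconstructed, uses_spatial_correlation, x, n):
    """Subtract the compressed-then-decompressed reference from the current value."""
    return current - reconstructed[reference_key(uses_spatial_correlation, x, n)]


# Hypothetical reconstructed values, keyed by (x, n).
reconstructed = {(0, 1): 0.5, (1, 0): 0.25}
spatial = difference(0.75, reconstructed, True, x=1, n=1)    # value at (1,1) minus value at (0,1)
temporal = difference(0.75, reconstructed, False, x=1, n=1)  # value at (1,1) minus value at (1,0)
print(spatial, temporal)
```

The same two functions apply unchanged to the update amount ΔW, since Equations 3 and 5 differ from Equations 2 and 4 only in the quantity being coded.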

Next, the learning system 1 generates quantized data from the difference obtained as described above, for example by using a quantization parameter q according to the following equation.
Quantized data = quotient of the difference divided by the quantization parameter q ... (Equation 6)

The value of the quantization parameter q may be the same for all weighting coefficients W, or it may differ for each weighting coefficient W. For example, the compression rate can be improved by setting the quantization parameter q to a large value for a weighting coefficient W that changes greatly per update and to a small value for a weighting coefficient W that changes little per update.
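
The reason for predicting from the reference value rather than from the unquantized value can be seen in a minimal Python sketch, which is not part of the embodiment. It assumes a single weighting coefficient predicted from its own previous update, an initial reference value of 0 shared by both sides, and rounding to the nearest integer as the way of taking the quotient in Equation 6; all names and sample values are hypothetical.

```python
import math


def quantize(difference, q):
    # Equation 6 with an assumed rounding rule (nearest integer).
    return math.floor(difference / q + 0.5)


def compress(values, q, use_reference_value=True):
    """Differentially encode successive values of one weighting coefficient."""
    reference = 0.0  # assumed initial value known to both sides
    symbols = []
    for value in values:
        symbol = quantize(value - reference, q)
        symbols.append(symbol)
        if use_reference_value:
            # Same value the decompression side will hold (includes quantization error).
            reference += symbol * q
        else:
            # Unquantized value, which the decompression side never sees.
            reference = value
    return symbols


def decompress(symbols, q):
    reference = 0.0
    restored = []
    for symbol in symbols:
        reference += symbol * q
        restored.append(reference)
    return restored


values = [0.3, 0.6, 0.9, 1.2]
q = 0.5
matched = decompress(compress(values, q, use_reference_value=True), q)
drifted = decompress(compress(values, q, use_reference_value=False), q)
matched_error = max(abs(v - r) for v, r in zip(values, matched))
drifted_error = max(abs(v - r) for v, r in zip(values, drifted))
print(matched_error <= q / 2, drifted_error > q / 2)
```

With the reference value, the error of every restored value stays within q/2; when the compression side predicts from the unquantized value instead, the two sides diverge and the error accumulates with each update.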

FIG. 8 is an example of a variable-length code table 800 (quantization table) that the learning system 1 refers to when encoding the quantized data. As shown in the figure, the variable-length code table 800 contains information that associates values 811 of the quantized data with codes 812. The learning system 1 converts the quantized data into variable-length codes by referring to the variable-length code table 800. In the variable-length code table 800, the compression rate can be improved by assigning shorter codes to smaller values of the quantization parameter q.

Decompression follows the reverse of the compression procedure described above. For example, for compressed data (W) or compressed data (ΔW), the learning system 1 restores the quantized data by referring to the variable-length code table 800, multiplies the quantized data by the quantization parameter q, and then adds the reference value, thereby decompressing the compressed data (W) or the compressed data (ΔW).
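
The table lookup and the reverse procedure can be sketched as follows. The contents of the variable-length code table 800 in FIG. 8 are not reproduced here, so the prefix-free table below is a hypothetical stand-in, as are the function names and sample values.

```python
# Hypothetical prefix-free table: quantized value -> code.
CODE_TABLE = {0: "0", 1: "100", -1: "101", 2: "1100", -2: "1101"}
DECODE_TABLE = {code: value for value, code in CODE_TABLE.items()}


def encode(quantized):
    return "".join(CODE_TABLE[value] for value in quantized)


def decode(bits):
    values, current = [], ""
    for bit in bits:
        current += bit
        if current in DECODE_TABLE:   # prefix-free, so the first match is the symbol
            values.append(DECODE_TABLE[current])
            current = ""
    if current:
        raise ValueError("incomplete code at end of input")
    return values


def decompress(bits, q, references):
    """Reverse of compression: table lookup, multiply by q, add the reference value."""
    return [value * q + ref for value, ref in zip(decode(bits), references)]


bits = encode([0, 1, -2, 0])
restored = decompress(bits, q=0.5, references=[1.0, 1.0, 1.0, 1.0])
print(bits, restored)
```

In this stand-in table the value 0 receives the shortest code, on the assumption that small differences are the most frequent when the correlation is high.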

Next, the functions of the parameter server 200 and the worker 100 will be described in detail.

FIG. 9 is a flowchart illustrating the processing performed by the parameter server 200 for one round of deep learning (hereinafter referred to as learning process S900). The learning process S900 is described below with reference to the figure.

First, the parameter server 200 receives compressed data of the update amount ΔW_{x,n} of the weighting coefficient (compressed data (ΔW)) from the worker 100 via the communication line 50 (S911).

Next, the parameter server 200 determines whether the deep learning model is a CNN, that is, whether the compressed data (ΔW) was encoded using spatial correlation or temporal correlation (S912). When the parameter server 200 determines that the model is a CNN (S912: YES), the process proceeds to S913. When the parameter server 200 determines that the model is not a CNN (S912: NO), the process proceeds to S914.

In S913, the parameter server 200 decompresses the compressed data (ΔW_{x,n} - <ΔW_{x-1,n}>) using the reference value <ΔW_{x-1,n}> and obtains <ΔW_{x,n}> (S915).

In S914, the parameter server 200 obtains the reference value <ΔW_{x,n-1}>, decompresses the compressed data (ΔW_{x,n} - <ΔW_{x,n-1}>), and obtains <ΔW_{x,n}> (S915).

Next, based on <ΔW_{x,n}>, the parameter server 200 updates the weighting coefficient W_{x,n} by Equation 1 or the like and obtains W_{x,n+1} (S916).

Next, the parameter server 200 compresses the obtained W_{x,n+1}. The parameter server 200 determines whether the deep learning model is a CNN, that is, whether to encode using spatial correlation or temporal correlation (S917), obtains the reference value <W_{x-1,n+1}> or the reference value <W_{x,n}> according to the determination result, obtains the difference by subtracting the obtained reference value from W_{x,n+1}, and compresses the difference to generate compressed data (W) (S918, S919).

Next, the parameter server 200 transmits the generated compressed data (W) to the worker 100 (S920).
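
A minimal sketch of one pass of the learning process S900 is given below for the temporal-correlation branch only. It is an illustration under stated assumptions: variable-length coding is omitted, the quotient is rounded to the nearest integer, and the update rule W <- W - learning_rate * <ΔW> is an assumed stand-in for Equation 1. The function name, the state dictionary, and the sample values are hypothetical.

```python
import math


def quantize(difference, q):
    return math.floor(difference / q + 0.5)  # assumed rounding rule


def parameter_server_step(delta_symbols, state, q, learning_rate):
    """One pass of S911 to S920 for the temporal-correlation branch."""
    # S914, S915: restore the update amounts from the stored reference values.
    state["delta_ref"] = [r + s * q for r, s in zip(state["delta_ref"], delta_symbols)]
    # S916: assumed update rule W <- W - learning_rate * <dW>.
    state["weights"] = [w - learning_rate * d
                        for w, d in zip(state["weights"], state["delta_ref"])]
    # S917 to S919: difference against the stored weight reference, then quantize.
    symbols = [quantize(w - r, q) for w, r in zip(state["weights"], state["weight_ref"])]
    # Keep the reconstructed weights as the next reference values.
    state["weight_ref"] = [r + s * q for r, s in zip(state["weight_ref"], symbols)]
    return symbols  # S920: sent to the worker after variable-length coding


state = {"weights": [1.0, -0.5], "weight_ref": [0.0, 0.0], "delta_ref": [0.0, 0.0]}
symbols = parameter_server_step([2, -1], state, q=0.25, learning_rate=1.0)
print(symbols, state["weights"])
```

For a CNN, the two reference lookups would use the reconstructed value at the adjacent position x-1 of the same update instead of the stored value of the previous update.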

FIG. 10 is a flowchart illustrating the processing performed by the worker 100 for one round of deep learning (hereinafter referred to as learning process S1000). The learning process S1000 is described below with reference to the figure.

First, the worker 100 receives compressed data of the weighting coefficient W_{x,n} (compressed data (W)) from the parameter server 200 via the communication line 50 (S1011). Next, the worker 100 determines whether the deep learning model is a CNN, that is, whether the compressed data was encoded using spatial correlation or temporal correlation (S1012). When the worker 100 determines that the model is a CNN (S1012: YES), the process proceeds to S1013. When the worker 100 determines that the model is not a CNN (S1012: NO), the process proceeds to S1014.

In S1013, the worker 100 obtains the reference value <W_{x-1,n}>, decompresses the compressed data (W_{x,n} - <W_{x-1,n}>), and obtains <W_{x,n}> (S1015).

In S1014, the worker 100 obtains the reference value <W_{x,n-1}>, decompresses the compressed data (W_{x,n} - <W_{x,n-1}>), and obtains <W_{x,n}> (S1015).

Next, based on the obtained <W_{x,n}>, the worker 100 updates the update amount ΔW_{x,n} of the weighting coefficient by stochastic gradient descent or the like and obtains ΔW_{x,n+1} (S1016).

Next, the worker 100 compresses the obtained ΔW_{x,n+1}. The worker 100 determines whether the target model is a CNN, that is, whether to encode using spatial correlation or temporal correlation (S1017), obtains the reference value <ΔW_{x-1,n+1}> or the reference value <ΔW_{x,n}> according to the determination result, obtains the difference by subtracting the obtained reference value from ΔW_{x,n+1}, and compresses the difference to generate compressed data (ΔW) (S1018, S1019).

Next, the worker 100 transmits the generated compressed data (ΔW) to the parameter server 200 (S1020).
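
The worker side of the exchange can be sketched in the same way. The sketch below covers only the temporal-correlation branch, omits variable-length coding, and assumes rounding to the nearest integer. The function toy_update is a hypothetical stand-in for stochastic gradient descent, and all other names and sample values are likewise assumptions.

```python
import math


def quantize(difference, q):
    return math.floor(difference / q + 0.5)  # assumed rounding rule


def worker_step(weight_symbols, state, q, compute_update):
    """One pass of S1011 to S1020 for the temporal-correlation branch."""
    # S1014, S1015: restore the weights from the stored reference values.
    state["weight_ref"] = [r + s * q for r, s in zip(state["weight_ref"], weight_symbols)]
    # S1016: compute the update amounts from the restored weights.
    deltas = compute_update(state["weight_ref"])
    # S1017 to S1019: difference against the stored update reference, then quantize.
    symbols = [quantize(d - r, q) for d, r in zip(deltas, state["delta_ref"])]
    # Keep the reconstructed update amounts as the next reference values.
    state["delta_ref"] = [r + s * q for r, s in zip(state["delta_ref"], symbols)]
    return symbols  # S1020: sent to the parameter server after variable-length coding


def toy_update(weights):
    # Hypothetical stand-in for stochastic gradient descent: gradient of sum(w^2).
    return [2.0 * w for w in weights]


state = {"weight_ref": [0.0, 0.0], "delta_ref": [0.0, 0.0]}
symbols = worker_step([2, -1], state, q=0.25, compute_update=toy_update)
print(symbols, state["delta_ref"])
```

Because the worker stores the reconstructed update amounts rather than the raw ones, its reference values match those that the parameter server derives from the same symbols.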

Next, the functions of the W compression unit 214 of the parameter server 200, the W decompression unit 112 of the worker 100, the ΔW compression unit 114 of the worker 100, and the ΔW decompression unit 212 of the parameter server 200 described above will be described in detail.

FIG. 11 is a processing block diagram illustrating details of the W compression unit 214 of the parameter server 200. As shown in the figure, the W compression unit 214 includes a reference value specifying unit 2141, a difference calculation unit 2142, a quantization unit 2143, a variable-length coding unit 2144, an inverse quantization unit 2145, and a decompression unit 2146. The W compression unit 214 also stores the model specific information described above, the quantization parameter q, and a reference value for each worker 100.

The reference value specifying unit 2141 specifies the type of model (spatial correlation or temporal correlation) based on the model specific information.

The difference calculation unit 2142 obtains the difference between the weighting coefficient W_{x,n} updated in S916 and the reference value.

The quantization unit 2143 generates quantized data by quantizing the difference using the quantization parameter q.

The variable-length coding unit 2144 compresses the quantized data by referring to the variable-length code table 800 and outputs compressed data (W).

The inverse quantization unit 2145 inversely quantizes the quantized data.

The decompression unit 2146 decompresses the inversely quantized weighting coefficient W_{x,n} using the reference value and stores the decompressed value as a new reference value.

FIG. 12 is a processing block diagram illustrating details of the W decompression unit 112 of the worker 100.

As shown in the figure, the W decompression unit 112 includes a variable-length restoration unit 1121, an inverse quantization unit 1122, a reference value specifying unit 1123, and a decompression unit 1124. The W decompression unit 112 also stores the model specific information, the quantization parameter q, and the reference value of the worker.

The variable-length restoration unit 1121 restores the compressed data (W) of the weighting coefficient W input from the W reception unit 111 by referring to the variable-length code table 800 and generates quantized data.

The inverse quantization unit 1122 inversely quantizes the quantized data using the quantization parameter q.

The reference value specifying unit 1123 specifies the type of model (spatial correlation or temporal correlation) based on the model specific information.

The decompression unit 1124 decompresses the weighting coefficient W_{x,n} by calculating the sum with the stored reference value, outputs the decompressed value, and updates the stored reference value.

FIG. 13 is a processing block diagram illustrating details of the ΔW compression unit 114 of the worker 100.

As shown in the figure, the ΔW compression unit 114 includes a reference value specifying unit 1141, a difference calculation unit 1142, a quantization unit 1143, a variable-length coding unit 1144, an inverse quantization unit 1145, and a decompression unit 1146. The ΔW compression unit 114 also stores the model specific information, the quantization parameter q, and the reference value of the worker 100.

The reference value specifying unit 1141 specifies the type of model (spatial correlation or temporal correlation) based on the model specific information.

The difference calculation unit 1142 obtains the difference between the update amount ΔW of the weighting coefficient calculated by the ΔW calculation unit 113 and the reference value.

The quantization unit 1143 generates quantized data by quantizing the difference using the quantization parameter q.

The variable-length coding unit 1144 compresses the quantized data by referring to the variable-length code table 800 and outputs compressed data (ΔW).

The inverse quantization unit 1145 inversely quantizes the quantized data.

The decompression unit 1146 adds the reference value to obtain the decompressed value of the update amount ΔW_{x,n} of the weighting coefficient and updates the reference value.

FIG. 14 is a processing block diagram illustrating details of the ΔW decompression unit 212 of the parameter server 200.

As shown in the figure, the ΔW decompression unit 212 includes a variable-length restoration unit 2121, an inverse quantization unit 2122, a reference value specifying unit 2123, and a decompression unit 2124. The ΔW decompression unit 212 also stores the model specific information, the quantization parameter q, and a reference value for each worker 100.

The variable-length restoration unit 2121 restores the input compressed data (ΔW) of the update amount ΔW_{x,n} of the weighting coefficient by referring to the variable-length code table 800 and generates quantized data.

The inverse quantization unit 2122 inversely quantizes the quantized data using the quantization parameter q.

The reference value specifying unit 2123 specifies the type of model (spatial correlation or temporal correlation) based on the model specific information.

The decompression unit 2124 decompresses ΔW_{x,n} by calculating the sum with the reference value stored for each worker 100, outputs the decompressed value, and updates the stored reference value for each worker 100.

FIG. 15(a) is a graph showing an example of how the value of a weighting coefficient W_{x,n} that fluctuates greatly during learning changes. As shown in the figure, in deep learning the fluctuation of the weighting coefficient W tends to become more gradual as learning progresses (as the number of updates n increases), so the compression rate gradually increases as learning progresses.

Therefore, as shown in FIG. 15(b), for example, increasing the number of workers 100 participating in the parallel distributed processing as learning progresses can be expected to improve processing performance and make effective use of computational resources. In this case, when a plurality of parameter servers 200 are used as described later (FIGS. 17 and 18), processing performance can be improved and computational resources used effectively by controlling the degree of parallelism so that the number of idle computing nodes and computing nodes with a low utilization rate is kept small.

Furthermore, as shown in FIG. 15(c), for example, reducing the value of the quantization parameter q as learning progresses so that quantization becomes finer makes it possible to gradually raise the learning accuracy while suppressing fluctuation in the compression rate.

As shown in FIG. 16, the control of the number of workers 100 participating in the parallel distributed processing and of the value of the quantization parameter q described above can be easily realized by providing, for example, an information processing device connected to the communication line 50 (for example, a device having the hardware shown in FIG. 2; hereinafter referred to as control device 300) with a function that controls the number of participating workers 100 (hereinafter referred to as worker number control unit 311) and a function that controls the value of the quantization parameter q (hereinafter referred to as quantization parameter control unit 312). In this case, for example, the worker number control unit 311 increases the number of workers 100 to the extent that the data transfer amount does not exceed the line speed of the communication line 50. The quantization parameter control unit 312, for example, determines the value of the quantization parameter q so that the change in the data transfer amount is small, and notifies the parameter server 200 and the workers 100 of the determined value as needed via the communication line 50 or the like.
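
One possible shape of these two control functions is sketched below in Python. The embodiment does not specify the control rules, so the policies here are assumptions made for illustration: the worker count is capped by dividing the line speed by a measured per-worker transfer amount, at least one worker is always kept, and q is reduced by a fixed factor only while the measured transfer amount is below a target. All names and parameters are hypothetical.

```python
def allowed_workers(line_speed, transfer_per_worker, available_workers):
    """Largest worker count whose total transfer stays within the line speed."""
    if transfer_per_worker <= 0:
        return available_workers
    fit = int(line_speed // transfer_per_worker)
    return max(1, min(available_workers, fit))  # assumption: keep at least one worker


def next_quantization_parameter(q, measured_transfer, target_transfer, q_min, factor=0.9):
    """Shrink q only while the measured transfer is below the target."""
    if measured_transfer < target_transfer:
        return max(q_min, q * factor)
    return q


print(allowed_workers(1000, 300, 8))
print(next_quantization_parameter(1.0, 50, 100, 0.1))
```

Holding q constant whenever the transfer amount reaches the target is one simple way to keep the change in data transfer small while still letting q fall as the differences shrink.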

FIGS. 17 and 18 show other examples of the hardware configuration of the learning system 1. As shown in these figures, there may be a plurality of parameter servers 200. FIG. 17 shows a case in which learning is performed on one type of training data using a plurality of (two in this example) parameter servers 200. In this case, for example, the average or weighted average of the weighting coefficients held by all the parameter servers 200 after learning is completed is used as the final learning result. FIG. 18, on the other hand, shows a case in which a plurality of (two in this example) sets of training data A and B are each learned by the parameter server 200 associated with that set. In this case, for example, the weighting coefficients held by each parameter server 200 after learning is completed are used as the learning result for the corresponding training data.
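
For the configuration of FIG. 17, the combination of results can be sketched as an element-wise average. The function name and the optional per-server weighting argument are hypothetical; how the weights of a weighted average would be chosen is not stated in the embodiment.

```python
def combine_weights(per_server_weights, server_weighting=None):
    """Element-wise average (or weighted average) of the weights of all servers."""
    count = len(per_server_weights)
    if server_weighting is None:
        server_weighting = [1.0] * count
    total = sum(server_weighting)
    return [
        sum(k * w for k, w in zip(server_weighting, column)) / total
        for column in zip(*per_server_weights)
    ]


plain = combine_weights([[1.0, 2.0], [3.0, 4.0]])
weighted = combine_weights([[1.0, 2.0], [3.0, 4.0]], server_weighting=[3.0, 1.0])
print(plain, weighted)
```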

Although the embodiments have been described in detail above, it goes without saying that the present invention is not limited to the above embodiments and can be modified in various ways without departing from its gist. For example, the above embodiments have been described in detail in order to explain the present invention in an easy-to-understand manner, and the invention is not necessarily limited to configurations that include all of the described elements. It is also possible to add other configurations to, delete, or replace part of the configuration of the above embodiments.

For example, in the method described above the reference value is determined according to the deep learning model (S917 in FIG. 9, S1017 in FIG. 10, and so on), but the worker 100 and the parameter server 200 may instead be provided with a function (model selection unit) that automatically selects the model that gives the higher compression rate and performs compression accordingly. In that case, the decompression side needs to identify which model was selected for compression. This can be done, for example, by providing the worker 100 and the parameter server 200 with a function (model specific information sharing unit) by which the compression side attaches information identifying the selected model (for example, a bit) to the compressed data (W) or compressed data (ΔW), and the decompression side refers to that information to identify which method was selected and decompresses the compressed data (W) or compressed data (ΔW) by the identified method. This makes data transfer over the communication line 50 more efficient.
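
A minimal sketch of such automatic selection follows. It assumes that the selection is made per value, that a smaller absolute quantized difference is a usable proxy for a shorter code, and that a one-bit flag (0 for spatial, 1 for temporal) accompanies each symbol; none of these details is fixed by the embodiment, and all names are hypothetical.

```python
import math

SPATIAL, TEMPORAL = 0, 1  # hypothetical one-bit flag values


def quantize(difference, q):
    return math.floor(difference / q + 0.5)  # assumed rounding rule


def choose_method(value, spatial_ref, temporal_ref, q):
    """Pick the reference that gives the smaller quantized difference."""
    candidates = {
        SPATIAL: quantize(value - spatial_ref, q),
        TEMPORAL: quantize(value - temporal_ref, q),
    }
    flag = min(candidates, key=lambda m: abs(candidates[m]))
    return flag, candidates[flag]


def restore(flag, symbol, spatial_ref, temporal_ref, q):
    """Decompression side: the flag tells which reference value to add."""
    reference = spatial_ref if flag == SPATIAL else temporal_ref
    return reference + symbol * q


flag, symbol = choose_method(1.0, spatial_ref=0.25, temporal_ref=0.875, q=0.125)
restored = restore(flag, symbol, spatial_ref=0.25, temporal_ref=0.875, q=0.125)
print(flag, symbol, restored)
```

A flag per value costs one extra bit each time, so an implementation might instead attach one flag per layer or per message; that granularity is a design choice left open here.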

In the method described above, the difference of the weighting coefficient W or of the update amount ΔW of the weighting coefficient is obtained between adjacent positions or between adjacent times (consecutive updates), but the difference may instead be obtained between positions that are farther apart or between times that are farther apart.

Each of the above configurations, functional units, processing units, processing means, and the like may be realized partly or wholly in hardware, for example by designing them as integrated circuits. Each of the above configurations, functions, and the like may also be realized in software by having a processor interpret and execute a program that realizes each function. Information such as programs, tables, and files that realize each function can be stored in a recording device such as a memory, a hard disk, or an SSD (Solid State Drive), or on a recording medium such as an IC card, an SD card, or a DVD.

In each of the above figures, the control lines and information lines shown are those considered necessary for explanation, and not all control lines and information lines in an implementation are necessarily shown. For example, in practice almost all of the components may be considered to be interconnected.

The arrangement of the various functional units, processing units, and databases of the various devices described above is only an example. The arrangement can be changed to an optimal one from the viewpoint of the performance, processing efficiency, communication efficiency, and the like of the hardware and software of the various devices.

The configuration of the databases described above (schema and the like) can be changed flexibly from the viewpoint of efficient use of resources, improved processing efficiency, improved access efficiency, improved search efficiency, and the like.

1 learning system, 10 information processing device, 50 communication line, 100 worker, 111 W reception unit, 112 W decompression unit, 1121 variable-length restoration unit, 1122 inverse quantization unit, 1123 reference value specifying unit, 1124 decompression unit, 113 ΔW calculation unit, 114 ΔW compression unit, 1141 reference value specifying unit, 1142 difference calculation unit, 1143 quantization unit, 1144 variable-length coding unit, 1145 inverse quantization unit, 1146 decompression unit, 115 ΔW transmission unit, 200 parameter server, 211 ΔW reception unit, 212 ΔW decompression unit, 2121 variable-length restoration unit, 2122 inverse quantization unit, 2123 reference value specifying unit, 2124 decompression unit, 213 W update unit, 214 W compression unit, 2141 reference value specifying unit, 2142 difference calculation unit, 2143 quantization unit, 2144 variable-length coding unit, 2145 inverse quantization unit, 2146 decompression unit, 215 W transmission unit, 300 control device, 311 worker number control unit, 312 quantization parameter control unit, 800 variable-length code table, S900 learning process, S1000 learning process

Claims

A machine learning system that includes a plurality of information processing devices that are communicably connected and performs machine learning by a parallel distribution method using the plurality of information processing devices, wherein
each information processing device comprises:
a transmission/reception unit that transmits and receives parameters used for learning a model in the machine learning to and from another information processing device;
a compression unit that differentially encodes and compresses the parameters when transmitting them to the other information processing device; and
a decompression unit that decompresses the parameters received from the other information processing device,
the machine learning system further comprises a control device that is communicably connected to each of the information processing devices, and
the control device has a quantization parameter control unit that reduces the value of the quantization parameter used by each of the information processing devices in the differential coding as the machine learning progresses.

The machine learning system according to claim 1.
The compression unit performs the difference coding by utilizing the spatial correlation or the temporal correlation of the model.
Machine learning system.

The machine learning system according to claim 2.
The compression unit performs the difference coding by calculating the difference of the parameters at the respective adjacent positions in the model space of the model at the same time.
Machine learning system.

The machine learning system according to claim 3.
The model is a convolutional neural network,
Machine learning system.

The machine learning system according to claim 2.
The compression unit performs the difference coding by calculating the difference of the parameters at each of the consecutive times at the same position in the model space of the model.
Machine learning system.

The machine learning system according to claim 5, wherein
the model is one of a feedforward neural network (FFNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), an autoencoder, and a fully connected network.

The machine learning system according to any one of claims 1 to 6, wherein
the parameters are at least one of a weighting coefficient of the model and an update amount of the weighting coefficient.

The machine learning system according to any one of claims 1 to 6, wherein
the control device includes a control unit that increases the number of information processing devices participating in the machine learning by the parallel distributed method as the machine learning progresses.
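
One way the control unit might decide how many devices participate is a simple ramp over the course of training. The claim does not specify how the number grows; the linear ramp, the name `active_device_count`, and the minimum and maximum counts below are assumptions.

```python
def active_device_count(iteration, total_iterations, n_min, n_max):
    """Hypothetical schedule: grow the number of participating devices linearly."""
    if total_iterations <= 1:
        return n_max
    progress = min(max(iteration / (total_iterations - 1), 0.0), 1.0)
    return n_min + round((n_max - n_min) * progress)


if __name__ == "__main__":
    for it in (0, 25, 50, 75, 99):
        print(it, active_device_count(it, 100, n_min=2, n_max=16))
```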

The machine learning system according to any one of claims 1 to 6, wherein
each information processing device further comprises:
a model selection unit that selects and trains a model that achieves a high compression rate under the differential encoding; and
a model-specifying information sharing unit that shares model-specifying information, which is information identifying the selected model, by transmitting it to and receiving it from the other information processing devices,
the compression unit performs the differential encoding of the parameters by a calculation method corresponding to the model-specifying information, and
the decompression unit decompresses the parameters by a calculation method corresponding to the model-specifying information.

The machine learning system according to claim 7, wherein
at least one of the information processing devices functions as a parameter server that updates the weighting coefficient,
at least one of the information processing devices functions as a worker that obtains the update amount of the weighting coefficient,
the information processing device functioning as the parameter server updates the weighting coefficient based on the update amount of the weighting coefficient sent from the worker, and
the information processing device functioning as the worker obtains the update amount of the weighting coefficient based on the weighting coefficient sent from the parameter server.
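
The exchange between the parameter server and the workers can be sketched as the loop below. Compression and network transfer are omitted, and several details are assumptions made only so the example runs: the toy quadratic loss, the learning rate, the averaging of update amounts across workers, and the class names `ParameterServer` and `Worker`.

```python
class ParameterServer:
    """Holds the weighting coefficients and applies update amounts from workers."""

    def __init__(self, weights):
        self.weights = list(weights)

    def apply_updates(self, updates_from_workers):
        # Assumption: update amounts from all workers are averaged before being applied.
        n = len(updates_from_workers)
        for i in range(len(self.weights)):
            self.weights[i] += sum(u[i] for u in updates_from_workers) / n


class Worker:
    """Computes an update amount from the weights it receives."""

    def __init__(self, target, learning_rate=0.5):
        self.target = target  # stand-in for this worker's training data
        self.learning_rate = learning_rate

    def compute_update(self, weights):
        # Gradient step on the toy loss 0.5 * (w - target)^2 for each coefficient.
        return [-self.learning_rate * (w - t) for w, t in zip(weights, self.target)]


if __name__ == "__main__":
    server = ParameterServer([0.0, 0.0])
    workers = [Worker([1.0, 2.0]), Worker([3.0, 4.0])]
    for _ in range(20):
        updates = [w.compute_update(server.weights) for w in workers]
        server.apply_updates(updates)
    print(server.weights)  # approaches the mean of the two targets
```

In the claimed system, both directions of this exchange (weights sent to workers, update amounts sent back) are candidates for the differential encoding described in the earlier claims.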

A machine learning method using a machine learning system that includes a plurality of information processing devices communicably connected to one another and that performs machine learning by a parallel distributed method using the plurality of information processing devices, the method comprising:
a step in which each information processing device transmits and receives parameters used for learning a model in the machine learning to and from another information processing device;
a step in which the information processing device, when transmitting the parameters to the other information processing device, compresses the parameters by differential encoding;
a step in which the information processing device decompresses the parameters received from the other information processing device; and
a step in which a control device communicably connected to each of the information processing devices reduces the value of the quantization parameter used by each of the information processing devices in the differential encoding as the machine learning progresses.

The machine learning method according to claim 11, wherein
the information processing device performs the differential encoding by utilizing a spatial correlation or a temporal correlation of the model.

The machine learning method according to claim 12, wherein
the information processing device performs the differential encoding by calculating the difference between the parameters at adjacent positions in the model space of the model at the same point in time.

The machine learning method according to claim 12, wherein
the information processing device performs the differential encoding by calculating the difference between the parameters at consecutive points in time at the same position in the model space of the model.

The machine learning method according to claim 11, wherein
the control device increases the number of information processing devices participating in the machine learning by the parallel distributed method as the machine learning progresses.