JP7403739B2

JP7403739B2 - Decision-making device and method for controlling the decision-making device

Info

Publication number: JP7403739B2
Application number: JP2020548524A
Authority: JP
Inventors: 成主金; 真士青野; 広幸秋永; 久島; 泰久内藤
Original assignee: Keio University
Current assignee: Keio University
Priority date: 2018-09-18
Filing date: 2019-09-17
Publication date: 2023-12-25
Anticipated expiration: 2039-09-17
Also published as: WO2020059723A1; JPWO2020059723A1

Description

The present invention relates to a decision-making device and a method for controlling the decision-making device.

Various proposals have been made regarding the multi-armed bandit problem (MAB), a representative example of a problem of searching for a solution that maximizes a stochastically obtained reward. The bandit problem is a hypothetical problem expressed using slot machines as an example of devices that provide rewards stochastically. In a system in which a player repeatedly selects one action from several different options (which slot machine to play) and is then given a reward with a probability determined by the selected action, the problem is to determine, in as short a time as possible, which slot machine ultimately maximizes the reward.
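
The setting described above can be sketched as a small simulator. This is only an illustrative sketch: the class name `BernoulliBandit`, its methods, and the use of a reward of 1 or 0 are assumptions made for this example and do not appear in the source.

```python
import random


class BernoulliBandit:
    """N slot machines; machine i pays a reward of 1 with probability probs[i].

    Hypothetical helper for illustration only.
    """

    def __init__(self, probs, seed=None):
        self.probs = list(probs)
        self.rng = random.Random(seed)

    def play(self, machine):
        """Play one machine once and return 1 (reward) or 0 (no reward)."""
        return 1 if self.rng.random() < self.probs[machine] else 0

    def best_machine(self):
        """Index of the machine with the highest reward probability."""
        return max(range(len(self.probs)), key=lambda i: self.probs[i])
```

A player only sees the values returned by `play`; `best_machine` is the hidden correct answer that the player is trying to identify in as few plays as possible.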

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2017-130024
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2018-124790

Non-Patent Document 1: S.-J. Kim, M. Aono, & M. Hara (2010): "Tug-of-war model for the two-bandit problem: Nonlocally-correlated parallel exploration via resource conservation", BioSystems 101 29-36.
Non-Patent Document 2: S.-J. Kim, M. Aono & E. Nameda (2014): "Efficient decision-making by volume-conserving physical object", New J. Phys. 17(2015)08302.
Non-Patent Document 3: M. Naruse, M. Berthel, A. Drezet, S. Huant, M. Aono, H. Hori, S.-J. Kim (2015): "Single-photon decision maker", Scientific Reports 5, 13253.
Non-Patent Document 4: M. Naruse, M. Berthel, A. Drezet, S. Huant, H. Hori, & S.-J. Kim (2016): "Single photon in hierarchical architecture for physical reinforcement learning: Photon intelligence", ACS Photonics 2016, 3, 2505-2514.
Non-Patent Document 5: S.-J. Kim, T. Tsuruoka, T. Hasegawa, M. Aono, K. Terabe, & M. Aono, (2016): "Decision maker based on atomic switches", AIMS Materials Science 3(1): 245-259.
Non-Patent Document 6: C. Lutz, T. Hasegawa, & T. Chikyow (2016): "Ag2S atomic switch-based 'tug of war' for decision making", Nanoscale, 2016, 8, 14031-14036.
Non-Patent Document 7: Takashi Tsuchiya, Tohru Tsuruoka, Song-Ju Kim, Kazuya Terabe, Masakazu Aono (2018): "Ionic decision-maker created as novel, solid-state devices", Sci. Adv. 2018;4: eaau2057.

Such bandit problems are applied in various fields, for example in presenting web advertisements (finding out which advertisement is most effective to present). Bandit problems include: the problem of determining which of N slot machines (options) whose rewards are binary ("present" or "absent") has the maximum reward probability (hereinafter, the binary bandit problem; see, for example, Non-Patent Documents 1 and 2); the problem of determining which of N slot machines (options) whose rewards are continuous values (real numbers) has the maximum expected reward (hereinafter, the continuous-valued bandit problem; see, for example, Non-Patent Document 4); and bandit problems representing situations in which two or more players compete (hereinafter, the competitive binary bandit problem; see, for example, Patent Document 5).

The ε-greedy algorithm, the SOFTMAX algorithm, and the like have long been known as methods for searching for solutions to these various bandit problems. In recent years, a solution method called the Tug-of-War (TOW) model has also been proposed.
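
For orientation, the two conventional selection rules named above can be sketched as follows. This is a generic textbook-style sketch, not code from the cited documents; the function names, the `estimates` list of per-machine reward estimates, and the `epsilon` and `temperature` parameters are assumptions for illustration.

```python
import math
import random


def epsilon_greedy(estimates, epsilon, rng):
    """With probability epsilon pick a random machine, otherwise the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda i: estimates[i])


def softmax_choice(estimates, temperature, rng):
    """Pick machine i with probability proportional to exp(estimates[i] / temperature)."""
    top = max(estimates)  # subtract the maximum for numerical stability
    weights = [math.exp((q - top) / temperature) for q in estimates]
    return rng.choices(range(len(estimates)), weights=weights, k=1)[0]
```

Both rules keep a separate estimate per machine, whereas the TOW model discussed below updates a single shared quantity.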

The TOW model is a dynamical algorithm inspired by the behavior of a single-celled amoeboid organism. It derives its decision-making function from the physical property that multiple terminal parts compete, as in a tug-of-war, for intracellular resources of finite volume. Owing to the volume conservation law, a physical constraint on the dynamics of this tug-of-war phenomenon, the TOW model performs better than other well-known algorithms such as the ε-greedy and SOFTMAX algorithms, and this has been confirmed by simulation comparisons with those conventional algorithms (Non-Patent Document 1).

Specifically, as shown in Fig. 5 of Non-Patent Document 1, the TOW model reaches a comparable correct answer rate in fewer trials than the ε-greedy or SOFTMAX algorithm.

Fig. 7 of Non-Patent Document 1 shows an example in which the reward probability distribution of each slot machine (option) is changed at around trial 3000. Specifically, for slot machines A and B, the reward probabilities are changed from (P_A = 0.40, P_B = 0.60) to (P_A = 0.60, P_B = 0.40).

Examining how the correct answer is reacquired after such a change, Fig. 7 of Non-Patent Document 1 shows that the curve recovers more steeply with the TOW model than with the SOFTMAX algorithm. The TOW model thus responds faster to the change in the reward probability distribution and is more adaptive to environmental change.
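
The changing environment described for Fig. 7 can be written as a schedule of reward probabilities. In this sketch the switch at exactly trial 3000 and the function names are assumptions; the source only states that the change occurs at around trial 3000.

```python
def reward_probabilities(trial, switch_at=3000):
    """Reward probabilities of machines A and B at a given trial index.

    switch_at is an assumed exact switching point for illustration.
    """
    if trial < switch_at:
        return {"A": 0.40, "B": 0.60}
    return {"A": 0.60, "B": 0.40}


def correct_machine(trial, switch_at=3000):
    """Machine with the higher reward probability at this trial."""
    probs = reward_probabilities(trial, switch_at)
    return max(probs, key=probs.get)
```

An adaptive method has to notice that `correct_machine` flips from B to A and shift its choices accordingly.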

Implementations of the TOW model include: a TOW model that uses the dynamics of a tug-of-war phenomenon arising from physical constraints such as the volume conservation law (hereinafter, the dynamical TOW model; see, for example, Non-Patent Document 1); a TOW model that assumes a virtual bar that does not stretch or shrink and moves the bar left or right by fixed amounts (hereinafter, the fixed-value bar TOW model; see, for example, Non-Patent Document 2); and a TOW model that extends the fixed-value bar TOW model to non-fixed displacements (hereinafter, the non-fixed-value bar TOW model; see, for example, Non-Patent Document 5 and Patent Document 3).

In recent years, decision-making devices that physically implement the TOW model have also been proposed. Examples include a method that implements the TOW model using the particle and wave nature of a single photon (see, for example, Non-Patent Documents 3 and 4) and a method that implements the TOW model using a device whose resistance value can be changed, for example by applying a pulse voltage.

The latter example uses an analog resistance change element, made by combining various metals, whose resistance value changes according to the applied voltage. The applied voltage is controlled according to the learning data provided to the learning model, and the controlled voltage is applied to an analog resistance change element with predetermined characteristics that serves as the learning model, thereby implementing the TOW model.

This makes it possible to build a decision-making device from relatively small hardware, without installing software on a large computer or the like.

Such analog resistance change elements include: an element using an atomic switch that exploits the migration, precipitation, and reionization of metal ions in a solid electrolyte sandwiched between electrodes under an applied electric field (hereinafter, the gapless atomic switch; see, for example, Patent Document 1 and Non-Patent Document 5); an element in which a solid electrolyte is sandwiched between electrodes and a filament is grown or shrunk by applying a voltage (hereinafter, the gap-type atomic switch; see, for example, Non-Patent Document 6); and an electrolyte element in which an electrolyte material layer capable of ion transport under an electric field is sandwiched between two or more electrodes (see, for example, Patent Document 2 and Non-Patent Document 7).

When a decision-making device using an analog resistance change element such as those in Patent Documents 1 and 2 and Non-Patent Documents 5, 6, and 7 is applied to the bandit problem, a TOW model that solves a bandit problem with two slot machines (options) can be implemented with just one of the analog resistance change elements described in those documents.

However, when the gapless atomic switch described in Patent Document 1 and Non-Patent Document 5 is used as the analog resistance change element, the device is relatively small and operates at low power, but its processing speed cannot be increased. In addition, no method has been established for solving large-scale problems with a larger number of slot machines (options).

No specific method has yet been shown for solving the bandit problem using the gap-type atomic switch of Non-Patent Document 6 as the analog resistance change element. Likewise, the electrolyte element described in Patent Document 2 and Non-Patent Document 7, in which an electrolyte material layer capable of ion transport under an electric field is sandwiched between two or more electrodes, is relatively small and operates at low power like the gapless atomic switch, but cannot be made faster, and no method has been found for solving large-scale problems with a larger number of slot machines (options).

Thus, with the conventional techniques, it is in practice difficult to handle four or more slot machines (options).

The method of implementing the TOW model using the particle and wave nature of a single photon, as described in Non-Patent Documents 3 and 4, operates relatively fast but is difficult to miniaturize or run at low power. Increasing the number of slot machines (options) makes the apparatus markedly larger, so the method is impractical.

The present invention was made in view of the above circumstances. One of its objects is to provide a decision-making device, and a method of controlling it, that is easy to miniaturize, operates with relatively low power consumption and at relatively high speed, and is applicable even when there are many options.

One aspect of the present invention that solves the problems of the conventional examples is a decision-making device that selects one option from among 2^n options (n is a natural number), each of which can yield a reward stochastically. The device comprises: a circuit unit in which 2^n − 1 circuit elements, each able to control the value of a predetermined physical quantity in the positive or negative direction relative to a reference value with stochastic fluctuation, form a virtual binary tree, and in which the 2^(n−1) circuit elements of the lowest layer are used for selecting one of the options; trial means for selecting one of the options and determining whether a reward is obtained; physical quantity control means for controlling, according to a predetermined rule and based on the result determined by the trial means, the value of the physical quantity of each circuit element involved in the option selected by the trial means; and option selection means for selecting the circuit elements of the virtual binary tree in order from the topmost circuit element, repeatedly choosing one of the pair of lower circuit elements according to whether the value of the physical quantity of the currently selected circuit element is positive relative to the reference value, and finally selecting one of the options according to whether the value of the physical quantity of the selected lowest-layer circuit element is positive relative to the reference value. When a reward is received for the option selected by the trial means, the physical quantity control means controls, according to the predetermined rule, the value of the physical quantity of each circuit element involved in that option so as to increase the probability that the option selection means selects that option.
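
The tree traversal described above can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the claimed circuit: the class and method names are hypothetical, the physical quantity is modeled as a plain number with reference value 0, the stochastic fluctuation is modeled as uniform noise, and the update rule (add 1 on reward, subtract `omega` on no reward, along the path to the tried option) is an assumed stand-in for the "predetermined rule", borrowed from the bar-type TOW model.

```python
import random


class HierarchicalSelector:
    """Virtual binary tree of 2**n - 1 nodes choosing among 2**n options.

    Node i has children 2*i + 1 and 2*i + 2 (array layout, an assumption).
    A node value above 0 selects the first branch, otherwise the second.
    """

    def __init__(self, n, omega=1.0, noise=0.5, seed=None):
        self.n = n
        self.values = [0.0] * (2 ** n - 1)
        self.omega = omega
        self.noise = noise
        self.rng = random.Random(seed)

    def select(self):
        """Walk from the root to the lowest layer and return an option index."""
        node, option = 0, 0
        for _ in range(self.n):
            observed = self.values[node] + self.rng.uniform(-self.noise, self.noise)
            bit = 0 if observed > 0 else 1
            option = (option << 1) | bit
            node = 2 * node + 1 + bit
        return option

    def path_to(self, option):
        """Nodes on the path to an option, with +1 (first branch) or -1 (second)."""
        path, node = [], 0
        for level in range(self.n):
            bit = (option >> (self.n - 1 - level)) & 1
            path.append((node, 1.0 if bit == 0 else -1.0))
            node = 2 * node + 1 + bit
        return path

    def update(self, option, rewarded):
        """Shift every node on the path toward (reward) or away from the option."""
        step = 1.0 if rewarded else -self.omega
        for node, direction in self.path_to(option):
            self.values[node] += direction * step
```

With n = 3 the tree has 7 nodes, 4 of them in the lowest layer, and each lowest-layer node decides between two of the 8 options, which matches the counts 2^n − 1 and 2^(n−1) given above.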

According to the present invention, the decision-making device can be miniaturized and operated with relatively low power consumption. The device can also operate at relatively high speed and is applicable even when there are many options.

A diagram showing a configuration example of a decision-making device according to one embodiment of the present invention.
A diagram for explaining the multi-armed bandit problem in one embodiment of the present invention.
A functional block diagram of the decision-making control unit of one embodiment of the present invention.
A diagram for explaining the TOW model of one embodiment of the present invention.
A diagram showing a cross-sectional image, obtained by transmission electron microscopy, of the analog resistance change element of one embodiment of the present invention.
A diagram showing a RAND circuit including the analog resistance change element of one embodiment of the present invention.
A diagram showing the resistance change characteristics under pulse voltage application of the analog resistance change element used in one embodiment of the present invention.
A diagram showing the nonlinear resistance change characteristics under pulse voltage application of the analog resistance change element used in one embodiment of the present invention.
A diagram for explaining the TOW model for the two-armed bandit problem in one embodiment of the present invention.
A diagram showing the processing when the selected side obtains a reward in the TOW model for the two-armed bandit problem in one embodiment of the present invention.
A diagram showing the processing when the selected side does not obtain a reward in the TOW model for the two-armed bandit problem in one embodiment of the present invention.
A diagram for explaining the TOW model for the multi-armed bandit problem in one embodiment of the present invention.
A flowchart of the processing of one example of the hierarchical TOW model for the multi-armed bandit problem in one embodiment of the present invention.
A diagram showing another example of the hierarchical TOW model for the multi-armed bandit problem in one embodiment of the present invention.
A flowchart of the processing of the other example of the hierarchical TOW model for the multi-armed bandit problem in one embodiment of the present invention.
A diagram showing one example of a performance evaluation of the hierarchical TOW model for the multi-armed bandit problem in one embodiment of the present invention.
A diagram showing another example of a performance evaluation of the hierarchical TOW model for the multi-armed bandit problem in one embodiment of the present invention.
A configuration block diagram of an example of the decision-making device according to an embodiment of the present invention.
A functional block diagram of an example of the decision-making device according to an embodiment of the present invention.
An explanatory diagram showing an example of the hierarchical structure of the virtual binary tree used by the decision-making device according to an embodiment of the present invention.
A flowchart showing an operation example of the decision-making device according to an embodiment of the present invention.
A flowchart showing another operation example of the decision-making device according to an embodiment of the present invention.
An explanatory diagram showing a configuration example of the circuit of the analog resistance change element of the decision-making device according to an embodiment of the present invention.

Embodiments of the decision-making method and device of the present invention are described below with reference to the drawings. The same reference numerals are used across different drawings for the same processes and configurations.

(System configuration)
The operation and processing of the decision-making device used in one embodiment of the present invention are described below. FIG. 1 is a configuration diagram of the entire decision-making device according to one embodiment. The device comprises: a decision-making control unit 101 that performs the central decision-making processing; an analog resistance change element (RAND) circuit 102 that, by changing its resistance value and thereby changing the reward, determines the stochastic reward-giving means, such as a slot machine, that yields the maximum reward; a RAND controller 103 that determines the required voltage from the inputs and outputs, applies it to the analog resistance change element, and controls the element to obtain the desired resistance value for the decision-making processing; and an external interface unit 104 that presents real-world tasks to the decision-making control unit 101 as decidable problems and applies the resulting solutions to those tasks.

The decision-making control unit 101 exchanges data with the RAND circuit 102 over a data line 111 and ultimately receives the solution. The RAND controller 103 controls the RAND circuit 102 over a control line 121 and exchanges data over a data line 112 to update the RAND. In this embodiment the processing is executed with the device configuration and division of functions described above, but it may also be implemented with other configurations and divisions of functions known in the art.

When the external interface unit 104 acquires a problem requiring a decision, or is given an optimization problem, it converts the problem, by any method known in the art, into a form on which the RAND of this embodiment can make decisions. In the future it may be possible to mount the unit in mobile terminals and robots.

FIG. 2 is a diagram for explaining the multi-armed bandit problem in one embodiment of the present invention. As shown in FIG. 2, the multi-armed bandit problem generally involves more than two slot machines, here slot machine A 201, slot machine B 202, slot machine C 203, and slot machine D 204, whose reward probabilities (for example, P_A = 0.7, P_B = 0.5, P_C = 0.3, and P_D = 0.1) are unknown, and asks how to distribute a fixed number of coins 205 among the machines in any proportion so as to maximize the reward. If the player knew that slot machine A has the highest reward probability (also called the win rate or winning probability), P_A = 0.7, betting coins only on slot machine A would maximize the reward. When the reward probabilities are unknown, however, the player must bet on all the machines and estimate the reward probabilities from the outcomes while continuing to play, so some coins must also be spent on machines with low reward probabilities. Even if slot machine A is eventually identified as having the highest reward probability, the total reward is lower than if only that machine had been played.
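
The cost of not knowing the reward probabilities can be made concrete with a short calculation. The probabilities are the example values from the text; the budget of 100 coins, the rule that one coin buys one play, and the counting of each reward as 1 are assumptions for illustration.

```python
# Example reward probabilities from the text.
PROBS = {"A": 0.7, "B": 0.5, "C": 0.3, "D": 0.1}


def expected_reward(allocation, probs=PROBS):
    """Expected number of rewarded plays when allocation[m] coins go to machine m."""
    return sum(coins * probs[machine] for machine, coins in allocation.items())


# All 100 assumed coins on the best machine versus an even spread.
all_on_a = expected_reward({"A": 100})
spread_evenly = expected_reward({"A": 25, "B": 25, "C": 25, "D": 25})
```

Under these assumptions, playing only machine A gives an expected 70 rewards, while an even spread gives an expected 40. A real player falls between the two, because some coins must be spent on exploration before machine A can be identified.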

Various algorithms for the multi-armed bandit problem have been studied, and the total reward obtained by an algorithm was commonly used to evaluate them. In recent years, however, the (average) correct answer rate has become the preferred evaluation index for decision-making methods (see Non-Patent Document 1). In fields such as robotics and AI, where decision-making methods like that of this embodiment are expected to be applied, the operating environment cannot be assumed to be steady as in a laboratory. In such natural environments it is important to decide as early as possible once a judgment of sufficient accuracy can be made. The correct answer rate is therefore also used to evaluate the performance of the decision-making system of this embodiment.
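
One plausible way to compute an average correct answer rate is sketched below: for each trial index, take the fraction of independent runs in which the selected machine was the one with the highest reward probability. The function name and the data layout (one list of selections per run) are assumptions; the source does not specify how the rate is computed.

```python
def correct_answer_rate(selections_per_run, best_machine):
    """Per-trial fraction of runs that selected the best machine.

    selections_per_run: list of runs, each a list of selected machines per trial.
    All runs are assumed to have the same number of trials.
    """
    n_runs = len(selections_per_run)
    n_trials = len(selections_per_run[0])
    return [
        sum(run[t] == best_machine for run in selections_per_run) / n_runs
        for t in range(n_trials)
    ]
```

A curve that rises toward 1 within few trials indicates fast, accurate decisions, which is the property emphasized in the paragraph above.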

FIG. 3 is a functional block diagram of the decision-making control unit according to one embodiment of the present invention. The decision-making control unit 101 controls the overall decision-making of this embodiment and comprises: an external interface management module 301 that communicates with the external interface unit 104, acquires problems in a form on which the TOW model can make decisions, and returns the results; and a resistance value management module 302 that mainly controls the RAND controller 103 to update the resistance value of the RAND circuit 102, an analog resistance change element called a Resistive Analog Neuromorphic Device (RAND). It also comprises an execution management module 303 that, after or while the resistance value is changed, inputs predetermined data to the RAND circuit 102 over the data line 111 for execution and receives the result. In this embodiment the decision-making processing uses these data lines and controls, but it may be implemented with any data interface or control method known in the art.

(TOW model of this embodiment)
FIG. 4 is a diagram for explaining the TOW model of one embodiment of the present invention. The TOW model has recently been shown to be effective for solving multi-armed bandit problems, particularly the two-armed bandit problem (see, for example, Non-Patent Document 1). As a different approach, this embodiment assumes a virtual bar 401 as shown in FIG. 4 and uses it to realize the TOW model.

As shown in FIG. 4, when a play of a slot machine yields a reward, the bar 401 is moved a predetermined distance (for example, 1) toward that slot machine (for example, slot machine A 411). In the opposite case, when no reward is obtained, the bar is moved a predetermined distance (for example, ω) toward the other slot machine (for example, slot machine B 412), as shown by bar 402. With P_A and P_B denoting the reward probabilities of slot machines A and B, ω is preferably set to ω = (P_A + P_B) / (2 − P_A − P_B), although other appropriate values may be used.
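
The bar update can be sketched for two machines as follows. This is an illustrative sketch only: the class name, the sign convention (positive positions favor machine A), and the random tie-break at position 0 are assumptions. The step of 1 on reward, the step of ω on no reward, and the formula for ω follow the text.

```python
import random


def omega_from_probs(p_a, p_b):
    """omega = (P_A + P_B) / (2 - P_A - P_B), the preferred setting in the text."""
    return (p_a + p_b) / (2.0 - p_a - p_b)


class TwoArmTowBar:
    """Virtual bar for two machines; position > 0 means machine A is played."""

    def __init__(self, omega, seed=None):
        self.omega = omega
        self.position = 0.0
        self.rng = random.Random(seed)

    def choose(self):
        """Pick the machine the bar currently leans toward (random on a tie)."""
        if self.position == 0.0:
            return self.rng.choice(["A", "B"])
        return "A" if self.position > 0 else "B"

    def update(self, machine, rewarded):
        """Move 1 toward the played machine on reward, omega away otherwise."""
        toward = 1.0 if machine == "A" else -1.0
        if rewarded:
            self.position += toward
        else:
            self.position -= toward * self.omega
```

Because a single position encodes the preference between both machines, every outcome updates the evaluation of A and B at once. Note that the preferred ω depends on P_A and P_B, which a player does not know in advance, so in a simulation like this it would have to be supplied or approximated.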

Using this virtual bar model, the slot machine (the stochastic reward-giving means) that ultimately yields the maximum reward is determined. Here, a stochastic reward-giving means is a device or other means that, on each play, outputs whether or not a reward is given, with a fixed reward probability; a slot machine is one example. That is, on each play the slot machine outputs one of two results with a predetermined reward probability: coins are obtained (reward) or not (no reward). Adopting a TOW model with such a virtual bar also allows the evaluations of the reward probabilities of the two slot machines to be updated simultaneously.

If the TOW model of this embodiment can be implemented in a suitable hardware element, a decision-making system that solves large-scale bandit problems can be realized with a device that is simpler, smaller, and lower in power consumption than conventional ones. Elements with various electrical characteristics have been studied. By assuming an element whose resistance changes with the applied voltage and mapping the position of the virtual bar to its resistance value, an element was obtained whose operation corresponds well to the behavior of the TOW model of this embodiment. Specifically, an analog resistance change element (RAND) is used that, under given application conditions such as pulse-voltage amplitude and pulse width, has a linear characteristic (the change in resistance for a given applied voltage stays within a fixed range), that takes on a nonlinear characteristic when the application conditions are changed, and that operates at high speed without hysteresis. The position of the virtual bar is mapped to the resistance value of the RAND. Just as the position of the virtual bar relative to the slot machines determines which one to play, the current resistance value determines which slot machine to play. If the play yields a reward, the resistance is changed by an amount corresponding to the distance the virtual bar moves in the direction that favors playing the same slot machine again; concretely, a voltage that produces this change is applied. If no reward is obtained, a voltage is likewise applied that changes the resistance by an amount corresponding to the distance the virtual bar moves in the direction that favors playing the other slot machine. In this way the TOW model of this embodiment can be implemented.

In general, a RAND has the advantage that, by controlling its laminated structure, it can be tuned so that its resistance changes linearly under practical pulse application conditions. This linear resistance change is well suited to implementing the TOW model faithfully. Real-world applications, on the other hand, often require the ability to adapt flexibly when reward probabilities change dynamically. This means that a decision that was correct before a change in reward probabilities must be easy to withdraw after the change. To that end, it is desirable that a resistance value that has swung to one side, representing the correct decision before the change, can be returned to R_θ as quickly as possible with few pulses after the change. In other words, a nonlinear (saturating) response is also called for, in which the resistance change per pulse becomes smaller as the resistance value becomes large (or small) and swings to one side. The RAND of this embodiment can also be made to exhibit such a saturating response by changing application conditions such as pulse-voltage amplitude and pulse width. Its reinforcement-learning capability can thus be optimized flexibly for the application.

Even though such a RAND is linear or nonlinear depending on the pulse-voltage application conditions, the response of the element to an applied voltage generally exhibits a fluctuation 420, as shown in FIG. 4. Because of this fluctuation 420, a tentative determination of reward or no reward is made, and a further voltage is applied on the basis of that determination to change the resistance value. Making the amount of resistance change differ according to whether a reward was obtained allows the device to accommodate given reward probabilities. That is, calling the voltage that decreases the resistance the set voltage and the voltage that increases (restores) the resistance the reset voltage, the set and reset voltages are given a predetermined ratio, for example ω as shown in FIG. 4, and by varying the value of ω an appropriate stochastic reward-giving means can be determined through repeated play.

(RAND structure)
As a RAND having the above characteristics, an element with the following structure was obtained. FIG. 5 is a diagram showing a cross-sectional transmission electron microscope image of an analog resistance change element according to an embodiment of the present invention. FIG. 6 is a diagram showing a circuit section including the analog resistance change element of this embodiment.

The RAND of this embodiment has a laminated structure in which a variable resistance layer made of oxide is sandwiched between electrodes made of nitride; TiN, TiON, TaN, TaON, and the like can be used for the electrodes. For the oxide variable resistance layer, not only TaO_x but any material known in this technical field that can stably hold different oxidation states, such as HfO_x, TiO_x, or AlO_x, can be used.

In this embodiment, as shown in FIG. 5, the RAND has a micropore structure and a laminated structure in which TaO_x layers of different resistance are sandwiched between TiN electrodes. That is, it has a layer structure in which TaO_x(1) 503 and TaO_x(2) 504 are sandwiched between an upper TiN electrode (TE) 501 and a lower TiN electrode (BE) 502. TaO_x(1) 503 and TaO_x(2) 504 have different resistance values, with TaO_x(1) 503 lower in resistance than TaO_x(2) 504.

As also shown in FIG. 5, to form a hole structure 100 nm in diameter, an SiO_2 layer 505 is formed on the TiN electrode (BE) 502 by chemical vapor deposition (CVD) followed by electron beam lithography. As can be seen from FIG. 5, the TiN electrode (BE) is completely exposed at the bottom of this micropore structure. Layers of TaO_x(2), TaO_x(1), and the TiN electrode (TE) are deposited by reactive sputtering on the exposed TiN electrode (BE) and patterned by photolithography and ion milling. Finally, to suppress electrical damage to the element caused by transient currents, a 3 kΩ load resistor 602 is connected in series with the RAND 601, as shown in FIG. 6.

When set and reset voltages are applied as a series of pulses to the RAND formed as described above, its resistance value can be changed as shown in FIG. 7. The RAND can be configured so that its resistance changes linearly with the number of applied pulses without showing hysteresis; that is, the change in resistance per pulse is nearly the same at any resistance state.

Specifically, as shown in FIG. 7, the resistance value can be changed at a constant rate by applying, a predetermined number of times, a set voltage 701 that decreases the resistance or a reset voltage 702 that increases (restores) it. For example, referring to FIG. 7, applying a −2.0 V, 200 ns pulse 702 as the reset voltage 50 times increases the resistance of the RAND linearly from about 8000 Ω to about 9000 Ω. Conversely, applying a +1.7 V, 200 ns pulse 701 as the set voltage 50 times decreases the resistance linearly from about 9000 Ω to about 8000 Ω.

Although the change differs from pulse to pulse, a single set or reset pulse changes the resistance by about 20 Ω on average. The amount of change does not depend on the resistance value at the time of application, and essentially the same characteristic is obtained even when 50 reset pulses and 50 set pulses are applied repeatedly.
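The figures quoted above (50 pulses, about 20 Ω per pulse, about 8000 Ω to about 9000 Ω) can be checked with a small numerical sketch. The function apply_pulses and the uniform ±5 Ω jitter are illustrative assumptions; the specification states only that the change varies from pulse to pulse around an average of about 20 Ω.

```python
import random


def apply_pulses(r_start, n_pulses, mean_step, jitter, rng):
    """Apply n_pulses pulses; each changes the resistance by mean_step
    plus zero-mean uniform noise in [-jitter, +jitter] (assumed model)."""
    r = r_start
    trace = [r]
    for _ in range(n_pulses):
        r += mean_step + rng.uniform(-jitter, jitter)
        trace.append(r)
    return trace


rng = random.Random(0)
# 50 reset pulses: resistance rises by roughly 50 * 20 = 1000 ohms.
up = apply_pulses(8000.0, 50, +20.0, 5.0, rng)
# 50 set pulses: resistance falls back by roughly the same amount.
down = apply_pulses(up[-1], 50, -20.0, 5.0, rng)
```

Because the step does not depend on the current resistance in this model, the trace is a straight line with small jitter, which is the linear, hysteresis-free behavior described for FIG. 7.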

In addition, the change in resistance produced by each applied pulse varies slightly from pulse to pulse and thus has a certain stochastic fluctuation. In the TOW model, such fluctuation helps detect changes quickly when the reward probability distribution changes dynamically, and so contributes to the ability to adapt flexibly to such situations. The RAND of this embodiment is therefore understood to be an element well suited to implementing the TOW model, in that its resistance responds linearly to pulse application while changing with stochastic fluctuation. Although this embodiment uses TiN for the electrodes and TaO_x for the oxide variable resistance layer, a RAND using other materials can be fabricated by methods known in this technical field.

On the other hand, by changing application conditions such as pulse-voltage amplitude and pulse width, the RAND can be given a nonlinear (saturating) characteristic 801 that differs from the linear one, as shown in FIG. 8. For example, it can be set so that the change in resistance per pulse becomes smaller in the range where the resistance is above a certain value or below a certain value. FIG. 8 is a diagram showing the nonlinear resistance change characteristic, under pulse-voltage application, of the analog resistance change element used in an embodiment of the present invention. A nonlinear characteristic such as that of FIG. 8 allows a saturated resistance value to return quickly to the reference value with few pulses, and contributes to the ability to adapt flexibly when reward probabilities change dynamically.

The RAND used in this embodiment and its manufacturing method have been described above, but the invention is not limited to them. Any element having the characteristics required in this embodiment can be used, and it can be manufactured by any manufacturing method known in this technical field for analog resistance change elements or other elements.

(Two-armed bandit problem of this embodiment)
An example of the processing by which a decision-making device incorporating the RAND of this embodiment makes a decision on the two-armed bandit problem is described below. FIG. 9 is a diagram for explaining the TOW model for the two-armed bandit problem of this embodiment. FIG. 10 is a diagram showing the processing in this TOW model when the selected side obtains a reward, and FIG. 11 is a diagram showing the processing when the selected side does not obtain a reward.

The two-armed bandit algorithm of the TOW model of this embodiment is as follows.
(1) If R < R_θ (R lies to the left of R_θ in FIGS. 9 to 11), play slot machine A.
If R > R_θ (R lies to the right of R_θ in FIGS. 9 to 11), play slot machine B.
(2) If a coin comes out (reward), apply a pulse voltage that changes the resistance value by +R_{+1} in the forward direction (the direction that favors playing the same slot machine next).
If no coin comes out (no reward), apply a pulse voltage that changes the resistance value by −R_ω in the forward direction (and therefore moves it in the reverse direction).
(3) Mix "fluctuation" into the resistance value corresponding to the applied pulse voltage.
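Steps (1) to (3) can be sketched in software as follows. This is an illustrative simulation under stated assumptions, not the device itself: the function name tow_two_armed, the reference value of 8500 Ω, the 20 Ω unit step, the Gaussian fluctuation, and the random tie-break at R = R_θ are all hypothetical choices that the specification does not prescribe.

```python
import random


def tow_two_armed(p_a, p_b, w, steps, r_theta=8500.0, unit=20.0,
                  noise=2.0, seed=0):
    """Simulate the two-armed TOW loop on a single resistance value R."""
    rng = random.Random(seed)
    r = r_theta
    plays = {"A": 0, "B": 0}
    for _ in range(steps):
        # (1) Choose the machine from the current resistance value.
        if r < r_theta:
            arm = "A"
        elif r > r_theta:
            arm = "B"
        else:
            arm = rng.choice("AB")  # tie-break: an assumption
        plays[arm] += 1
        rewarded = rng.random() < (p_a if arm == "A" else p_b)
        # (2) Forward direction: lower R favours A, higher R favours B.
        forward = -1.0 if arm == "A" else 1.0
        step = unit if rewarded else -w * unit
        # (3) Mix fluctuation into the resulting resistance value.
        r += forward * step + rng.gauss(0.0, noise)
    return r, plays


# Extreme case: A always pays, B never pays, and no fluctuation.
r_final, counts = tow_two_armed(1.0, 0.0, 1.0, 100, noise=0.0)
```

In this extreme case at most the first play can go to slot machine B, after which the resistance stays below R_θ and slot machine A is played every time.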

Specifically, in this embodiment, the TOW model in which a virtual bar 903 located between slot machine A 901 and slot machine B 902 is moved according to the rewards from the slot machines is mapped to the resistance value of the RAND.

That is, first, in correspondence with the position of the bar 903 that determines which slot machine to play, the slot machine to be played is determined by whether R is larger or smaller than the reference value R_θ.

In FIG. 9, R < R_θ, so slot machine A 901 is selected for play; a set or reset voltage is applied according to the result of the play, and the changed resistance value is obtained. The same processing is repeated on the obtained resistance value, and if the resistance finally falls to or below a predetermined value, the decision is made to select slot machine A 901. Conversely, if it reaches or exceeds a predetermined value, the decision is made to select slot machine B 902.

To describe the two-armed bandit processing of this embodiment more concretely: when slot machine A 901 is selected for play in the state shown in FIG. 9 and a coin comes out, a reward is deemed to have been obtained, and a voltage pulse is applied so that the resistance changes in the direction in which slot machine A 901 is played again next time, that is, in the direction of decreasing resistance.

More specifically, as shown in FIG. 10, a predetermined set voltage is applied, so that the resistance changes by R_{+1} toward slot machine A 901, that is, in the direction of decreasing resistance. Because the fluctuation 420 enters at this point, the resulting resistance stays within a certain range but is not the same from play to play. This makes an appropriate search for the solution possible.

When slot machine A 901 is selected for play in the state shown in FIG. 9 and no coin comes out, no reward is deemed to have been obtained, and a pulse voltage is applied so that the resistance changes in the direction in which slot machine B 902 is played next, that is, in the direction of increasing resistance. Specifically, as shown in FIG. 11, a predetermined reset voltage is applied, so that the resistance changes by R_{−ω} toward slot machine B 902, that is, in the direction of increasing resistance. Because the fluctuation 420 enters at this point, the resulting resistance stays within a certain range but is not the same from play to play. This makes an appropriate search for the solution possible. The slot machine to be played next is determined from the resulting value of R; in the state shown in FIG. 11, R has passed R_θ so that R > R_θ, and slot machine B 902 is therefore selected for the next play. The above is then repeated.

As described above, in the RAND-based TOW model for the two-armed bandit problem of this embodiment, the resistance value R changes as play proceeds. The resistance value R(t) at the t-th play is given by:
R(t) − R_θ = E_B(t) − E_A(t) + δ  (1)

Here, E_k(t) (k ∈ {A, B}) is the evaluation value of each slot machine, expressed as follows using the number of wins W_k(t) and the number of losses L_k(t). δ is a fluctuation that includes system noise and system error.
E_A(t) = W_A(t) − ωL_A(t)  (2)
E_B(t) = W_B(t) − ωL_B(t)  (3)

Here, ω is a control parameter, and the optimal control parameter ω_o is determined as follows.
ω_o = γ/(2 − γ)  (4)
γ = P_A + P_B  (5)
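Equations (1) to (5) translate directly into a few lines of Python. The function names below are illustrative assumptions; the arithmetic follows the equations as written, with δ defaulting to zero.

```python
def optimal_omega(p_a: float, p_b: float) -> float:
    """Equations (4) and (5): omega_o = gamma / (2 - gamma)."""
    gamma = p_a + p_b
    if gamma >= 2.0:
        raise ValueError("P_A + P_B must be less than 2")
    return gamma / (2.0 - gamma)


def evaluation(wins: int, losses: int, w: float) -> float:
    """Equations (2) and (3): E_k = W_k - omega * L_k."""
    return wins - w * losses


def displacement(w_a, l_a, w_b, l_b, w, delta=0.0):
    """Equation (1): R(t) - R_theta = E_B - E_A + delta."""
    return evaluation(w_b, l_b, w) - evaluation(w_a, l_a, w) + delta


# Slot machine A: 3 wins, 1 loss; slot machine B: 1 win, 2 losses; omega = 1.
d = displacement(3, 1, 1, 2, 1.0)
```

Here E_A = 2 and E_B = −1, so R(t) − R_θ = −3. The resistance lies below R_θ, and by step (1) of the algorithm slot machine A is played next.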

By using only the behavior of the resistance value in the TOW model of this embodiment, a higher rate of correct decisions can be obtained than with conventional AI algorithms. The optimal amount of fluctuation depends on the difference between the reward probabilities, but a relatively higher rate of correct decisions is expected for any amount of fluctuation.

(Multi-armed bandit of this embodiment)
The two-armed bandit of this embodiment has been described above. In actual applications the number of stochastic reward-giving means is not limited to two, so MAB problems with more than two arms must be handled. However, even if the two-armed bandit processing described above were extended to four arms, applying the tug-of-war model to four terminals is extremely difficult in an actual electronic device, and the device fabrication process would not be one that could be considered for industrial use. Implementation becomes even more difficult as the number of arms increases beyond four. In this embodiment, therefore, the 2^n-armed bandit problem is handled by the two types of hierarchical TOW model described below.

(First example of the multi-armed bandit of this embodiment)
FIG. 12 is a diagram showing an example of a hierarchical TOW model for the multi-armed bandit problem according to an embodiment of the present invention, and FIG. 13 is a flowchart of the processing of this example of the hierarchical TOW model. The MAB of this example uses a plurality of individual RANDs, and giving them a hierarchical structure allows the two-armed model to be extended easily. That is, as shown in FIG. 12, two RANDs are arranged below the RAND of the upper layer; which one to run is selected from the state of the upper layer and the states of the two lower RANDs, the selected RAND is run and its resistance value updated, and the resistance value of the upper-layer RAND is then updated.

FIG. 12 shows an example of a four-armed bandit, but the 2^n-armed bandit problem can readily be solved by applying the above processing to more layers. Because the description of this example focuses on how the processing of the individual RANDs is combined through the hierarchical structure, refer to the two-armed bandit description above for details such as the resistance-value update of each RAND.

The processing is described more concretely with reference to FIG. 13. First, from the positions of bar 1 in the upper layer and of bar 2 and bar 3 in the lower layer, it is determined, for example according to Table 1 below, whether to run the RAND of bar 2 or the RAND of bar 3 in the lower layer (step 1301). As Table 1 shows, when bar 1 is to the left of its reference position, one of the slot machines of bar 2 is selected on the basis of the position of bar 2.

Therefore, if bar 2 is to the left of its reference position, slot machine A is selected regardless of the position of bar 3, and if bar 2 is to the right, slot machine B is selected regardless of the position of bar 3. For example, when the RAND of bar 2 in the lower layer is selected and slot machine A connected to it is chosen, slot machine A is played (step 1302), and the resistance value of this RAND is updated on the basis of the result of the play on the selected slot machine (step 1303). The above processing is performed T times (step 1304), and the resistance value of the upper layer is updated according to the result (step 1305).
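The selection of step 1301 can be sketched as a small function. Table 1 is not reproduced in this text, so the right-hand branch (bar 1 to the right selects between slot machines C and D according to bar 3) is assumed by symmetry with the left-hand branch described above; the function name and the treatment of a bar exactly at its reference position are likewise assumptions.

```python
def select_machine(bar1: float, bar2: float, bar3: float) -> str:
    """Pick the slot machine to play from three bar displacements.

    Each argument is a bar's displacement from its reference position;
    negative means left. A displacement of exactly zero is treated as
    right here, which is an arbitrary choice.
    """
    if bar1 < 0:
        # Bar 1 left: decide with bar 2, ignoring bar 3.
        return "A" if bar2 < 0 else "B"
    # Bar 1 right (assumed symmetric): decide with bar 3, ignoring bar 2.
    return "C" if bar3 < 0 else "D"
```

For example, select_machine(-1.0, -2.0, 5.0) returns "A" whatever the position of bar 3, matching the behavior described for the case in which bar 1 and bar 2 are both to the left.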

As described above, the method of this example combines individual RANDs hierarchically to realize a hierarchical TOW model that solves the 2^n-armed bandit problem. In this example the lower-layer processing is executed T times because, although the method works well for slot machines with relatively easy reward probability distributions, certain reward distributions can be difficult to solve.

Specifically, let the reward probabilities of slot machines A, B, C, and D be P_A, P_B, P_C, and P_D. When (P_A, P_B, P_C, P_D) = (0.7, 0.5, 0.3, 0.1), P_A + P_B > P_C + P_D holds, so even without T repetitions, passing the lower-layer results directly to the upper layer correctly identifies slot machine A, which has the highest probability P_A. When (P_A, P_B, P_C, P_D) = (0.7, 0.5, 0.9, 0.1), however, P_A + P_B > P_C + P_D still holds (1.2 against 1.0), so bar 2 may be selected and the correct slot machine C may not be chosen. For this reason, updating of the upper layer is suspended until the lower layers have performed their update processing T times and have all settled. Using the method of this example therefore costs the time of T rounds of lower-layer processing.
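The two reward distributions discussed above can be compared with a short calculation. The sketch below works on the true probabilities rather than on the estimates a device would accumulate, and the function name sum_comparison is an illustrative assumption.

```python
def sum_comparison(p):
    """Compare the pairwise sums and report whether the favoured pair
    contains the machine with the highest reward probability."""
    left = p["A"] + p["B"]
    right = p["C"] + p["D"]
    favored = ("A", "B") if left > right else ("C", "D")
    best = max(p, key=p.get)
    return favored, best, best in favored


easy = {"A": 0.7, "B": 0.5, "C": 0.3, "D": 0.1}
hard = {"A": 0.7, "B": 0.5, "C": 0.9, "D": 0.1}

easy_result = sum_comparison(easy)  # sums 1.2 vs 0.4, best machine is A
hard_result = sum_comparison(hard)  # sums 1.2 vs 1.0, best machine is C
```

In the first case the favoured pair (A, B) contains the best machine. In the second case the favoured pair is still (A, B) while the best machine is C, which is the situation that motivates waiting for T lower-layer updates.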

As described above, in this example the position of bar 1 is determined by comparing two sums: the sum of the evaluation values obtained from bar 2 and the sum of those obtained from bar 3. That is, R changes as the above processing is repeated, and the slot machine with the maximum reward is determined from the value of R. The resistance value R(t) of bar 1 after t plays is given by:
R(t) − R_θ = (E_C(t) + E_D(t)) − (E_A(t) + E_B(t)) + δ  (6)
E_k(t) = W_k(t) − ω_(AB) L_k(t), where k ∈ {A, B}  (7)
E_k(t) = W_k(t) − ω_(CD) L_k(t), where k ∈ {C, D}  (8)

Here, ω_(AB) is the control parameter for k ∈ {A, B}, and ω_(CD) is the control parameter for k ∈ {C, D}.
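Equations (6) to (8) can be evaluated as in the following sketch. The function names and the dictionary layout (slot machine name mapped to a pair of wins and losses) are assumptions made for illustration.

```python
def evaluation(wins, losses, w):
    """Equations (7) and (8): E_k = W_k - omega * L_k."""
    return wins - w * losses


def top_bar_sum(counts, w_ab, w_cd, delta=0.0):
    """Equation (6): displacement of bar 1 from sums of evaluation values.

    counts maps each machine name to a (wins, losses) tuple.
    """
    e = {k: evaluation(*counts[k], w_ab if k in "AB" else w_cd)
         for k in "ABCD"}
    return (e["C"] + e["D"]) - (e["A"] + e["B"]) + delta


counts = {"A": (3, 1), "B": (2, 2), "C": (1, 3), "D": (0, 4)}
d = top_bar_sum(counts, 1.0, 1.0)
```

With both control parameters equal to 1, the evaluation values are 2, 0, −2, and −4, so R(t) − R_θ = (−6) − 2 = −8 and bar 1 lies on the side of bar 2 (slot machines A and B).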

(Second example of the multi-armed bandit of this embodiment)
FIG. 14 is a diagram showing another example of a hierarchical TOW model for the multi-armed bandit problem according to an embodiment of the present invention, and FIG. 15 is a flowchart of the processing of this other example. The MAB of this example uses a plurality of individual RANDs and, as in the first example described above, giving them a hierarchical structure allows the two-armed model to be extended easily. That is, as shown in FIG. 14, two RANDs are arranged below the RAND of the upper layer; which one to run is selected from the state of the upper layer and the states of the two lower RANDs, the selected RAND is run and its resistance value updated, and the resistance value of the upper-layer RAND is then updated. FIG. 14 shows an example of a four-armed bandit, but the 2^n-armed bandit problem can readily be solved by applying the above processing to more layers. In this example, the upper layer is updated tournament-style from the results of the lower-layer TOW models: the position of the upper-layer bar is determined by running the TOW model between the slot machine that has earned more reward under bar 2 and the one that has earned more under bar 3. A correct answer can thus be obtained even for the difficult reward distribution described above, without suspending the upper-layer processing until the lower-layer processing has been repeated T times.

The processing is described more concretely with reference to FIG. 15. As in the first example, the slot machine to be played is first selected from the positions of the upper-layer and lower-layer bars (step 1501). When, for example, the selected slot machine in the lower layer is played (step 1502), the resistance value of the corresponding RAND is updated on the basis of the result of that play (step 1503). The RAND of bar 2 is connected to slot machines A and B, and the RAND of bar 3 is connected to slot machines C and D. The position of bar 1, that is, its resistance value, is therefore determined and updated by the TOW model between whichever of slot machines A and B under bar 2 has the larger evaluation value and whichever of slot machines C and D under bar 3 has the larger evaluation value (step 1504).

As described above, in this embodiment the position of bar 1 is determined not by comparing the sums of the evaluation values, as in the first embodiment, but by the TOW model between the slot machine with the highest evaluation value among slot machines A and B of bar 2 and the slot machine with the highest evaluation value among slot machines C and D of bar 3. That is, the resistance value R(t) of bar 1 in this embodiment is as follows.
R(t) - R_θ = MAX(E_C(t), E_D(t)) - MAX(E_A(t), E_B(t)) + δ  (9)
E_k(t) = W_k(t) - ω_(AB) L_k(t), where k ∈ {A, B}  (10)
E_k(t) = W_k(t) - ω_(CD) L_k(t), where k ∈ {C, D}  (11)

Here, MAX is the maximum function, which returns the largest of its arguments. The evaluation value of the slot machine that has yielded more reward is therefore the one that is output.
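As a hedged illustration of equations (9) to (11), the following Python sketch computes the noise-free part of the bar 1 target from win and loss counts. The function names, the dictionary layout, and the example counts are assumptions made for illustration and do not appear in the specification; the stochastic term δ is omitted.

```python
def evaluation(n_win: int, n_loss: int, omega: float) -> float:
    """E_k(t) = W_k(t) - omega * L_k(t), as in equations (10) and (11)."""
    return n_win - omega * n_loss


def bar1_target(counts: dict, omega_ab: float, omega_cd: float) -> float:
    """Noise-free part of R(t) - R_theta in equation (9)."""
    e = {
        k: evaluation(n_win, n_loss, omega_ab if k in "AB" else omega_cd)
        for k, (n_win, n_loss) in counts.items()
    }
    # Tournament-style comparison: best of (C, D) against best of (A, B).
    return max(e["C"], e["D"]) - max(e["A"], e["B"])


# Hypothetical (wins, losses) per slot machine.
counts = {"A": (7, 3), "B": (5, 5), "C": (9, 1), "D": (1, 9)}
print(bar1_target(counts, omega_ab=1.0, omega_cd=1.0))  # 4.0
```

Only the best machine of each pair enters the comparison, so a weak partner such as D in this example does not pull down the value contributed by C.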

As described above, the method of this embodiment combines single RANDs hierarchically to realize a hierarchical TOW model that solves the 2^n-armed bandit problem. Because the improved hierarchical operating method of this embodiment presupposes only that the model is a hierarchical TOW model, it can be used not only with devices using RANDs but also with any TOW model realized using various other devices, and in systems using ordinary computers without such devices.

FIG. 16 is a diagram showing an example of a performance evaluation of the hierarchical TOW model for the multi-armed bandit problem according to an embodiment of the present invention, and FIG. 17 is a diagram showing another example of a performance evaluation of the hierarchical TOW model. Because the multi-branch model, described above as difficult to implement, cannot actually be measured, it was evaluated by simulation. FIG. 16 shows the performance evaluation for the reward probability distribution (P_A, P_B, P_C, P_D) = (0.7, 0.5, 0.3, 0.1), and FIG. 17 is a graph showing the performance evaluation for (P_A, P_B, P_C, P_D) = (0.7, 0.5, 0.9, 0.1). FIGS. 16 and 17 show that the first and second embodiments perform almost as well as the multi-branch model.

According to the embodiment of the present invention, RANDs are used whose response characteristic can be changed so that the resistance value varies linearly or nonlinearly with the number of pulse voltage applications. With such RANDs, 2^n - 1 RANDs, each implementing a TOW model, can be integrated on a large scale and operated hierarchically, and adaptability to MABs whose reward probabilities change dynamically is improved. Decision-making problems with 2^n stochastic reward-giving means, which arise in various real-world applications and are larger in scale than before, can thus be solved at high speed in real time with a small device and low power consumption.

Furthermore, by using the RAND of the present invention, the two-armed bandit problem can also be solved with a smaller device, lower power consumption, and faster real-time execution than before.

In addition, the TOW model enables a new, improved hierarchical operating method for solving the 2^n-armed bandit problem, so that the characteristics of the various devices that implement the TOW model can be used effectively.

[Another explanation]
The decision-making device 1 according to the embodiment of the present invention can also be described as follows. FIG. 18 is a block diagram showing the configuration of an example of the decision-making device 1 according to the embodiment of the present invention.

As illustrated in FIG. 18, the decision-making device 1 according to an example of the present embodiment includes a control unit 11, a storage unit 12, an analog resistance change element circuit unit 13, an input unit 14, and an output unit 15.

The control unit 11 is a program-controlled device such as a CPU and operates according to a program stored in the storage unit 12. In the present embodiment, the control unit 11 performs processing that selects one option from among a plurality of options. Each option is a means that stochastically provides a reward (some benefit), for example one of several types of advertisement to be placed on a web page, or one of several channels available for communication. For advertisements, the effect of the advertisement (for example, whether it was clicked) corresponds to the reward; for communication channels, the communication speed may be used as the information corresponding to the reward.

The control unit 11 performs this option-selection processing by controlling the analog resistance change element circuit unit 13 and acquiring, as a result of that control, information on the resistance value of each analog resistance change element included in the analog resistance change element circuit unit 13. The specific processing performed by the control unit 11 is described later.

The storage unit 12 holds the program executed by the control unit 11 and also operates as the work memory of the control unit 11. The input unit 14 is a keyboard, mouse, or the like; it receives the user's instruction operations and outputs their content to the control unit 11. The output unit 15 is a display or the like and presents information to the user according to instructions input from the control unit 11.

The analog resistance change element circuit unit 13 includes, for example, at least one RAND (Resistive Analog Neuro Device). As illustrated in FIG. 5, the RAND is a memory element in which the following are laminated in this order: a lower electrode (BE) layer 502; an insulating layer 505 having a micropore structure, formed on the lower electrode layer 502 with a partial opening; a variable resistance layer 500 that contacts the lower electrode layer 502 through the opening in the insulating layer 505; and an upper electrode (TE) layer 501 formed on the variable resistance layer 500.

Here, the variable resistance layer 500 includes a lower layer 504 and an upper layer 503 having mutually different resistance values. Specifically, the upper layer 503 is Ta₂O_(5-δ) and the lower layer 504 is TaO_x. Oxygen ions are exchanged between the upper layer 503 and the lower layer 504 to cause an oxidation or reduction reaction in at least part of the upper layer 503. The resistance value can thus be changed by forming a TaO_x channel in part of the upper layer 503, or by returning the upper layer 503 to Ta₂O_(5-δ) so as to narrow (or eliminate) this channel.

This change in resistance value can be produced by applying a positive pulse (a pulsed voltage signal that places the upper electrode layer 501 at a higher potential than the lower electrode layer 502; this potential is called the reset voltage) or a negative pulse (a pulsed voltage signal that places the upper electrode layer 501 at a lower potential than the lower electrode layer 502; this potential is called the set voltage). In the above example, which uses tantalum oxide (TaO_x) for the lower layer 504, the resistance value increases when a positive pulse is applied and decreases when a negative pulse is applied. The cell operation of such elements is widely known, so further explanation is omitted here.

Note that the upper electrode layer 501 and the lower electrode layer 502 can each be formed of TiN, and the insulating layer 505 may be SiO₂ formed by chemical vapor deposition (CVD). The opening (whose diameter may be about 100 nm) can be formed by electron beam lithography. The variable resistance layer 500 can be formed on top of this by a method such as reactive sputtering.

Thus, in an example of the present embodiment, the analog resistance change element circuit unit 13 can be configured using RANDs having a laminated structure in which a variable resistance layer made of an oxide is sandwiched between electrodes made of a nitride. In addition to TiN, materials such as TiON, TaN, and TaON can be used for the upper electrode layer 501 and the lower electrode layer 502. For the oxide variable resistance layer 500, not only TaO_x but also HfO_x, TiO_x, AlO_x, or any other material that has stable, mutually different oxidation states can be used.

A load resistor of about 3 kΩ may also be connected in series with each RAND to suppress electrical damage to the element caused by transient current, and the positive or negative pulse may be applied through this load resistor.

Because the analog resistance change element circuit unit 13 according to an example of the present embodiment uses at least one element of this configuration, the resistance value of each element (RAND) can be changed as illustrated in FIG. 7 when a set signal or a reset signal is applied to the element in pulses a plurality of times. In FIG. 7, the reset signal applies a reset voltage of -2.0 V for 200 ns, and the set signal applies a set voltage of 1.7 V for 200 ns. As illustrated in FIG. 7, the RAND is controlled so that its resistance value increases monotonically when the reset signal is applied and decreases monotonically when the set signal is applied.

At this time, within a certain range of resistance values, the resistance value of the RAND changes substantially linearly with the number of pulse applications, without exhibiting hysteresis. In other words, the change in resistance caused by a single pulse is substantially the same at any resistance state.

That is, as shown in FIG. 4, when a -2.0 V pulse voltage 702 with a width of 200 ns is applied 50 times as the reset voltage, the resistance value of the RAND increases linearly from about 8000 Ω to about 9000 Ω.

On the other hand, when a +1.7 V pulse voltage 701 with a width of 200 ns is applied 50 times as the set voltage, the resistance value of the RAND decreases linearly from about 9000 Ω to about 8000 Ω. The change in resistance per pulse differs each time, but the resistance changes by about 20 Ω on average, and the amount of change is substantially constant regardless of the resistance value at the time of application. In other words, the resistance value of the RAND changes linearly with the number of applied pulses. Therefore, to change the resistance value of the RAND with +ω Ω as the target, the reset-voltage pulse is applied ω/20 times (for example, 50 times if ω = 1000). Similarly, to change the resistance value with -ω Ω as the target, the set-voltage pulse is applied ω/20 times (for example, 50 times if ω = 1000).
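The pulse-count arithmetic in this passage can be sketched in Python as follows. The function name and the default of 20 Ω per pulse (the average stated in the text) are illustrative assumptions; a real controller would presumably verify the resulting resistance rather than rely on the average.

```python
def pulses_for_target(delta_ohm: float, ohm_per_pulse: float = 20.0):
    """Return (signal, count) for a signed target resistance change.

    A positive target uses reset pulses (resistance rises); a negative
    target uses set pulses (resistance falls).
    """
    signal = "reset" if delta_ohm > 0 else "set"
    count = round(abs(delta_ohm) / ohm_per_pulse)
    return signal, count


print(pulses_for_target(1000))   # ('reset', 50)
print(pulses_for_target(-1000))  # ('set', 50)
```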

As noted above, the change in resistance each time a pulse voltage is applied is not constant but is accompanied by a small fluctuation (stochastic fluctuation) each time. This characteristic provides an advantageous effect in the present embodiment.

In addition, outside the range of resistance values in which the resistance changes linearly with the application of the pulsed signal, the RAND has a range of resistance values in which the resistance changes nonlinearly with the application of the pulsed signal (in one example, so that the resistance saturates) (FIG. 8).
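For experimenting with these properties in software, a RAND can be approximated by a small stand-in model. The sketch below is an assumption-laden toy, not a device model from the specification: the class name, the Gaussian form of the per-pulse fluctuation, the 5 Ω spread, and the hard clamp that stands in for saturation are all hypothetical. Only the average step of about 20 Ω and the range of about 8000 Ω to 9000 Ω come from the text.

```python
import random


class SimulatedRand:
    """Software stand-in for one RAND; all default values are illustrative."""

    def __init__(self, resistance=8500.0, step=20.0, jitter=5.0,
                 r_min=8000.0, r_max=9000.0, seed=None):
        self.resistance = resistance
        self.step = step        # average change per pulse, in ohms
        self.jitter = jitter    # assumed spread of the per-pulse change
        self.r_min = r_min      # assumed lower edge of the linear range
        self.r_max = r_max      # assumed upper edge of the linear range
        self._rng = random.Random(seed)

    def pulse(self, signal):
        """Apply one "set" or "reset" pulse and return the new resistance."""
        if signal not in ("set", "reset"):
            raise ValueError("signal must be 'set' or 'reset'")
        delta = self._rng.gauss(self.step, self.jitter)
        if signal == "set":
            delta = -delta      # set pulses lower the resistance
        # Clamping is a crude stand-in for the saturating, nonlinear range.
        self.resistance = min(self.r_max,
                              max(self.r_min, self.resistance + delta))
        return self.resistance


device = SimulatedRand(seed=0)
for _ in range(10):
    device.pulse("reset")
print(device.resistance)
```

The per-pulse jitter accumulates into the term written as δ in the resistance equations of this description.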

In the example of this embodiment, a RAND using TiN for the electrodes and TaO_x for the oxide variable resistance layer has been described. However, the materials constituting the RAND are not limited to these, and the RAND can be obtained using various widely known materials.

As described, the example of this embodiment can use a fine circuit structure such as the RAND, so miniaturization is easy and power consumption can be kept low. Because the circuit has a simple configuration, it can operate faster than configurations that arrange multi-stage logic circuits, and the (learned) optimal selection can be made substantially in real time.

Moreover, the element is not limited to a RAND. Any element can be used as long as it has the characteristic required in this embodiment, namely that when control is performed to increase or decrease a physical quantity such as a resistance value, the change in that quantity is constant on average while the change at each step has a stochastic fluctuation. Such an element can be manufactured by any of the methods known in this technical field for manufacturing analog resistance change elements and other elements.

Next, the operation of the control unit 11 is described. In the present embodiment, by executing the program stored in the storage unit 12, the control unit 11 realizes a functional configuration that includes a decision-making processing unit 21, a result determination unit 22, and a resistance value control unit 23, as illustrated in FIG. 19.

Here, the decision-making processing unit 21 refers to the resistance value of the RAND provided in the analog resistance change element circuit unit 13, determines whether that resistance value is larger or smaller than a predetermined reference value, and outputs a decision result according to that determination. The resistance value may be obtained by applying a voltage to the RAND from a constant-voltage power supply, measuring the current flowing through the RAND, and dividing the supply voltage V by the measured current. When there are a plurality of RANDs, a switch is placed in series with each RAND, as schematically shown in FIG. 18. The RANDs are selected one at a time: the switch connected to the selected RAND is turned on (connected) and the other switches are turned off, a voltage is applied from the constant-voltage power supply to the RAND whose switch is on, and the current flowing through that RAND is measured. In this way, the resistance value of each RAND can be obtained in turn.
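The sequential readout described here amounts to applying R = V / I once per RAND. A minimal Python sketch follows; `measure_current` is a hypothetical callback standing in for the switch control and current measurement hardware, and the 0.1 V read voltage and the resistance values in the usage lines are made up for illustration.

```python
def read_resistances(supply_voltage, measure_current, n_devices):
    """Read each RAND in turn and return its resistance R = V / I.

    measure_current(index) is assumed to close the switch of RAND `index`,
    open all other switches, and return the measured current in amperes.
    """
    resistances = []
    for index in range(n_devices):
        current = measure_current(index)
        resistances.append(supply_voltage / current)
    return resistances


# Stand-in for a measurement bench with three RANDs (hypothetical values).
true_ohms = [8000.0, 8500.0, 9000.0]
readings = read_resistances(0.1, lambda i: 0.1 / true_ohms[i], 3)
print(readings)
```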

The simplest example of the decision-making processing unit 21 is as follows. When the analog resistance change element circuit unit 13 includes one RAND and the resistance value of this RAND is used to decide which of advertisements A and B to place on a web page, the decision-making processing unit 21 provides a web page carrying advertisement A if the referenced resistance value is larger than the reference value, and a web page carrying advertisement B otherwise.

The result determination unit 22 determines whether there was a result (reward) corresponding to the decision output by the decision-making processing unit 21, and outputs that determination. Specifically, when the decision-making processing unit 21 decides which advertisement to place as described above, the result determination unit 22 treats whether the advertisement was clicked as the presence or absence of a reward, and outputs "result" if it was clicked and "no result" otherwise.

The resistance value control unit 23 executes learning processing by controlling the resistance value of the RAND of the analog resistance change element circuit unit 13 based on the output of the decision-making processing unit 21 and the result information output by the result determination unit 22. Specifically, in the above example, the resistance value control unit 23 operates as follows.
(1) When advertisement A was selected and "result" was output, the set signal is applied to the RAND a predetermined number of times to reduce the resistance value (for example, with -r as the target), which raises the probability that advertisement A is selected.
(2) When advertisement A was selected and "no result" was output, the reset signal is applied to the RAND a predetermined number of times to increase the resistance value (for example, with +ω as the target), which lowers the probability that advertisement A is selected.
(3) When advertisement B was selected and "result" was output, the reset signal is applied to the RAND a predetermined number of times to increase the resistance value (for example, with +r as the target), which raises the probability that advertisement B is selected.
(4) When advertisement B was selected and "no result" was output, the set signal is applied to the RAND a predetermined number of times to reduce the resistance value (for example, with -ω as the target), which lowers the probability that advertisement B is selected.

Note that the number of signal applications need not be the same in (1) to (4) above; for example, the number of applications in (2) may be smaller than the number in (1).
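Rules (1) to (4) can be condensed into a single signed target change, sketched below in Python. The function name and argument layout are assumptions for illustration; r and ω are the target magnitudes named in the rules.

```python
def target_change(choice: str, rewarded: bool, r: float, omega: float) -> float:
    """Signed target resistance change for rules (1) to (4).

    A negative value means set pulses (resistance falls); a positive value
    means reset pulses (resistance rises).
    """
    if choice not in ("A", "B"):
        raise ValueError("choice must be 'A' or 'B'")
    magnitude = r if rewarded else omega
    # Rules (1) and (4) lower the resistance; rules (2) and (3) raise it.
    lowers = (choice == "A") == rewarded
    return -magnitude if lowers else magnitude


for choice in ("A", "B"):
    for rewarded in (True, False):
        print(choice, rewarded, target_change(choice, rewarded, r=20.0, omega=10.0))
```

Using a different magnitude for rewarded and unrewarded trials corresponds to the remark that the number of pulses need not be the same in the four cases.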

In this example, the resistance value R(t) after t trials can be expressed, with the initial resistance value written as R_θ, as
R(t) - R_θ = E_B(t) - E_A(t) + δ
Here, E_A(t) and E_B(t) are the sums of the target values accumulated when A and B, respectively, were selected during the trials:
E_A(t) = r·W_A(t) - ω·L_A(t)
E_B(t) = r·W_B(t) - ω·L_B(t)
and δ represents (the sum of) the stochastic fluctuations that are characteristic of the RAND. W_k (k = A or B) is the number of trials in which k was selected and it was determined that there was a result, and L_k (k = A or B) is the number of trials in which k was selected and it was determined that there was no result. ω is a control parameter and is determined experimentally. For example, using the probability P_k that option k yields a result,
P_k = W_k / (W_k + L_k)
and setting γ = P_A + P_B, ω may be determined as
ω = γ / (2 - γ)
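As a worked illustration of these formulas, the Python sketch below computes ω from empirical counts and the noise-free part of R(t) - R_θ. The function names and example counts are hypothetical. The sketch assumes each option has been tried at least once and that γ is below 2, since otherwise the divisions are undefined; δ is omitted.

```python
def omega_from_counts(w_a: int, l_a: int, w_b: int, l_b: int) -> float:
    """omega = gamma / (2 - gamma), with gamma = P_A + P_B."""
    p_a = w_a / (w_a + l_a)
    p_b = w_b / (w_b + l_b)
    gamma = p_a + p_b
    return gamma / (2 - gamma)


def resistance_offset(w_a, l_a, w_b, l_b, r, omega):
    """Noise-free part of R(t) - R_theta = E_B(t) - E_A(t)."""
    e_a = r * w_a - omega * l_a
    e_b = r * w_b - omega * l_b
    return e_b - e_a


# Hypothetical counts: A rewarded 3 times out of 4, B rewarded 1 time out of 4.
omega = omega_from_counts(3, 1, 1, 3)
print(omega)                                       # 1.0
print(resistance_offset(3, 1, 1, 3, 1.0, omega))   # -4.0
```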

Furthermore, in an example of the present embodiment, the options subject to decision-making are not limited to the two options A and B of the above example; the embodiment can also be applied when there are many options. For example, when there are 2^n options A, B, C, D, ... and a decision is made to select one of them, a plurality of RANDs are used to form a virtually hierarchical decision-making processing circuit. Specifically, this hierarchical decision-making processing circuit uses 2^n - 1 RANDs arranged in a virtual binary-tree hierarchy, as illustrated in FIG. 20. Electrically, the RANDs may be arranged in parallel as illustrated in FIG. 18, but in the processing of the control unit 11 they are treated as being virtually arranged in a hierarchy as illustrated in FIG. 20, and the following processing is performed. Here, the 2^(n-1) RANDs in the lowest layer are the ones actually used in the processing of selecting one of the options.

That is, in the example of FIG. 20, each of the 2^(n-1) RANDs set in the lowest layer (the Nth layer), on the terminal side, is assigned to the selection within one pair of options (A, B), (C, D), ... chosen without duplication from the 2^n options A, B, C, D, .... Each of the 2^(n-2) RANDs virtually arranged directly above them (in the (N-1)th layer) is set to select one of a pair of RANDs chosen without duplication from the RANDs of the layer below (the Nth layer). For example, one of the RANDs in the (N-1)th layer (called the first RAND for convenience) is involved in choosing between the Nth-layer RAND corresponding to options A and B (called the second RAND for convenience) and the Nth-layer RAND corresponding to options C and D (called the third RAND for convenience). The relationship between the first RAND and the second RAND, and the relationship between the first RAND and the third RAND, which lie in mutually adjacent layers, are hereinafter called hierarchical association relationships. RANDs are set in the same manner up to the highest layer (the first layer).

For example, when n = 2 (four options), the decision-making processing unit 21 refers to the resistance value R1 of the first RAND, which is set as being in the highest layer (the first layer). If R1 is larger than a predetermined reference value, it selects the second RAND in the lower layer, which decides between options A and B; otherwise, it selects the third RAND in the lower layer, which decides between options C and D.

When option A or B is to be selected, the decision-making processing unit 21 refers to the resistance value R2 of the second RAND, which is set as being in the lower layer of the virtual hierarchy (the second layer, which is the lowest layer here), and selects option A if R2 is larger than a predetermined reference value. If R2 is not larger than the reference value, it selects option B.

Similarly, when option C or D is to be selected, the decision-making processing unit 21 refers to the resistance value R3 of the third RAND, which is set as being in the same lower layer (the second layer) of the virtual hierarchy as the second RAND, and selects option C if R3 is larger than a predetermined reference value. If R3 is not larger than the reference value, it selects option D.

That is, the selection of option A or B involves the first RAND (the highest RAND, which is the root) and the second RAND, which is a terminal. The selection of option C or D involves the first RAND (the highest RAND, which is the root) and the third RAND, which is a terminal. A series of RANDs from the root to a terminal that is involved in the selection within a given pair of options is here called a selection chain.
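The descent through the virtual binary tree can be sketched in Python as below. The array layout (root at index 0, children of node i at 2i+1 and 2i+2) and the single shared reference value are assumptions chosen for illustration; the description only requires a virtual binary-tree arrangement and a predetermined reference value. The returned list of visited indices corresponds to the selection chain.

```python
def select_option(resistances, reference, options):
    """Walk from the root RAND to a terminal RAND and pick one option.

    resistances: 2**n - 1 values, root first, in binary-heap order.
    options:     2**n labels, ordered left to right under the terminal RANDs.
    Returns (selected option, indices of the RANDs in the selection chain).
    """
    n_rands = len(resistances)
    if len(options) != n_rands + 1:
        raise ValueError("need exactly one more option than RANDs")
    node, chain = 0, []
    while node < n_rands:
        chain.append(node)
        # Larger than the reference: take the first branch or option.
        first = resistances[node] > reference
        node = 2 * node + (1 if first else 2)
    return options[node - n_rands], chain


# n = 2: R1 (root), R2 (options A, B), R3 (options C, D); values are made up.
print(select_option([8600.0, 8400.0, 8700.0], 8500.0, "ABCD"))  # ('B', [0, 1])
```

In this layout, index 0 plays the role of the first RAND and indices 1 and 2 play the roles of the second and third RANDs.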

In this example, the resistance value control unit 23 controls the resistance values R1 to R3 of the first to third RANDs of the analog resistance change element circuit unit 13 based on the output of the decision-making processing unit 21 and the result information output by the result determination unit 22.

Specifically, the resistance value control unit 23 can control the resistance value of each RAND by several methods, each of which is described below.

[First method]
In the first method, learning processing is performed for each RAND in the lowest layer. That is, in the first method, the following processing is executed a predetermined number of times for each RAND in the lowest layer.

Here, the control unit 11 selects, for each RAND, one of the corresponding options. For the second RAND, for example, it selects option A or B depending on whether the resistance value of the second RAND is larger than a predetermined reference value. This selection is the same as the operation of the decision-making processing unit 21 already described.

The control unit 11 then checks whether the selection produced a result, and controls the resistance value of the second RAND according to the selection and whether a result was obtained. This control is the same as the operation of the resistance value control unit 23. The control unit 11 performs this processing for the second RAND a predetermined number of times (for example, T times), and performs the same processing for the third RAND the same predetermined T times.

In the advertisement example, the control unit 11 tries the operation of presenting an advertisement selected from A and B T times and controls the resistance value of the second RAND according to the results. The control unit 11 likewise tries the operation of presenting an advertisement selected from C and D T times and controls the resistance value of the third RAND according to the results.

In this example of the present embodiment, during these processes the control unit 11 calculates, for each of the second and third RANDs, the sum $\Sigma E_k$ of the target values of its resistance value.

That is, as already described, through this operation of the control unit 11 the second RAND (corresponding to options A and B) after T trials can be expressed, with its initial resistance value denoted $R_\theta$, as
$$R(T) - R_\theta = E_B(T) - E_A(T) + \delta_2$$
where
$$E_k(T) = W_k(T) - \omega L_k(T) \quad (k = A, B).$$
In addition to controlling this resistance value, the control unit 11 calculates and stores
$$\Sigma E_{A,B}(T) = E_A(T) + E_B(T).$$
Similarly, for the third RAND (corresponding to options C and D), the control unit 11 calculates and stores
$$\Sigma E_{C,D}(T) = E_C(T) + E_D(T).$$

Note that providing the page that presents advertisement A or B and providing the page that presents advertisement C or D may be performed in parallel in time.

Based on the learning results obtained from the T trials in this way, the control unit 11 performs learning for the higher-level RAND. Specifically, for the upper-layer RAND that has a hierarchical association with the second and third RANDs (the first RAND in this example), the control unit 11 sets its resistance value R, with its initial resistance value denoted $R_{\theta 1}$, so that
$$R - R_{\theta 1} = \Sigma E_{C,D}(T) - \Sigma E_{A,B}(T) + \delta_1$$
(that is, a pulse signal is applied with $\Sigma E_{C,D}(T) - \Sigma E_{A,B}(T)$ as the target value).
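The first method can be illustrated with a small software simulation. The following is only a sketch under stated assumptions, not the device implementation: it assumes that $W_k$ and $L_k$ count the rewarded and unrewarded trials of option k, it models the fluctuation δ as Gaussian noise, and the function names, reward probabilities, and parameter values are hypothetical.

```python
import random

def train_bottom_rand(p_first, p_second, trials, omega=1.0, noise=0.1, seed=0):
    """Simulate T trials on one bottom-layer RAND (assumed software model).

    Returns (E_first, E_second), where E_k = W_k - omega * L_k.
    """
    rng = random.Random(seed)
    e_first, e_second = 0.0, 0.0
    for _ in range(trials):
        # Modeled offset R - R_theta = E_second - E_first + delta.
        offset = e_second - e_first + rng.gauss(0.0, noise)
        if offset > 0.0:
            # Resistance above the reference value: try the second option.
            rewarded = rng.random() < p_second
            e_second += 1.0 if rewarded else -omega
        else:
            # Otherwise: try the first option.
            rewarded = rng.random() < p_first
            e_first += 1.0 if rewarded else -omega
    return e_first, e_second

def upper_target(sum_first_pair, sum_second_pair):
    """Target for R - R_theta1 of the upper RAND (delta_1 omitted)."""
    return sum_second_pair - sum_first_pair

# Second RAND learns on (A, B); third RAND learns on (C, D).
e_a, e_b = train_bottom_rand(0.2, 0.6, trials=100)
e_c, e_d = train_bottom_rand(0.3, 0.4, trials=100, seed=1)
# The first RAND is then set from the two stored sums.
print(upper_target(e_a + e_b, e_c + e_d))
```

A positive target pushes the first RAND above its reference value, which favors the third RAND (options C and D); a negative target favors the second RAND.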

When there are three or more layers, the control unit 11 obtains $\Sigma E_k(T)$ for each RAND of the (i+1)-th layer after that RAND has been updated a predetermined number of times (for example, T times) in the learning process for the (i+1)-th layer, sets the resistance value of each RAND of the i-th layer in the same manner as in the above example, and thereby updates the resistance values of the RANDs of each layer forming the selection chain.

After this setting is finished (that is, after the learning process is complete), the control unit 11 makes a selection as already described. Starting from the top-level RAND, it selects one of the second-layer RANDs depending on whether the resistance value of the top-level RAND is greater than a predetermined reference value. It then refers to the resistance value of the selected second-layer RAND and selects one of the third-layer RANDs depending on whether that value is greater than the predetermined reference value, and so on: it refers to the resistance value of the selected RAND of the i-th layer and, depending on whether that value is greater than the predetermined reference value, selects one of the RANDs of the (i+1)-th layer. Finally, it determines which option to select depending on whether the resistance value of the RAND selected in the lowest layer is greater than the predetermined reference value.

In the above example, when the resistance value of the top-level first RAND is greater than the predetermined reference value, the control unit 11 selects the third RAND (corresponding to options C and D) and refers to the resistance value of the selected third RAND. If the resistance value of the third RAND is greater than the predetermined reference value, the control unit 11 selects option D; if it is not, the control unit 11 selects option C.
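The top-down selection described above amounts to walking a binary tree and comparing each visited RAND with the reference value. A minimal sketch follows, assuming the hierarchy is held in a dictionary that maps each RAND to its (not-greater child, greater child) pair; the names `select_option`, `tree`, and the resistance values are hypothetical and chosen only for illustration.

```python
def select_option(tree, resistance, reference, root):
    """Walk from the top RAND down to an option.

    tree maps a RAND name to (child chosen when R <= reference,
    child chosen when R > reference). Names not in tree are options.
    """
    node = root
    while node in tree:
        low_child, high_child = tree[node]
        node = high_child if resistance[node] > reference else low_child
    return node

# Two-layer example from the text: RAND1 over RAND2 (A, B) and RAND3 (C, D).
tree = {
    "RAND1": ("RAND2", "RAND3"),
    "RAND2": ("A", "B"),
    "RAND3": ("C", "D"),
}
resistance = {"RAND1": 1.2, "RAND2": 0.5, "RAND3": 0.8}
# RAND1 is above the reference, so RAND3 is consulted; RAND3 is not, so C.
print(select_option(tree, resistance, reference=1.0, root="RAND1"))
```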

[Second method]
In the second method, as illustrated in FIG. 21, the control unit 11 selects an option based on the resistance value of each RAND at the time of processing (S2101). That is, starting from the top-level RAND, the control unit 11 selects one of the second-layer RANDs depending on whether the resistance value of the top-level RAND is greater than a predetermined reference value. It then refers to the resistance value of the selected second-layer RAND and selects one of the third-layer RANDs depending on whether that value is greater than the predetermined reference value, and so on: it refers to the resistance value of the selected RAND of the i-th layer (hereinafter, the i-th layer RAND of interest) and, depending on whether that value is greater than the predetermined reference value, selects one of the RANDs of the (i+1)-th layer. Finally, it determines which option to select depending on whether the resistance value of the RAND selected in the lowest layer (hereinafter, the lowest-layer RAND of interest) is greater than the predetermined reference value.

The control unit 11 then performs a predetermined process based on the determined selection (for example, provides a web page carrying the advertisement represented by the selected option) and checks whether there was a successful outcome (for example, whether the advertisement was clicked) (S2102). The control unit 11 then controls the resistance value of the lowest-layer RAND of interest according to the selected option and whether there was a successful outcome (S2103). This control is the same as the operation of the resistance value control section 23.

At this time, the control unit 11 also calculates, as an evaluation value for each option, the quantity already described:
$$E_k(t) = W_k(t) - \omega L_k(t)$$
where k = A, B, C, ... and t is the time (number of trials).

The control unit 11 then controls the resistance value of the (N−1)-th layer RAND of interest, which is the parent of the lowest-layer RAND of interest in the lowest layer (N-th layer), as follows.

Working upward from the lowest layer, once the control unit 11 has obtained the evaluation values of the lower-layer (i-th layer) RANDs at time t, it sets the resistance value at time t of the (i−1)-th layer RAND, using the evaluation values of the i-th layer RANDs that are its alternatives (whose options are U, V and X, Y, respectively), so that
$$R(t) - R_\theta = \mathrm{MAX}[E_X(t), E_Y(t)] - \mathrm{MAX}[E_U(t), E_V(t)] + \delta$$
(that is, a pulse signal is applied with $\mathrm{MAX}[E_X(t), E_Y(t)] - \mathrm{MAX}[E_U(t), E_V(t)]$ as the target value). Here, MAX[x, y] denotes the larger of x and y.

In other words, each time $E_k$ on the lower-layer side is updated, the control unit 11 updates the resistance value of the upper-layer RAND as follows (S2104; upper-layer resistance value update). In the above example, when $E_k$ (k = U, V) of the i-th layer RAND that selects between options U and V has been updated, the control unit 11 sets a target value so that the resistance value of the (i−1)-th layer RAND hierarchically associated with that i-th layer RAND decreases from $R(t-1)$ by $\mathrm{MAX}[E_U(t), E_V(t)] - \mathrm{MAX}[E_U(t-1), E_V(t-1)]$, since the $\mathrm{MAX}[E_U, E_V]$ term enters the relation for $R(t) - R_\theta$ with a negative sign.

Likewise, when $E_k$ (k = X, Y) of the i-th layer RAND that selects between options X and Y has been updated, the control unit 11 sets a target value so that the resistance value of the (i−1)-th layer RAND hierarchically associated with that i-th layer RAND increases from $R(t-1)$ by $\mathrm{MAX}[E_X(t), E_Y(t)] - \mathrm{MAX}[E_X(t-1), E_Y(t-1)]$.

As a result, at time t the resistance value of that RAND is controlled toward a target value such that
$$R(t) - R_\theta = \mathrm{MAX}[E_X(t), E_Y(t)] - \mathrm{MAX}[E_U(t), E_V(t)] + \delta.$$
In this example of the present embodiment, the control unit 11 thus updates the resistance values of the RANDs of each layer forming the selection chain.
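The relation for the upper-layer target value in the second method can be checked with a short calculation. The sketch below is an illustration only: it omits the fluctuation δ, and the function name `upper_target_max` and the numerical evaluation values are hypothetical.

```python
def upper_target_max(e_u, e_v, e_x, e_y):
    """Target for R(t) - R_theta of the upper RAND (delta omitted).

    U, V are the options of one lower RAND; X, Y are those of the other.
    """
    return max(e_x, e_y) - max(e_u, e_v)

# Evaluation values E_k(t) at one point in time (illustrative numbers).
before = upper_target_max(2.0, 5.0, 1.0, 3.0)   # 3 - 5 = -2
# After a rewarded trial raises E_Y from 3 to 6, the target is recomputed.
after = upper_target_max(2.0, 5.0, 1.0, 6.0)    # 6 - 5 = 1
print(before, after)
```

In this illustration the target changes sign once the best option under the X, Y side overtakes the best option under the U, V side, so the upper RAND starts to favor the X, Y side.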

As described above, the control unit 11 executes the selection in processes S2101 and S2102 while performing the learning in processes S2103 and S2104. That is, as already described, starting from the top-level RAND, the control unit 11 selects one of the second-layer RANDs depending on whether the resistance value of the top-level RAND is greater than a predetermined reference value. It then refers to the resistance value of the selected second-layer RAND and selects one of the third-layer RANDs depending on whether that value is greater than the predetermined reference value, and so on: it refers to the resistance value of the selected RAND of the i-th layer and, depending on whether that value is greater than the predetermined reference value, selects one of the RANDs of the (i+1)-th layer. Finally, it determines which option to select depending on whether the resistance value of the RAND selected in the lowest layer is greater than the predetermined reference value.

In this second method of the present embodiment, the control unit 11 may also perform, in parallel with process S2102 between process S2101 and process S2103, a process that brings the resistance value of every RAND closer to the predetermined reference value used as the criterion for the decision. Specifically, with the reference value denoted $R_T$ and the resistance value of each RAND at time t denoted $R(t)$, the control unit 11 performs control so that
$$R'(t) = \alpha\,(R(t) - R_T) + R_T$$
(that is, the deviation of the resistance value from the reference value is multiplied by α), using $\alpha(R(t) - R_T) + R_T$ as the target value. Here, α is a real number from 0 to 1 inclusive and is determined empirically.
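The effect of this process can be seen with a one-line calculation. The following sketch simply evaluates the target value $\alpha(R(t) - R_T) + R_T$; the function name and the sample values are hypothetical.

```python
def relax_toward_reference(r, r_ref, alpha):
    """Return the target value alpha * (R(t) - R_T) + R_T."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1 inclusive")
    return alpha * (r - r_ref) + r_ref

# With alpha = 0.5 the deviation from the reference value is halved.
print(relax_toward_reference(3.0, 1.0, 0.5))   # 2.0
# alpha = 1 leaves the value unchanged; alpha = 0 resets it to the reference.
print(relax_toward_reference(3.0, 1.0, 1.0))   # 3.0
print(relax_toward_reference(3.0, 1.0, 0.0))   # 1.0
```

A value of α below 1 gradually discounts older learning, which is one way to read the role of this step when reward probabilities drift over time.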

[Example of another method (third method)]
In another method, as illustrated in FIG. 22, the control unit 11 selects an option based on the resistance value of each RAND at the time of processing (S2201). That is, starting from the top-level RAND, the control unit 11 selects one of the second-layer RANDs depending on whether the resistance value of the top-level RAND is greater than a predetermined reference value. It then refers to the resistance value of the selected second-layer RAND and selects one of the third-layer RANDs depending on whether that value is greater than the predetermined reference value, and so on: it refers to the resistance value of the selected RAND of the i-th layer (hereinafter, the i-th layer RAND of interest) and, depending on whether that value is greater than the predetermined reference value, selects one of the RANDs of the (i+1)-th layer. Finally, it determines which option to select depending on whether the resistance value of the RAND selected in the lowest layer (hereinafter, the lowest-layer RAND of interest) is greater than the predetermined reference value.

Next, the control unit 11 performs a process that brings the resistance value of every RAND closer to the predetermined reference value used as the criterion for the decision (S2202). Specifically, with the reference value denoted $R_T$ and the resistance value of each RAND at time t denoted $R(t)$, the control unit 11 performs control so that
$$R'(t) = \alpha\,(R(t) - R_T) + R_T$$
(that is, the deviation of the resistance value from the reference value is multiplied by α), using $\alpha(R(t) - R_T) + R_T$ as the target value. Here, α is a real number from 0 to 1 inclusive and is determined empirically.

Meanwhile, the control unit 11 tries the option selected in process S2201 and determines whether there was a successful outcome (S2203). Specifically, when the options are advertisements placed on a web page, the presence or absence of a successful outcome is determined by whether the advertisement was clicked.

The control unit 11 controls the resistance value of the lowest-layer RAND corresponding to the option selected in process S2201, according to the result of the determination in process S2203 (S2204). That is, suppose the RAND of the lowest layer (N-th layer) corresponds to options X and Y (the lowest-layer RAND of interest) and selects option Y when its resistance value is greater than the predetermined reference value and option X otherwise. In that case, the control unit 11:
(1) when option X was selected and a successful outcome was determined, applies a set signal to the corresponding RAND a predetermined number of times to reduce the resistance value (for example, with $-r$ as the target value);
(2) when option X was selected and no successful outcome was determined, applies a reset signal to the corresponding RAND a predetermined number of times to increase the resistance value (for example, with $+\omega_s$ as the target value);
(3) when option Y was selected and a successful outcome was determined, applies a reset signal to the corresponding RAND a predetermined number of times to increase the resistance value (for example, with $+r$ as the target value); and
(4) when option Y was selected and no successful outcome was determined, applies a set signal to the corresponding RAND a predetermined number of times to reduce the resistance value (for example, with $-\omega_s$ as the target value).

Furthermore, for each i-th layer (i = 1, 2, ..., N−1), the control unit 11 takes the RAND that has a hierarchical association with the (i+1)-th layer RAND selected in process S2201 (the (i+1)-th layer RAND of interest) as the i-th layer RAND of interest. Denoting the pair of RANDs in the lower layer ((i+1)-th layer) that are hierarchically associated with this i-th layer RAND of interest (one of which is the (i+1)-th layer RAND of interest) as RANDx and RANDy, the control unit 11:
(1) when RANDx was selected and a successful outcome was determined, applies a set signal to the i-th layer RAND of interest a predetermined number of times to reduce the resistance value (for example, with $-r$ as the target value);
(2) when RANDx was selected and no successful outcome was determined, applies a reset signal to the i-th layer RAND of interest a predetermined number of times to increase the resistance value (for example, with $+\omega_i$ as the target value);
(3) when RANDy was selected and a successful outcome was determined, applies a reset signal to the i-th layer RAND of interest a predetermined number of times to increase the resistance value (for example, with $+r$ as the target value); and
(4) when RANDy was selected and no successful outcome was determined, applies a set signal to the i-th layer RAND of interest a predetermined number of times to reduce the resistance value (for example, with $-\omega_i$ as the target value) (upper layer update process; S2205).
Here, $\omega_i$ is the target value for the i-th layer, and a different value may be set for each layer.

The control unit 11 performs this upper layer update process on each layer in turn, from the (N−1)-th layer to the first layer, and thereby updates the resistance value of the RAND of interest (the RAND selected in process S2201) in each layer included in the selection chain. The control unit 11 repeatedly executes processes S2201 to S2205.
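One pass through processes S2201 to S2205 can be sketched in software as follows. This is a simplified model under stated assumptions, not the device control itself: each RAND is represented only by its offset from the reference value ($R - R_T$), pulse application is replaced by adding the target change directly, a single ω is used for all layers although the text allows a different $\omega_i$ per layer, and `try_option` is a hypothetical callback that reports whether the tried option produced a successful outcome.

```python
def third_method_step(tree, root, offset, try_option,
                      r=1.0, omega=0.5, alpha=1.0):
    """One pass of S2201-S2205 on modeled offsets (R - R_T) per RAND.

    tree maps a RAND name to (child when offset <= 0, child when offset > 0).
    offset is a dict of modeled offsets and is updated in place.
    """
    # S2201: walk down from the top RAND, remembering each decision.
    path, node = [], root
    while node in tree:
        went_high = offset[node] > 0.0
        path.append((node, went_high))
        node = tree[node][1 if went_high else 0]
    option = node

    # S2202: pull every RAND toward the reference value.
    for name in offset:
        offset[name] *= alpha

    # S2203: try the selected option and observe the outcome.
    rewarded = try_option(option)

    # S2204 and S2205: update the lowest-layer RAND of interest first,
    # then each RAND of interest up to the first layer.
    for name, went_high in reversed(path):
        step = r if rewarded else -omega
        # Reinforce the branch taken on success, weaken it on failure.
        offset[name] += step if went_high else -step
    return option, rewarded

tree = {"RAND1": ("RAND2", "RAND3"), "RAND2": ("A", "B"), "RAND3": ("C", "D")}
offset = {"RAND1": 0.0, "RAND2": 0.0, "RAND3": 0.0}
print(third_method_step(tree, "RAND1", offset, lambda opt: opt == "A"))
print(offset)
```

Because the selection in S2201 is itself the decision, repeating this step both learns and decides, which matches the remark that the loop can follow changes in the reward probabilities.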

In this example, the selection result of process S2201 can be used directly as the result of the selection process. When the learning process can be judged to be complete, only process S2201 may be executed repeatedly. However, when the probability of obtaining a successful outcome for each option varies (for an advertisement, for example, it may vary with the season), repeating processes S2201 to S2205 makes it possible to select options (make decisions) in a way that follows the variation.

In the description so far, the analog resistance change element circuit section 13 may use a circuit with a crossbar structure as illustrated in FIG. 23. As shown in FIG. 23, this circuit is formed by arranging a RAND (or a variable resistance element; the same applies to the examples described so far) at each intersection of first and second wiring groups 2301 and 2302, which are arranged in mutually different layers and intersect in plan view (in FIG. 23, the wiring group extending in the left-right direction is the first wiring group 2301, and the wiring group crossing it at right angles is the second wiring group 2302). In FIG. 23, for convenience of illustration, each RAND (or variable resistance element) is shown with the symbol of an ordinary resistor.

When such an analog resistance change element circuit section 13 is used to realize the example of this embodiment in which a virtually hierarchical decision-making processing circuit is formed, and the number of RANDs in the lowest layer is $2^n$, a circuit having at least $2^n \times 2^n$ wiring intersections is used.

In this circuit, the RANDs used are those located at the intersections on the virtual line segments 2303-1, 2303-2, ..., each of which connects the intersection at row i, column 1 to the intersection at row 1, column i (where $i = 2^k$, $k = 1, 2, \ldots, n$): RAND1-1 at the intersection of the first-row wire of the first wiring group 2301 and the first-column wire of the second wiring group 2302 (hereinafter written as the intersection at row 1, column 1), RAND2-1 and RAND1-2 at row 2, column 1 and row 1, column 2, and so on. The other RANDs are not used.

In other words, if the RANDs on line segment 2303-k are used as the RANDs of the k-th layer in the hierarchical decision-making processing circuit, the decision-making device 1 of this embodiment can be realized through processing by the various methods described above. The present embodiment is not limited to this example, however, and a different assignment may be adopted as to which grid point's RAND is assigned to each RAND in the hierarchical decision-making processing circuit.
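The assignment of crossbar intersections to layers can be enumerated with a few lines of code. The sketch below lists the intersections on the line segment from row $2^k$, column 1 to row 1, column $2^k$. Treating k = 0 as the single top-level RAND1-1 is an assumption made here for illustration, as is the function name `layer_cells`.

```python
def layer_cells(k):
    """Return the (row, column) intersections on the line segment
    from row 2**k, column 1 to row 1, column 2**k."""
    i = 2 ** k
    # Every cell on this anti-diagonal satisfies row + column == i + 1.
    return [(row, i + 1 - row) for row in range(i, 0, -1)]

for k in range(3):
    print(k, layer_cells(k))
```

Each segment holds $2^k$ intersections, so the segment for k = n fits inside a $2^n \times 2^n$ crossbar.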

[Modified example]
In the description so far, an analog resistance change element circuit is used, and it functions as a selection element that selects one of a pair of options based on the resistance value of the circuit. However, the physical quantity used as the selection criterion is not limited to a resistance value. Any element may be used as long as its physical quantity, such as the direction and strength of a magnetic field, can be controlled in the positive or negative direction with respect to a predetermined reference value with stochastic fluctuation (that is, the quantity fluctuates stochastically in response to control toward a target value, but changes in the intended direction, and the average amount of change is significant, i.e., measurable).

The target of decision-making by the decision-making device 1 according to the present embodiment is not limited to the advertisement example described above. As long as one option is selected from a plurality of options, the device can be used, for example, to select a communication channel (in which case the reward depends on whether the channel was free, that is, whether communication succeeded) or to select the frequency at which power is extracted in a vibration power generation device (in which case the reward depends on whether at least a predetermined amount of power could be extracted).

1 Decision-making device
11 Control unit
12 Storage section
13 Analog resistance change element circuit section
14 Input section
15 Output section
21 Decision-making processing section
22 Result judgment section
23 Resistance value control section
101 Decision-making control section
102 RAND circuit
103 RAND controller
104 External interface section
111, 112 Data line
121 Control line
201, 202, 203, 204 Slot machine
205 Coin
301 External interface management module
302 Resistance value management module
303 Execution management module
401, 402 Bar
411, 412 Slot machine
420 Fluctuation
501 TiN electrode (TE)
502 TiN electrode (BE)
503 TaO_x (1)
504 TaO_x (2)
505 SiO₂ layer
506 Substrate
601 RAND
602 Load resistance
701 Set voltage operation
702 Reset voltage operation
901, 902 Slot machine
903 Bar

Claims

A decision-making device element that determines one of two or more stochastic reward granting means based on a TOW (Tug-of-War) model,
A circuit means whose resistance value changes by applying a predetermined pulse voltage, wherein the resistance value changes linearly with respect to the number of times the pulse voltage is applied, and further the resistance value has stochastic fluctuations. changing circuit means;
Depending on the magnitude of the current resistance value of the circuit means with respect to a predetermined threshold value, it is determined which of the first stochastic reward giving means and the second stochastic reward giving means is to be executed, and the reward As a result of the granting process, a pulse voltage is applied to change the resistance value in a direction in which it is determined that the stochastic reward granting means that has been executed further executes the reward granting process when there is a reward, and when there is no reward circuit control means for applying a pulse voltage for changing the resistance value in a direction in which it is determined that a stochastic reward giving means different from the stochastic reward giving means that executed the reward giving process executes the reward giving process;
A decision-making device that determines one from three or more probabilistic reward giving means using a decision-making device element comprising :
A hierarchical structure formed by arbitrarily selecting two of all stochastic reward giving means to form a pair to form a lower hierarchy, and further combining each pair to form a pair to form an upper hierarchy. and associating the decision-making device elements with respect to each of the formed pairs, determining one of the stochastic reward granting means according to a set of resistance values of all the decision-making device elements , and determining the stochastic reward granting means. Upper layer processing that executes lower layer processing that updates the resistance value of the corresponding decision making device element according to the reward result of the means, and sequentially updates the resistance value of the higher layer decision making device element based on the updated value. A decision-making device characterized in that by repeating the above, it is determined to execute a reward grant process of one stochastic reward granting means from three or more stochastic reward granting means.

The decision-making device according to claim 1 , wherein the upper layer processing is executed after the lower layer processing has been executed a predetermined number of times.

2. The upper layer process is performed so as to select a probabilistic reward giving means with a higher evaluation value based on the result of the executed reward process as a result of the lower layer process. decision-making device.

Based on the TOW model (Tug-of-War), in which a solution is obtained by a virtual bar negotiation between two parties, a decision-making method is developed in which a decision is made to decide one of three or more probabilistic reward giving means. A program to be executed by a computer, the decision making method comprising:
a hierarchy forming step of arbitrarily selecting two of all stochastic reward giving means to form a pair to form a lower hierarchy, and further combining each pair to form a pair to form an upper hierarchy;
For each pair of the formed lowest hierarchy, the pair of stochastic reward giving means is arranged on both sides of the virtual bar of the TOW model, and the virtual positional relationship of the virtual bar of the lowest hierarchy is determined. and a lowest reward granting step of determining and executing the reward granting process of which probabilistic reward granting means is to be executed based on the positional relationship of the virtual bars of all the layers above it;
As a result of the process of the lowest level reward granting step , the virtual bar of the lowest level is moved in the direction in which it is determined that the stochastic reward granting means that has executed the reward granting process further executes the reward granting process based on the result that there is a reward. is moved by a predetermined distance, and based on the result of no reward, the virtual of the lowest hierarchy is moved in the direction in which it is determined that a stochastic reward giving means different from the stochastic reward giving means that executed the reward giving process executes the reward giving process. a bottom update step of moving the bar by a predetermined distance;
and an upper update step in which, based on the lowest update step, updates are performed sequentially by reflecting the updated values of the immediately lower hierarchy in the virtual bars of each higher hierarchy. program.

A decision-making device that selects one option from among $2^n$ options (n is a natural number), each of which can yield a reward with a probability, the device comprising:
a circuit unit in which a virtual binary tree is constructed using $2^n - 1$ circuit elements, each capable of controlling the value of a predetermined physical quantity with stochastic fluctuations in the positive or negative direction with respect to a reference value, and in which the $2^{n-1}$ circuit elements of the lowest layer are subjected to a process of selecting one of the options;
trial means for selecting one of the options and determining whether there is a reward;
physical quantity control means for controlling the value of the physical quantity of each of the circuit elements related to the option selected by the trial means according to a determination result of the trial means, according to a predetermined rule;
Each of the circuit elements configured in the virtual binary tree is selected sequentially starting from the highest level circuit element, and depending on whether the value of the physical quantity of the selected circuit element is more positive than the reference value, the pair of lower order option selection means for repeatedly selecting one of the circuit elements and selecting one of the options depending on whether the value of the physical quantity of the selected lowest circuit element is more positive than the reference value; ,
has
The physical quantity control means includes:
In order to increase the probability that the option selected by the trial means is selected by the option selection means when a reward is received due to the selection of the trial means, the value of the physical quantity of each of the circuit elements related to the choice is set. A decision-making device that controls according to predetermined rules.

The decision-making device according to claim 5,
wherein the physical quantity control means:
when a reward is received as a result of the selection by the trial means, sets, among the circuit elements configured in the virtual binary tree, the circuit elements of each layer involved in selecting the option selected by the trial means as circuit elements of interest, and controls the value of the physical quantity of each circuit element of interest according to the predetermined rule so as to increase the probability that the option, or the circuit element of interest in the lower layer, is selected; and
when no reward is received as a result of the selection by the trial means, sets, among the circuit elements configured in the virtual binary tree, the circuit elements of each layer involved in selecting the option selected by the trial means as circuit elements of interest, and controls the value of the physical quantity of each circuit element of interest according to the predetermined rule so as to reduce the probability that the option, or the circuit element of interest in the lower layer, is selected.
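
The two cases in this claim can be illustrated with a short Python sketch. The claim leaves the "predetermined rule" open; the fixed additive steps `gain` and `loss`, the function name `update_path`, and the representation of the circuit elements of interest as `(element_index, went_positive)` pairs are assumptions chosen for illustration only.

```python
def update_path(values, path, rewarded, gain=1.0, loss=1.0):
    """Adjust the elements of interest after one trial.

    values:   mutable list of physical-quantity values, one per circuit element.
    path:     list of (element_index, went_positive) pairs, one per layer,
              naming the elements involved in choosing the tried option.
    rewarded: True if the trial produced a reward.
    """
    for node, went_positive in path:
        # +1 if this element pointed to the positive side, -1 otherwise.
        direction = 1.0 if went_positive else -1.0
        if rewarded:
            # Push further toward the side taken: same branch more likely.
            values[node] += gain * direction
        else:
            # Push toward the opposite side: same branch less likely.
            values[node] -= loss * direction
    return values
```

Elements that are not on the path are left unchanged here, so a single trial only affects the n elements that took part in selecting the tried option.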

The decision-making device according to claim 5 or 6,
wherein the physical quantity control means
controls the value of the physical quantity of each circuit element so as to approach a predetermined value before controlling the value of the physical quantity of the circuit element according to the predetermined rule.
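
One possible reading of this pre-adjustment is a relaxation step that moves every value a fixed fraction of the way toward the predetermined value before the reward-dependent control is applied. The sketch below assumes that reading; the linear relaxation, the name `relax`, and the parameters `rest` and `rate` are hypothetical and are not specified by the claim.

```python
def relax(values, rest=0.0, rate=0.1):
    """Return new values, each moved a fraction `rate` of the way toward `rest`.

    rest: the predetermined value that every element approaches (assumed).
    rate: fraction of the remaining distance covered per call, in [0, 1].
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    return [v + rate * (rest - v) for v in values]
```

Under this reading, calling `relax` once per trial before the reward-dependent control gradually fades the influence of older trials.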

The decision-making device according to any one of claims 5 to 7,
wherein the physical quantity is a resistance value of a circuit, and the circuit element is a variable resistance element.

A method for controlling a decision-making device that selects one option from among 2^n options (n is a natural number), each of which yields a reward with some probability, wherein:
the decision-making device constructs a virtual binary tree using 2^n - 1 circuit elements, each capable of controlling the value of a predetermined physical quantity, which has stochastic fluctuations, in a positive or negative direction with respect to a reference value, and uses the 2^(n-1) circuit elements of the lowest layer for a process of selecting one of the options;
trial means selects one of the options and determines whether a reward is received;
physical quantity control means controls, according to a predetermined rule, the value of the physical quantity of each of the circuit elements related to the option selected by the trial means, based on the determination result of the trial means;
option selection means selects the circuit elements configured in the virtual binary tree sequentially starting from the highest-level circuit element, repeats a process of selecting one of the pair of lower-level circuit elements depending on whether the value of the physical quantity of the selected circuit element is more positive than the reference value, and selects one of the options depending on whether the value of the physical quantity of the selected lowest-level circuit element is more positive than the reference value; and
the physical quantity control means,
when a reward is received as a result of the selection by the trial means, controls the value of the physical quantity of each of the circuit elements related to that option according to the predetermined rule so as to increase the probability that the option selected by the trial means is selected by the option selection means.