JP2019128402A - Signal processor, sound emphasis device, signal processing method, and program

Info

Publication number: JP2019128402A
Application number: JP2018008649A
Authority: JP
Inventors: 達馬石原; Tatsuma Ishihara
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2018-01-23
Filing date: 2018-01-23
Publication date: 2019-08-01
Anticipated expiration: 2038-01-23
Also published as: JP6925995B2

Abstract

To calculate information to be used for signal processing with high precision. SOLUTION: A signal processor comprises a storage part, a similarity calculation part, a weight calculation part, an update part, and a signal processing part. The storage part stores a first feature quantity representing features of a first input signal. The similarity calculation part calculates the similarity between the first feature quantity and a second feature quantity representing features of a second input signal. The weight calculation part calculates a first weight for the first feature quantity on the basis of the similarity and the second feature quantity. The update part calculates a third feature quantity on the basis of the first feature quantity multiplied by the first weight and the second feature quantity, and updates the first feature quantity stored in the storage part with the third feature quantity. The signal processing part executes signal processing using the updated first feature quantity. SELECTED DRAWING: Figure 2

Description

Embodiments of the present invention relate to a signal processing device, a speech enhancement device, a signal processing method, and a program.

To improve the recognition rate of speech recognition systems, techniques that apply signal processing such as speech enhancement have been proposed. One technique used in speech enhancement devices is beamforming, which uses the spatial information of a signal to enhance speech arriving from a specific direction. To perform signal processing more accurately, it is desirable to calculate the information used for the signal processing (such as feature amounts) more accurately.

Patent Document 1: Japanese Patent No. 5044581

Non-Patent Document 1: Heymann et al., “NEURAL NETWORK BASED SPECTRAL MASK ESTIMATION FOR ACOUSTIC BEAMFORMING”, ICASSP 2016

However, the prior art cannot always calculate the information used for signal processing with high accuracy. For example, beamforming is sometimes given a forgetting function so that the current sound source position is enhanced preferentially. The forgetting function, however, also operates when the sound source does not move, which can reduce the enhancement effect.

The signal processing device according to the embodiment includes a storage unit, a similarity calculation unit, a weight calculation unit, an update unit, and a signal processing unit. The storage unit stores a first feature amount representing a feature of a first input signal. The similarity calculation unit calculates the similarity between the first feature amount and a second feature amount representing a feature of a second input signal. The weight calculation unit calculates a first weight for the first feature amount based on the similarity and the second feature amount. The update unit calculates a third feature amount based on the first feature amount multiplied by the first weight and on the second feature amount, and updates the first feature amount stored in the storage unit with the third feature amount. The signal processing unit executes signal processing using the updated first feature amount.

FIG. 1 is a hardware diagram of the signal processing device according to the first embodiment.
FIG. 2 is a block diagram of the signal processing device according to the first embodiment.
FIG. 3 is a flowchart of signal processing in the first embodiment.
FIG. 4 is a diagram for explaining the flow of processing that calculates and updates feature amounts.
FIG. 5 is a hardware configuration diagram of the signal processing device according to the second embodiment.
FIG. 6 is a block diagram of the signal processing device according to the second embodiment.
FIG. 7 is a flowchart of signal processing in the second embodiment.

Preferred embodiments of the signal processing device according to the present invention are described in detail below with reference to the accompanying drawings. The description mainly uses a device that executes signal processing for enhancing speech as an example, but the applicable signal processing is not limited to speech enhancement. The embodiments can be applied to the processing of any signal other than speech, and signal processing other than enhancement may also be applied.

Beamforming usually assumes that the direction of arrival of the sound source is constant. For this reason, when the speaker changes, or when the speaker moves relative to the speech input device (such as a microphone), the enhancement effect is harder to obtain than when the sound source is fixed. Techniques have therefore been proposed that provide a forgetting function as described above and enhance the current sound source position in preference to past sound source positions. However, because the forgetting function operates even when the speaker does not move relative to the device, the enhancement effect can be weaker than when no forgetting function is set.

Meanwhile, techniques that handle speaker changes by using clustering have been proposed. Such methods are rule-based, however, and include non-differentiable components. It has therefore been difficult to tune the parameters that improve clustering accuracy by using a criterion defined on the output, such as a criterion expressing maximization of the signal-to-noise ratio (SN ratio), that is, the maximum SNR criterion.

(First Embodiment)
The signal processing device according to the first embodiment stores feature amounts representing the spatial information of speakers in a plurality of storage areas. Each time a feature amount for a speech signal is input, the device feeds a neural network with the similarity between each feature amount stored in the storage unit and the input feature amount, together with the input feature amount itself. The neural network outputs weights whose number of dimensions equals the number of storage areas. The output weights include, for example, a weight for the stored feature amounts (erase weight), a weight for the input feature amount (write weight), and a weight for the feature amounts read from the storage areas (read weight). The feature amount read from the storage areas is used for signal processing such as beamforming.

In this embodiment, learning data can be used to train the neural network in appropriate ways of rewriting and reading out feature amounts. The network can therefore learn, for example, not to forget when retaining a feature amount is more favorable for enhancement.

In this embodiment, the similarity between the stored feature amount and the current feature amount, which is highly relevant to whether forgetting is needed, is included in the input of the neural network. This reduces the amount of data required for learning compared with the case where the similarity is not input. Even without the similarity as an input, the network can be trained so that its output changes depending on whether the stored feature amount and the current feature amount are similar, but doing so requires more data. Although the amount of learning data may increase, the device may be configured so that the similarity is not included in the input of the neural network.

Thus, according to this embodiment, the information used for signal processing can be calculated more accurately while a forgetting function is still provided. For example, the enhancement effect can be maintained even when the speaker does not move relative to the device. In addition, as described below, this embodiment uses a model that contains no non-differentiable components, so the parameters that define each function, including the forgetting function, can be tuned to maximize an evaluation criterion defined on the output (such as the SN ratio).

Next, the hardware configuration of the signal processing device according to the first embodiment is described with reference to FIG. 1. FIG. 1 is an explanatory diagram illustrating a hardware configuration example of the signal processing device 100 according to the first embodiment.

The signal processing device 100 includes a CPU (Central Processing Unit) 51, a ROM (Read Only Memory) 52, a RAM (Random Access Memory) 53, a storage device 54, and an operation device 55, which are connected to one another via a bus.

The CPU 51 uses the RAM 53 as a work area, executes various processes in cooperation with the program recorded in the RAM 53, and controls the overall operation of the signal processing device 100.

The ROM 52 stores, in a non-rewritable form, programs related to the operation of the signal processing device 100, media data required for learning, and the like.

The RAM 53 is a storage medium such as SDRAM (Synchronous Dynamic Random Access Memory). The RAM 53 functions as the work area of the CPU 51 and holds intermediate data, among other roles.

The storage device 54 is a medium that can store information magnetically or optically, and stores various setting information, learning results, and the like.

The operation device 55 is, for example, a keyboard and a mouse, and outputs user input to the CPU 51.

FIG. 2 is a block diagram illustrating an example of the configuration of the signal processing device 100. As shown in FIG. 2, the signal processing device 100 includes a generation unit 101, an analysis unit 111, a feature amount calculation unit 112, a similarity calculation unit 113, a weight calculation unit 114, an update unit 115, a signal processing unit 121, a learning unit 122, and a storage unit 141.

The storage unit 141 stores feature amounts (first feature amounts) calculated for speech signals input in the past (first input signals). The storage unit 141 can be implemented by, for example, the RAM 53 of FIG. 1. The storage unit 141 includes a plurality of storage areas and stores a feature amount in each of them.

The generation unit 101 generates the learning data used for learning. For example, the generation unit 101 generates learning data that includes a speech signal (third input signal) and reference data. The reference data represents the result of signal processing on the speech signal and is referred to by the learning unit 122 during learning.

The generation unit 101 generates learning data that increases diversity and improves robustness after learning, for example by processing learning data prepared in advance, and outputs the generated data to the analysis unit 111. As described above, the generated learning data can include reference data for use by the learning unit 122. In that case, the reference data need not be input to any unit other than the learning unit 122.

If data equivalent to the learning data generated by the generation unit 101 has already been prepared and the device is configured to use that data, the generation unit 101 need not be provided.

The speech signal is, for example, a signal recorded by a speech input device such as a microphone array. A microphone array includes a plurality of microphones placed at different positions in space and acquires speech signals on a plurality of channels corresponding to those microphones. The following description uses multi-channel speech signals as an example, but the same method can also be applied to a single-channel speech signal.

Any method may be used to generate the learning data. For example, the following methods can be used.
・Generate the impulse response of the region in which the sound source is present (such as a room) and convolve it with the original signal.
・Add noise.
・Drop samples at random.
・Add random delays between channels.
・Change the duration and the pitch with a phase vocoder.
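Two of the listed augmentations can be sketched as follows. This is a minimal illustration, not the method of the embodiment: the function names are hypothetical, "dropping" a sample is assumed to mean replacing it with zero, and the delay is assumed to be a whole number of samples with the signal length kept unchanged.

```python
import random

def drop_samples(signal, drop_prob, rng):
    """Zero out each sample independently with probability drop_prob (assumed meaning of 'drop')."""
    return [0.0 if rng.random() < drop_prob else s for s in signal]

def delay_channels(channels, max_delay, rng):
    """Delay each channel by its own random number of samples (0..max_delay).

    The front is zero-padded and the tail is truncated so the length is unchanged.
    """
    out = []
    for ch in channels:
        d = rng.randint(0, max_delay)
        out.append([0.0] * d + ch[:len(ch) - d])
    return out

rng = random.Random(0)
channels = [[1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 5.0]]
delayed = delay_channels(channels, max_delay=2, rng=rng)
dropped = [drop_samples(ch, drop_prob=0.2, rng=rng) for ch in delayed]
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which is convenient when comparing learning runs.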

The generation unit 101 may also generate learning data that reproduces a situation in which speakers take turns. For example, to reproduce the situation in which the speaker changes as A→B→A, the generation unit 101 may generate learning data by alternately concatenating a signal having the inter-channel correlation corresponding to A and a signal having the inter-channel correlation corresponding to B, and then superimposing noise. This can be expected to improve how quickly speech enhancement follows a speaker who has spoken in the past and speaks again.

The analysis unit 111 analyzes the input learning data and outputs, as the analysis result, the information used in subsequent processing. For example, the analysis unit 111 applies a windowed short-time Fourier transform to the input speech signal and outputs a spectrogram. As in Non-Patent Document 1, the analysis unit 111 may be configured so that a neural network performs a signal/noise decision for each time-frequency bin of the spectrogram and the decision result is added to the output for use in the subsequent feature amount calculation.
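The windowed short-time Fourier transform mentioned here can be sketched as below. The Hann window, the frame length, the hop size, and the direct (non-FFT) DFT are all assumptions chosen for readability; the embodiment does not specify them, and a practical implementation would use an FFT.

```python
import cmath
import math

def stft(signal, frame_len, hop):
    """Windowed short-time Fourier transform using a Hann window and a direct DFT.

    Returns a list of frames; each frame holds frame_len // 2 + 1 complex bins.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            bins.append(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                            for n in range(frame_len)))
        frames.append(bins)
    return frames

# A sinusoid with exactly 4 cycles per 16-sample frame; its energy peaks in bin 4.
x = [math.sin(2 * math.pi * 4 * n / 16) for n in range(64)]
spec = stft(x, frame_len=16, hop=8)
```

For multi-channel input, the same function would be applied to each channel separately, giving one spectrogram per channel.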

The feature amount calculation unit 112 calculates a feature amount from the information output by the analysis unit 111. For example, the feature amount calculation unit 112 calculates, as the feature amount, the spatial correlation between the signals of the plurality of channels included in the input signal. Examples of the spatial correlation include the spatial correlation of the entire input, a noise spatial correlation calculated only from the regions of the spectrogram estimated to contain mostly noise, and a signal spatial correlation calculated from the regions of the spectrogram estimated to contain mostly signal.
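One way to compute such spatial correlations for a single frequency bin is sketched below. The mask-weighted average is an assumption borrowed from common mask-based beamforming practice, and the function name and the example masks are hypothetical.

```python
def spatial_correlation(frames, mask=None):
    """Spatial correlation matrix for one frequency bin.

    frames: list of per-frame channel vectors (complex values), one vector per time frame.
    mask:   optional per-frame weights in [0, 1]; None weights every frame equally.
            The weights must not all be zero.
    Returns the C x C matrix sum_t m_t * x_t x_t^H / sum_t m_t.
    """
    n_ch = len(frames[0])
    if mask is None:
        mask = [1.0] * len(frames)
    total = sum(mask)
    phi = [[0j] * n_ch for _ in range(n_ch)]
    for x, m in zip(frames, mask):
        for a in range(n_ch):
            for b in range(n_ch):
                phi[a][b] += m * x[a] * x[b].conjugate()
    return [[value / total for value in row] for row in phi]

frames = [[1 + 1j, 2 - 1j], [0.5j, 1 + 0j], [2 + 0j, -1j]]
speech_mask = [1.0, 0.0, 1.0]   # frames assumed to contain mostly signal
noise_mask = [0.0, 1.0, 0.0]    # frames assumed to contain mostly noise
phi_all = spatial_correlation(frames)
phi_speech = spatial_correlation(frames, speech_mask)
phi_noise = spatial_correlation(frames, noise_mask)
```

Each result is Hermitian by construction, which matters later when the stored feature amounts are updated.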

The similarity calculation unit 113 calculates the similarity between the feature amount stored in each storage area of the storage unit 141 and the feature amount calculated by the feature amount calculation unit 112 (second feature amount). As the similarity, for example, the complex correlation coefficient Real(v^H r_i) / (|v| |r_i|) between the vector v obtained by vectorizing the spatial correlation and the content r_i of the i-th storage area is used. The symbol H denotes the Hermitian transpose.
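The correlation coefficient Real(v^H r_i) / (|v| |r_i|) can be computed as in the following sketch. Returning 0 for an all-zero storage area is an assumption made here to avoid division by zero; the embodiment does not say how empty storage areas are handled.

```python
import math

def similarity(v, r):
    """Real(v^H r) / (|v| |r|) for two complex vectors given as lists."""
    inner = sum(a.conjugate() * b for a, b in zip(v, r))
    norm_v = math.sqrt(sum(abs(a) ** 2 for a in v))
    norm_r = math.sqrt(sum(abs(b) ** 2 for b in r))
    if norm_v == 0.0 or norm_r == 0.0:
        return 0.0  # assumption: an all-zero vector is treated as dissimilar
    return inner.real / (norm_v * norm_r)

v = [1 + 1j, 2 - 1j, 0.5j]
memory = [v, [-a for a in v], [0j, 0j, 0j]]   # three storage areas
sims = [similarity(v, r) for r in memory]     # one similarity per storage area
```

The value lies between -1 and 1, and equals 1 when the stored content is a positive real multiple of v.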

The spatial correlation is calculated, for example, for each frequency. The vector v may be obtained by concatenating the feature amounts calculated for all frequencies and vectorizing the result. Alternatively, the spatial correlation calculated for each frequency may be vectorized individually. In the latter case, the subsequent processing, such as reserving the storage areas of the storage unit 141 and calculating the similarity, is also executed independently for each spatial correlation.

The weight calculation unit 114 calculates the erase weight, the write weight, and the read weight described above. The erase weight is a weight (first weight) for the feature amounts stored in the storage unit 141 and corresponds, for example, to the forgetting factor used by the forgetting function described above. The write weight is a weight (second weight) for the feature amount calculated by the feature amount calculation unit 112. The read weight is a weight (third weight) for the feature amounts read from the storage areas in order to calculate the feature amount used for signal processing.

The weight calculation unit 114 calculates the weights based on, for example, the similarity calculated by the similarity calculation unit 113 and the feature amount calculated by the feature amount calculation unit 112. A neural network that takes the similarity and the feature amount as input and outputs each weight can be used for this calculation. The model for calculating the weights is not limited to a neural network. For example, another model that performs regression, such as a Gaussian process, may be applied.

The weight calculation unit 114 uses, for example, a neural network that takes the similarity and the spatial correlation (the vectorized vector v) as input and outputs three weight vectors representing the erase weight, the write weight, and the read weight. Each weight vector has the same number of dimensions as the number of storage areas that store feature amounts. Each element of a weight vector takes a real value in the range from 0 to 1.
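The input and output shapes described here can be illustrated with a deliberately small network. Everything about the architecture below is an assumption: a single affine layer followed by a sigmoid, real and imaginary parts of v concatenated as real inputs, and randomly initialized (untrained) parameters `W` and `b`. The embodiment only fixes the inputs, the three outputs, their dimension, and the 0 to 1 range.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def compute_weights(sims, v, W, b):
    """One affine layer plus sigmoid (hypothetical architecture).

    sims: one similarity per storage area; v: complex feature vector.
    Returns (erase, write, read), each with len(sims) elements in (0, 1).
    """
    x = list(sims) + [c.real for c in v] + [c.imag for c in v]
    out = [sigmoid(sum(w_k * x_k for w_k, x_k in zip(row, x)) + b_j)
           for row, b_j in zip(W, b)]
    n = len(sims)
    return out[:n], out[n:2 * n], out[2 * n:3 * n]

rng = random.Random(0)
n_slots, dim = 3, 4                      # number of storage areas, length of v
n_in = n_slots + 2 * dim
W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(3 * n_slots)]
b = [0.0] * (3 * n_slots)
v = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(dim)]
erase, write, read = compute_weights([0.9, 0.1, -0.3], v, W, b)
```

The sigmoid output is what keeps every weight element between 0 and 1; because every operation is differentiable, parameters such as `W` and `b` can be learned by backpropagation.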

In this embodiment, the weights can be calculated so that they take different values depending on the similarity. For example, when the stored feature amount and the feature amount for the input speech signal are similar, in other words when the sound source has not moved, the effect of the forgetting function can be suppressed by making the erase weight large (close to 1, so that the stored feature amount is largely retained when multiplied by it). To suppress the effect of the forgetting function, it is sufficient that at least the erase weight is calculated according to the similarity; the other weights (write weight and read weight) may be determined by other methods. For example, the other weights may be set to fixed values, or they may be calculated from the value of the erase weight.

The update unit 115 updates the feature amount stored in each storage area of the storage unit 141 by using the calculated weight vectors and the feature amount calculated by the feature amount calculation unit 112. For example, the update unit 115 multiplies the stored feature amount by the erase weight, multiplies the feature amount calculated by the feature amount calculation unit 112 by the write weight, and adds the two products to calculate a feature amount (third feature amount). The feature amounts calculated in this way are vectors of the same dimension as the stored feature amounts, and there are as many of them as there are stored feature amounts (that is, storage areas). The update unit 115 updates the feature amounts stored in the storage unit 141 with the calculated feature amounts.
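The update rule described above amounts to r_i ← e_i · r_i + w_i · v for each storage area i, where e_i and w_i are the erase and write weights. A minimal sketch follows; the function names are hypothetical, and `read_memory`, which combines the storage areas as a read-weighted sum, is an assumption, since the text only states that the read weight is applied to the feature amounts that are read out.

```python
def update_memory(memory, v, erase, write):
    """For each storage area i: new r_i = erase[i] * r_i + write[i] * v."""
    return [[e * m + w * x for m, x in zip(slot, v)]
            for slot, e, w in zip(memory, erase, write)]

def read_memory(memory, read):
    """Assumed read-out: weighted sum of the storage areas using the read weights."""
    dim = len(memory[0])
    return [sum(r * slot[k] for slot, r in zip(memory, read)) for k in range(dim)]

memory = [[1 + 0j, 2j], [0j, 0j]]        # two storage areas, feature length 2
v = [2 + 0j, 2 + 0j]                     # newly calculated feature amount
new_memory = update_memory(memory, v, erase=[1.0, 0.0], write=[0.0, 1.0])
feature = read_memory(new_memory, read=[0.5, 0.5])
```

In this example the first storage area is kept unchanged (erase weight 1, write weight 0), while the second is overwritten with v (erase weight 0, write weight 1).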

Because a spatial correlation is Hermitian symmetric, the feature amount calculated by the update unit 115 must also be Hermitian symmetric when interpreted as a matrix. The feature amount is calculated from Hermitian-symmetric feature amounts (spatial correlations) using only operations that preserve Hermitian symmetry (such as multiplication by the weights and addition), so the feature amount calculated by the update unit 115 is also Hermitian symmetric.
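The reasoning can be written out in one line. The symbols below are introduced here for illustration only: R_i is the stored spatial correlation of the i-th storage area viewed as a matrix, V is the newly calculated spatial correlation, and e_i and w_i are the erase and write weights, which are real numbers between 0 and 1.

```latex
% Assumptions: R_i^{H} = R_i, \; V^{H} = V, \; e_i, w_i \in [0, 1] \subset \mathbb{R}
\left( e_i R_i + w_i V \right)^{H}
  = \overline{e_i}\, R_i^{H} + \overline{w_i}\, V^{H}
  = e_i R_i + w_i V
```

The second equality relies on the weights being real; a complex-valued weight would not, in general, preserve Hermitian symmetry.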

The signal processing unit 121 executes signal processing using the updated feature amount. The signal processing is, for example, speech enhancement processing that enhances part of the multi-channel speech signal. For example, the signal processing unit 121 generates a filter that enhances the signal from the feature amount (spatial correlation) read from the storage unit 141, and applies the generated filter to the input to obtain the output. As the method of calculating the filter, for example, a method based on the maximum SNR criterion, as described in Non-Patent Document 1, can be used. A post-filter may additionally be applied to the output signal. For example, BAN (Blind Analytical Normalization), as described in Non-Patent Document 1, can be used.
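For reference, the maximum SNR criterion is commonly written as follows. The notation is an assumption introduced here and does not appear in this document: Φ_S(f) and Φ_N(f) are the signal and noise spatial correlation matrices at frequency f, w(f) is the filter, x(t, f) is the multi-channel input, and y(t, f) is the enhanced output.

```latex
w(f) = \arg\max_{w} \frac{w^{H} \Phi_{S}(f)\, w}{w^{H} \Phi_{N}(f)\, w}
\quad\Longrightarrow\quad
\Phi_{S}(f)\, w(f) = \lambda_{\max}\, \Phi_{N}(f)\, w(f),
\qquad
y(t, f) = w(f)^{H} x(t, f)
```

That is, the filter is the eigenvector belonging to the largest eigenvalue of the generalized eigenvalue problem defined by the two matrices. The filter is determined only up to a complex scale factor per frequency, which is one reason a post-filter may be applied to the output.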

The learning unit 122 learns the parameters of the neural network used for weight calculation. For example, the learning unit 122 uses the learning data to execute the processing up to and including the signal processing by the signal processing unit 121, evaluates the result of the signal processing, and updates the parameters of the neural network according to the evaluation result. The learning unit 122 executes the learning process using, for example, the learning data generated by the generation unit 101. When the analysis unit 111 uses a neural network, the learning unit 122 may also learn the parameters of that neural network.

The learning unit 122 calculates an evaluation value from, for example, the reference data and the processing result of the signal processing unit 121, and updates the parameters of the neural network by error backpropagation. When the reference data is a signal on which no noise is superimposed, the squared error between the reference data and the output can be used as the evaluation value. When the reference data consists of the signal and the noise, the SN ratio that can be calculated from the applied filter can be used as the evaluation value.
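The two evaluation values can be sketched as below. The function names are hypothetical, and the SN ratio is assumed to be the ratio of the energy of the filtered speech to the energy of the filtered noise, expressed in decibels; the gradient computation for backpropagation is omitted.

```python
import math

def squared_error(output, reference):
    """Sum of squared differences between the enhanced output and a clean reference."""
    return sum(abs(o - r) ** 2 for o, r in zip(output, reference))

def snr_db(filtered_speech, filtered_noise):
    """SN ratio in dB from the filter applied separately to the speech and noise references."""
    p_s = sum(abs(s) ** 2 for s in filtered_speech)
    p_n = sum(abs(n) ** 2 for n in filtered_noise)
    return 10.0 * math.log10(p_s / p_n)

err = squared_error([1.0, 2.0], [1.0, 0.0])
snr = snr_db([10.0, 0.0], [1.0, 0.0])
```

The squared error is minimized during learning, whereas the SN ratio is maximized, so the sign of the objective differs between the two cases.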

The learning unit 122 determines from the progression of the evaluation value whether to end learning. A possible criterion for ending (end criterion) is, for example, that the evaluation values calculated from the most recent 10,000 inputs show no improvement. If the end criterion is not satisfied, the learning unit 122 outputs, for example, a command to the generation unit 101 to generate new learning data. If the end criterion is satisfied, the learning unit 122 stores the learned parameters in the storage unit 141 or the like and ends the learning process.
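One possible reading of this end criterion is sketched below. The definition of "no improvement" (the best value in the most recent window does not exceed the best value seen before it) and the convention that a higher evaluation value is better are assumptions; `should_stop` and `min_delta` are hypothetical names.

```python
def should_stop(history, window=10000, min_delta=0.0):
    """Return True when the last `window` evaluation values show no improvement.

    history: evaluation values in chronological order (higher is assumed better).
    """
    if len(history) <= window:
        return False  # not enough values yet to compare against an earlier best
    best_before = max(history[:-window])
    best_recent = max(history[-window:])
    return best_recent <= best_before + min_delta

stalled = should_stop([1.0, 2.0, 3.0, 3.0, 3.0], window=2)
improving = should_stop([1.0, 2.0, 3.0, 4.0, 5.0], window=2)
```

For an evaluation value where lower is better, such as a squared error, the values can be negated before being appended to `history`.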

Each of the above units (the generation unit 101, the analysis unit 111, the feature amount calculation unit 112, the similarity calculation unit 113, the weight calculation unit 114, the update unit 115, the signal processing unit 121, and the learning unit 122) is implemented by, for example, one or more processors. For example, each unit may be implemented by causing a processor such as the CPU 51 to execute a program, that is, by software. Each unit may be implemented by a processor such as a dedicated IC (Integrated Circuit), that is, by hardware. Each unit may also be implemented by a combination of software and hardware. When a plurality of processors are used, each processor may implement one of the units or two or more of the units.

The storage unit 141 can be implemented with any commonly used storage medium, such as an HDD (Hard Disk Drive), an optical disc, a memory card, or a RAM. The storage areas of the storage unit 141 may be physically different storage media or different storage areas of the same physical storage medium. Furthermore, each storage area of the storage unit 141 may be implemented by a plurality of physically different storage media.

Next, signal processing by the signal processing device 100 configured as described above is described with reference to FIG. 3. FIG. 3 is a flowchart illustrating an example of signal processing in the first embodiment.

First, when the start of signal processing is instructed via the operation device 55 or the like, the generation unit 101 executes initialization processing (step S101). For example, the generation unit 101 reserves, in the storage unit 141, a storage area for the various settings of the learning process and storage areas for storing feature amounts.

The generation unit 101 also reads out learning data stored in advance in the storage unit 141 or the like and stores it in the RAM 53. The learning data may be read out and stored all at once or sequentially. The generation unit 101 may discard data that has been read.

The learning data is divided into two types of signals, for example, signals to be enhanced and signals to be suppressed. A signal to be enhanced is typically speech (a speech signal). Any component that is not a target of enhancement is assumed to be sufficiently small even if present. For example, learning data whose SN ratio is at or above a predetermined threshold (for example, 40 dB (decibels)) is used. The following description assumes that the target of enhancement is speech, but note that the procedure below also applies when the target is not speech. For example, it can be applied to any signal that has a characteristic pattern in the time-frequency domain, such as the sound of a musical instrument. The target is also not limited to sound waves and may be, for example, electromagnetic waves including reflected laser light. A signal to be suppressed is referred to below as noise (a noise signal).

For both speech and noise, signals that can be regarded as the same are observed across a plurality of channels, and the signal of at least one channel differs from the signals of the other channels. Such signals are obtained, for example, by recording with a microphone array. They may also be generated by simulating multi-channel recording, for example by convolving a one-channel signal with the impulse responses of the region (such as a room) in which the sound source is located. Note that speech and noise have the same number of channels.

Next, the generation unit 101 generates, from the learning data prepared in advance, the learning data used in the learning process by the learning unit 122 (step S102). For example, the generation unit 101 randomly selects a speech signal and a noise signal, adjusts their amplitudes to a random SN ratio, and superimposes them on all channels. The generation unit 101 determines the SN ratio by, for example, sampling from a uniform distribution over a predetermined range (for example, -5 dB to 10 dB). At this time, the start time of the speech may be delayed by a random amount common to all channels. For example, when the noise is sufficiently longer than the speech, the generation unit 101 determines the delay by sampling from a uniform distribution over the range of delays for which the speech falls within the span of the noise.
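The mixing in step S102 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `mean_power` and `mix_at_snr`, the choice to measure noise power only over the segment that overlaps the speech, and the averaging of power over all channels are assumptions made here.

```python
import math
import random


def mean_power(channels):
    """Average power over every sample of every channel."""
    total = sum(x * x for ch in channels for x in ch)
    count = sum(len(ch) for ch in channels)
    return total / count


def mix_at_snr(speech, noise, snr_db, delay):
    """Add multi-channel speech to multi-channel noise at the requested SN ratio.

    speech, noise: lists of channels (equal channel counts), each a list of floats.
    The same gain and the same delay are applied to every channel.
    """
    n_speech = len(speech[0])
    # Assumption: noise power is measured over the part that overlaps the speech.
    overlapped = [ch[delay:delay + n_speech] for ch in noise]
    # Gain that makes (speech power) / (noise power) equal 10^(snr_db / 10).
    gain = math.sqrt(mean_power(overlapped) * 10 ** (snr_db / 10) / mean_power(speech))
    mixed = [list(ch) for ch in noise]
    for ch_mixed, ch_speech in zip(mixed, speech):
        for i, sample in enumerate(ch_speech):
            ch_mixed[delay + i] += gain * sample
    return mixed


rng = random.Random(0)
speech = [[rng.gauss(0, 1) for _ in range(100)] for _ in range(2)]
noise = [[rng.gauss(0, 1) for _ in range(400)] for _ in range(2)]
snr_db = rng.uniform(-5.0, 10.0)  # SN ratio drawn uniformly from [-5 dB, 10 dB]
delay = rng.randint(0, len(noise[0]) - len(speech[0]))  # keeps the speech inside the noise
mixed = mix_at_snr(speech, noise, snr_db, delay)
```

Drawing the delay from `0` to `len(noise) - len(speech)` is one way to guarantee that the speech lies entirely within the noise.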

A plurality of speech signals may be superimposed on the noise. In that case, the generation unit 101 uses speech signals that do not overlap each other. The generation unit 101 may convolve a common impulse response with the plurality of speech signals, which simulates utterances made from the same position. The generation unit 101 may instead be configured to convolve impulse responses for slightly different positions, for example positions displaced by about 20 cm to 50 cm, which simulates a sound source that has moved slightly.

From the data obtained as described above, the generation unit 101 may cut out the portions that contain no speech and use the remainder as the learning data.

Next, the analysis unit 111 receives the generated learning data (input signal) and executes signal analysis processing on it (step S103). For example, the analysis unit 111 performs time-frequency analysis on each channel of the input signal, outputs the analysis result expressed in the time-frequency domain, and stores it in, for example, the RAM 53. As the method of time-frequency analysis, filter bank analysis such as the short-time Fourier transform or the wavelet transform can be used.

Next, the analysis unit 111 inputs the analysis result to the neural network N_1 and stores the intermediate outputs and the final output of the neural network N_1 in, for example, the RAM 53. The input may be given with a plurality of channels together, or each channel may be processed independently. When each channel is processed independently, post-processing is added to obtain the final output; for example, the median of the outputs obtained for the channels may be taken at each time-frequency coordinate.

Here, the number of dimensions of the final output of the neural network N_1 is twice the number of feature values per frame of the analysis result. Any structure can be adopted for the components of the neural network N_1, such as feedforward connections, convolutional connections, and structures using LSTM (long short-term memory). Note that when a structure that uses information from the entire sequence, such as a bidirectional LSTM, is used, online processing is not possible at run time after learning.

The phase information of the analysis result may be discarded to leave only the absolute value, and the natural logarithm of the absolute value may be input to the neural network N_1. This configuration narrows the dynamic range of the input and improves the stability of the parameter updates in the later stage.

The analysis unit 111 applies a sigmoid function to the final output of the neural network N_1. The sigmoid function is used, for example, to confine the output to the range 0 to 1. A function other than the sigmoid function that serves the same purpose may be used. The analysis unit 111 splits the output of the sigmoid function into two parts, using one as the speech mask and the other as the noise mask.
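The mask split can be illustrated with a short sketch for one frame. The function names are hypothetical, and treating the first half of the output as the speech mask and the second half as the noise mask is an assumption; the source only says that the output is separated into two parts.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping any real value into [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def split_masks(final_output):
    """Turn a 2F-dimensional network output for one frame into two F-dimensional masks.

    Assumption: the first F values form the speech mask, the last F the noise mask.
    """
    squashed = [sigmoid(z) for z in final_output]
    half = len(squashed) // 2
    return squashed[:half], squashed[half:]


speech_mask, noise_mask = split_masks([0.0, 50.0, -50.0, 0.0])
```

This also shows why the final output has twice as many dimensions as there are feature values per frame: each time-frequency bin receives one speech-mask value and one noise-mask value.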

Next, the feature amount calculation unit 112 calculates a feature amount for each of speech and noise (step S104). For example, the feature amount calculation unit 112 applies the respective masks to the analysis result to obtain an estimate of the spatial correlation of the speech and an estimate of the spatial correlation of the noise. More specifically, for the input vector x(t, ω) at time t and frequency ω, the feature amount calculation unit 112 uses the speech mask m_S(t, ω) and the noise mask m_N(t, ω) to calculate the feature amount (spatial correlation) ξ_X by equation (1) below.
$$\xi_X(t,\omega) = m_X(t,\omega)\, x(t,\omega)\, x^{H}(t,\omega) \qquad (1)$$
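As a small worked instance of equation (1), the following sketch computes the masked outer product for a single time-frequency bin and flattens it into a vector. The function names and the row-major flattening order are assumptions made for illustration.

```python
def spatial_correlation(mask, x):
    """Equation (1) for one time-frequency bin: mask * x * x^H.

    x holds one complex value per channel; the result is a Hermitian matrix.
    """
    return [[mask * xi * xj.conjugate() for xj in x] for xi in x]


def vectorize(matrix):
    """Flatten a matrix row by row into a single vector (assumed ordering)."""
    return [value for row in matrix for value in row]


x = [1 + 1j, 2j]                      # two-channel observation
xi_s = spatial_correlation(0.5, x)    # speech mask value 0.5 at this bin
v_s = vectorize(xi_s)
```

For a signal with C channels the matrix is C x C, so the flattened vector for one frequency has C squared elements.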

The subscript "X" of ξ_X and m_X stands for either "S", indicating speech, or "N", indicating noise. The following processing is executed independently for speech and for noise. For convenience of explanation, variable names with "X" are used when there is no need to distinguish the two. Each element of the input vector x(t, ω) corresponds to one channel.

Next, the similarity calculation unit 113 calculates the similarity between each feature amount stored in the storage unit 141 and the feature amount calculated in step S104 (step S105). Let L be the number of storage areas that store feature amounts. The L vectors representing the feature amounts stored in the L storage areas are denoted r_1, r_2, ..., r_L. In the following, the matrix formed by arranging these L vectors is written R = {r_1, r_2, ..., r_L}.

For example, the similarity calculation unit 113 calculates, as the similarity, the correlation coefficient between each of the L vectors r_1, r_2, ..., r_L and the vector v_X obtained by vectorizing the feature amount ξ_X. As the correlation coefficient, the complex correlation coefficient described above, Real(v_X^H r_i) / (|v_X| |r_i|) (1 ≤ i ≤ L), or the like can be used. The vector v_X may be generated by concatenating and vectorizing all the feature amounts (ξ_S or ξ_N) calculated for the individual frequencies, or the feature amounts may be divided appropriately and managed separately. For example, v_X may be generated by vectorizing for each frequency. Each of the L vectors r_1, r_2, ..., r_L stored in the storage unit 141 has the same number of dimensions as v_X.
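The complex correlation coefficient Real(v^H r) / (|v| |r|) can be sketched as below. The function name is hypothetical, and the sketch assumes that neither vector is all zeros, since the source does not say how a zero norm should be handled.

```python
import math


def similarity(v, r):
    """Real part of the normalized inner product v^H r (complex correlation coefficient)."""
    inner = sum(a.conjugate() * b for a, b in zip(v, r))
    norm_v = math.sqrt(sum(abs(a) ** 2 for a in v))
    norm_r = math.sqrt(sum(abs(b) ** 2 for b in r))
    # Assumption: both norms are non-zero.
    return inner.real / (norm_v * norm_r)


v = [1 + 0j, 1j]
stored = [[1 + 0j, 1j], [-1 + 0j, -1j], [1j, 1 + 0j]]
similarities = [similarity(v, r) for r in stored]  # one value per storage area
```

The value lies between -1 and 1: identical directions give 1, opposite directions give -1, and vectors whose inner product is purely imaginary or zero give 0.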

Next, the weight calculation unit 114 calculates weights using the calculated similarities and the feature amount (step S106). For example, the weight calculation unit 114 inputs the L similarities and the vector v_X obtained by vectorizing the feature amount to the neural network N_2. The neural network N_2 outputs three weight vectors W_D, W_W, and W_R, each of dimension L. Each element of each weight vector is a real number greater than or equal to 0, and the elements of each vector sum to 1. The weight vectors W_D, W_W, and W_R correspond to the erase weight, the write weight, and the read weight, respectively.
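The source does not state how the network N_2 enforces the constraints on its outputs (non-negative elements that sum to 1). One common way, shown here purely as an assumption, is to apply a softmax to each group of L raw outputs. The function names and the ordering of the three groups are hypothetical.

```python
import math


def softmax(logits):
    """Map raw scores to non-negative values that sum to 1."""
    peak = max(logits)                        # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_weight_heads(raw, num_slots):
    """Split 3L raw outputs into erase, write, and read weights (assumed order)."""
    if len(raw) != 3 * num_slots:
        raise ValueError("expected 3 * num_slots raw outputs")
    return tuple(softmax(raw[k * num_slots:(k + 1) * num_slots]) for k in range(3))


w_erase, w_write, w_read = split_weight_heads([0.0, 0.0, 1.0, 2.0, -3.0, 5.0], 2)
```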

Whether the number of dimensions of v_X is fixed or arbitrary depends on the configuration of the neural network N_2. For example, when the numbers of input and output dimensions are fixed, as in a fully connected feedforward structure, a fixed number of dimensions common to learning and speech enhancement is used for v_X. On the other hand, when a structure that can be computed regardless of the number of dimensions of v_X, such as a convolutional network, is adopted, the number of dimensions of v_X is arbitrary. Even in that case, unless the feature amounts stored in the storage areas are newly initialized, each subsequently input v_X has the same number of dimensions as the previous input.

Next, the update unit 115 updates the feature amounts stored in the storage unit 141 using the calculated weights (step S107). For example, the update unit 115 updates the matrix R containing the L stored vectors by equation (2) below. Diag(·) denotes the diagonal matrix whose diagonal elements are the elements of the given vector.
$$R \leftarrow R\,\mathrm{Diag}(W_D) + v_X W_W^{H} \qquad (2)$$

The update unit 115 calculates the output φ_X from the updated R by equation (3) below.
$$\phi_X = W_R^{H} R \qquad (3)$$
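Equations (2) and (3) can be sketched column by column. Here R is held as a list of its L stored vectors, so equation (2) becomes r_i ← W_D[i] r_i + W_W[i] v_X for each i (the weights are real, so the conjugate in W_W^H has no effect). Equation (3) is read here as the weighted sum of the stored vectors, which is an interpretation chosen so that the output has the same dimension as v_X. The function names are hypothetical.

```python
def update_memory(stored, v, w_erase, w_write):
    """Equation (2): scale each stored vector by its erase weight, then add the
    new feature vector scaled by the corresponding write weight."""
    return [
        [wd * rk + ww * vk for rk, vk in zip(r, v)]
        for r, wd, ww in zip(stored, w_erase, w_write)
    ]


def read_memory(stored, w_read):
    """Equation (3), interpreted as sum_i w_read[i] * r_i."""
    dim = len(stored[0])
    return [sum(w * r[k] for w, r in zip(w_read, stored)) for k in range(dim)]


stored = [[1.0, 0.0], [0.0, 1.0]]      # L = 2 stored vectors of dimension 2
v_x = [2.0, 2.0]                        # newly calculated feature vector
stored = update_memory(stored, v_x, w_erase=[0.5, 1.0], w_write=[1.0, 0.0])
phi_x = read_memory(stored, w_read=[0.5, 0.5])
```

In this example the first slot is half forgotten and receives the new vector in full, while the second slot is left unchanged.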

When the number of feature amounts stored in the storage unit 141 is 1 (L = 1) and fixed values independent of the input are used as the weights (forgetting factor), the above procedure coincides, up to a constant multiple, with equation (4) below. Equation (4) represents online estimation of the spatial correlation with a forgetting factor, where α is the fixed forgetting factor. The above procedure therefore includes the existing method with a fixed forgetting factor as a special case.
$$R \leftarrow \alpha R + v_X \qquad (4)$$
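The "up to a constant multiple" remark can be made concrete with a short derivation. It sets L = 1 and assumes fixed scalar weights W_D = α, W_W = β, and W_R = γ; the symbols β, γ, and R' are introduced here only for illustration and do not appear in the source.

```latex
\begin{aligned}
R &\leftarrow R\,\mathrm{Diag}(W_D) + v_X W_W^{H} = \alpha R + \beta v_X, \\
R' &:= R/\beta \;\Longrightarrow\; R' \leftarrow \alpha R' + v_X
      \quad\text{(the recursion of equation (4))}, \\
\phi_X &= W_R^{H} R = \gamma R = \gamma\beta\, R' .
\end{aligned}
```

Under these assumptions the output φ_X is the estimate of equation (4) multiplied by the constant γβ.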

In this embodiment, calculating the weights by the above procedure makes it possible to control the weights flexibly in adaptation to the input, compared with the existing method.

The output φ_X can be regarded as an estimate of the spatial correlation when each feature amount stored in the storage unit 141 can be regarded as a spatial correlation. To this end, the information stored in the storage unit 141 must be initialized so that it can be regarded as an estimate of the spatial correlation. For example, initial values obtained by adding cc^H, for a random complex vector c, to each storage area a sufficient number of times satisfy this condition. The number of dimensions of c is equal to the number of input channels. A sufficient number of times is, for example, about twice the number of dimensions of c. The complex vector can be sampled by, for example, drawing its real and imaginary parts from a uniform distribution over the range -1 to 1.
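The initialization described above might look like the following sketch. It assumes that a fresh random vector c is drawn for every addition (the source does not say whether c is redrawn) and that each storage area holds a channel-by-channel matrix before vectorization; the function name and parameters are hypothetical.

```python
import random


def init_storage(num_slots, num_channels, rng, repeats=None):
    """Fill each storage area with a sum of random outer products c c^H."""
    if repeats is None:
        repeats = 2 * num_channels          # "about twice the number of dimensions of c"
    slots = []
    for _ in range(num_slots):
        acc = [[0j] * num_channels for _ in range(num_channels)]
        for _ in range(repeats):
            # Real and imaginary parts drawn uniformly from [-1, 1].
            c = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1))
                 for _ in range(num_channels)]
            for i in range(num_channels):
                for j in range(num_channels):
                    acc[i][j] += c[i] * c[j].conjugate()
        slots.append(acc)
    return slots


slots = init_storage(num_slots=3, num_channels=2, rng=random.Random(0))
```

Each resulting matrix is Hermitian with a non-negative real diagonal, which is the structure a spatial correlation estimate is expected to have.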

In this way, outputs φ_S and φ_N with equal numbers of dimensions are obtained for speech and noise, respectively.

The signal processing unit 121 executes signal processing using these outputs (step S108). For example, the signal processing unit 121 designs a filter f from the outputs φ_S and φ_N under the maximum-SNR criterion. This design can be solved as a generalized eigenvalue problem. For example, the signal processing unit 121 can generate the filter by the method described in Non-Patent Document 1. The signal processing unit 121 applies the generated filter to the time-frequency representation of the mixed speech, further applies BAN if necessary, and outputs the time-frequency representation of the noise-suppressed speech.
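To illustrate the maximum-SNR criterion, the sketch below finds the filter f that maximizes (f^H Φ_S f) / (f^H Φ_N f) for one frequency, which is the principal eigenvector of the generalized eigenvalue problem Φ_S f = λ Φ_N f. It uses plain power iteration on Φ_N^{-1} Φ_S as a stand-in chosen for brevity; this is not the method of Non-Patent Document 1. It also assumes that φ_S and φ_N have been reshaped into square matrices for that frequency, that Φ_N is invertible, and that the start vector is not orthogonal to the solution. All function names are hypothetical.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [list(row) + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0j] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x


def matvec(a, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def max_snr_filter(phi_s, phi_n, iterations=200):
    """Power iteration for the principal generalized eigenvector of (phi_s, phi_n)."""
    f = [1 + 0j] * len(phi_s)
    for _ in range(iterations):
        f = solve(phi_n, matvec(phi_s, f))
        norm = math.sqrt(sum(abs(x) ** 2 for x in f))
        f = [x / norm for x in f]
    return f


def snr_gain(f, phi_s, phi_n):
    """Ratio of filtered speech power to filtered noise power."""
    num = sum(fi.conjugate() * y for fi, y in zip(f, matvec(phi_s, f))).real
    den = sum(fi.conjugate() * y for fi, y in zip(f, matvec(phi_n, f))).real
    return num / den


phi_s = [[1, 0], [0, 0]]          # speech arriving only on the first channel
phi_n = [[2, 1j], [-1j, 2]]       # correlated noise
f = max_snr_filter(phi_s, phi_n)
gain = snr_gain(f, phi_s, phi_n)
```

In practice a library routine for the Hermitian generalized eigenvalue problem would normally replace the hand-written iteration.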

Next, the learning unit 122 updates the parameters of the neural networks using the processing result of the signal processing unit 121 (step S109). For example, the learning unit 122 calculates the SN ratio of the time-frequency representation of the noise-suppressed speech calculated by the signal processing unit 121. With s(t, ω) denoting the signal containing only speech and n(t, ω) the signal containing only noise, the SN ratio E_CN is obtained by equation (5) below.
$$E_{CN} = \left| \frac{f^{H} s}{f^{H} n} \right| \qquad (5)$$
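Equation (5) evaluated for a single time-frequency bin can be written as the following sketch. The function name is hypothetical, and how the per-bin values are aggregated over time and frequency is not specified in the source, so it is left out.

```python
def output_snr(f, s, n):
    """Equation (5): |f^H s / f^H n| for one time-frequency bin.

    f: filter, s: speech-only signal, n: noise-only signal (one value per channel).
    """
    filtered_speech = sum(fi.conjugate() * si for fi, si in zip(f, s))
    filtered_noise = sum(fi.conjugate() * ni for fi, ni in zip(f, n))
    return abs(filtered_speech / filtered_noise)


e_cn = output_snr(f=[1 + 0j, 1j], s=[2 + 0j, 0j], n=[0j, 1 + 0j])
```

Because learning data is generated by mixing known speech and noise, both s and n are available separately at learning time, which is what makes this quantity computable.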

The learning unit 122 obtains the derivative of the calculated SN ratio and updates the parameters of the neural networks N_1 and N_2 by, for example, error backpropagation. In the update, a value corrected by applying Adam or the like may be used instead of the raw derivative.

To stabilize the parameter update of the neural network N_1, the cross-entropy error between a correct mask reflecting the SN ratio and the calculated speech mask m_S(t, ω) or noise mask m_N(t, ω) may be added to the evaluation value before the parameters are updated.

The correct mask is created, for example, by the following rule: the speech mask is set to 1 when the SN ratio is equal to or higher than an upper limit (for example, 10 dB), the noise mask is set to 1 when the SN ratio is equal to or lower than a lower limit (for example, -10 dB), and the masks are set to 0 otherwise.
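The thresholding rule and the auxiliary cross-entropy term can be sketched as follows. The sketch assumes the SN ratio is evaluated per time-frequency bin and uses the binary form of the cross-entropy with a small clipping constant `eps`; both choices, and the function names, are assumptions.

```python
import math


def target_masks(snr_db, upper_db=10.0, lower_db=-10.0):
    """Return (speech_mask, noise_mask) targets for one bin from its SN ratio in dB."""
    speech = 1.0 if snr_db >= upper_db else 0.0
    noise = 1.0 if snr_db <= lower_db else 0.0
    return speech, noise


def binary_cross_entropy(target, predicted, eps=1e-7):
    """Cross-entropy between a 0/1 target and a predicted mask value in [0, 1]."""
    p = min(max(predicted, eps), 1.0 - eps)   # clip to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


speech_target, noise_target = target_masks(12.0)
loss = binary_cross_entropy(speech_target, 0.5)
```

Bins whose SN ratio falls between the two limits receive 0 in both masks, so ambiguous bins count as neither clearly speech nor clearly noise.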

The learning unit 122 repeats the above processing until learning converges. For example, the learning unit 122 determines whether an end condition is satisfied (step S110). Any end condition may be used; for example, the following conditions can be applied.
- Learning is regarded as converged when the number of updates reaches a fixed value (for example, 1,000,000).
- Each time the number of updates reaches a fixed value (for example, 1,000,000), whether the SN ratio has improved is evaluated using the average SN ratio on evaluation data. Learning is regarded as converged when no improvement is seen for a predetermined number of consecutive evaluations (for example, 5). The learning unit 122, for example, sets aside part of the learning data without using it for learning and uses it as the evaluation data.
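The two example end conditions can be expressed as small predicates. The names and the exact definition of "no improvement" (no recent evaluation exceeding the best earlier one) are assumptions made for this sketch.

```python
def reached_update_limit(num_updates, limit=1_000_000):
    """First condition: stop once the update count reaches a fixed value."""
    return num_updates >= limit


def stalled(eval_snr_history, patience=5):
    """Second condition: stop when the last `patience` evaluations did not
    improve on the best average SN ratio seen before them."""
    if len(eval_snr_history) <= patience:
        return False                      # not enough evaluations yet
    best_before = max(eval_snr_history[:-patience])
    return max(eval_snr_history[-patience:]) <= best_before


history = [1.0, 2.0, 3.0, 3.0, 2.9, 3.0, 2.5, 3.0]  # average SN ratio per evaluation
stop = stalled(history)
```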

When the end condition is not satisfied (step S110: No), the process returns to step S103 and is repeated. When the end condition is satisfied (step S110: Yes), the learning unit 122 stores the updated parameters in, for example, the storage unit 141.

Next, the process of calculating and updating the feature amounts is described further. FIG. 4 is a diagram for explaining the flow of the process of calculating and updating the feature amounts.

The analysis unit 111 and the feature amount calculation unit 112 calculate a feature amount from the input signal. The feature amount is represented by, for example, a spatial correlation matrix that expresses the spatial correlation between the signals of a plurality of channels. The feature amount is vectorized into v_X.

Meanwhile, the storage unit 141 stores L vectors r_1, r_2, ..., r_L with the same number of dimensions as v_X. As a whole, the storage unit 141 stores the matrix R = {r_1, r_2, ..., r_L} in which the L vectors are arranged.

The similarity calculation unit 113 calculates the similarity between the vector v_X and each of the L vectors. The calculated similarities are input to the neural network, which outputs the weights. The number of dimensions of each weight is L, corresponding to the L vectors. The output weights include at least the weight for the stored feature amounts (the erase weight).

The update unit 115 updates the feature amounts stored in the storage unit 141 using the output weights, the calculated feature amount, and the feature amounts stored in the storage unit 141, and uses the updated feature amounts to calculate the feature amount φ_X (φ_S and φ_N) for signal processing.

As described above, the signal processing device according to the first embodiment uses a weight for the stored feature amounts (the erase weight) and can therefore realize the same function as a conventional forgetting function. Furthermore, because the weights are calculated according to the similarity between the calculated feature amount and the stored feature amounts, the information (feature amounts) used for signal processing can be calculated with higher accuracy.

(Second Embodiment)
The signal processing device according to the second embodiment executes signal processing (for example, speech enhancement processing) using a model whose parameters have been learned by, for example, the signal processing device of the first embodiment. A single device may be configured to provide both the functions of the signal processing device of the first embodiment (the device that executes the learning process) and the functions of the signal processing device of this embodiment.

FIG. 5 is an explanatory diagram illustrating an example of the hardware configuration of the signal processing device 100-2 according to the second embodiment.

The signal processing device 100-2 includes a CPU 61, a ROM 62, a RAM 63, a storage device 64, an operation device 65, an input device 66, and an output device 67, which are connected via a bus.

The functions of the CPU 61, the ROM 62, the RAM 63, the storage device 64, and the operation device 65 are the same as those of the corresponding components of the signal processing device 100, and their description is omitted.

The input device 66 is, for example, a microphone array for inputting speech. The input device 66 acquires a plurality of independent signals from the plurality of microphones that constitute the microphone array.

The output device 67 is a device for outputting various kinds of information. For example, the output device 67 is one or more audio output devices such as a speaker, earphones, and headphones. An audio output device converts an electrical signal into vibration of the air and outputs it. The output device 67 may be a display, which displays, for example, a speech recognition result.

FIG. 6 is a block diagram illustrating an example of the configuration of the signal processing device 100-2 according to the second embodiment. As illustrated in FIG. 6, the signal processing device 100-2 includes a reception unit 131-2, an analysis unit 111, a feature amount calculation unit 112, a similarity calculation unit 113, a weight calculation unit 114, an update unit 115, a signal processing unit 121, and a storage unit 141.

The second embodiment differs from the first embodiment in that the generation unit 101 and the learning unit 122 are removed and the reception unit 131-2 is added. The other components and functions are the same as in FIG. 2, the block diagram of the signal processing device 100 according to the first embodiment; they are given the same reference numerals and their description is omitted here.

The reception unit 131-2 receives input of the information to be subjected to signal processing and outputs it to the analysis unit 111. For example, the reception unit 131-2 receives an input signal that is multi-channel waveform data acquired by the microphone array. The reception unit 131-2 digitizes the input signal by AD (analog-to-digital) conversion and stores the digitized signal in, for example, a work area in the storage unit 141. The reception unit 131-2 outputs the digitized signal to the analysis unit 111.

The processing from the analysis unit 111 onward is the same as in the first embodiment. The signal processing unit 121 outputs the processing result for the received waveform data. For example, the signal processing unit 121 outputs the time-frequency representation (spectrum) of the noise-suppressed speech. The signal processing unit 121 may output the processing result after converting it into the format used in subsequent processing. For example, the signal processing unit 121 may convert the enhanced spectrum into an output waveform by overlap-add with a synthesis window applied and output the waveform. When a speech recognition system is connected at the subsequent stage, the spectrum may be output directly without conversion into a waveform.

Next, signal processing by the signal processing device 100-2 according to the second embodiment configured as described above is described with reference to FIG. 7. FIG. 7 is a flowchart illustrating an example of the signal processing in the second embodiment.

First, when the start of signal processing is instructed via the operation device 65 or the like, the reception unit 131-2 executes initialization processing (step S201). For example, the reception unit 131-2 reserves, in the storage unit 141, a storage area for the learned parameters and a storage area for storing feature amounts.

The reception unit 131-2 receives input of multi-channel signals acquired by, for example, the microphone array (step S202). The reception unit 131-2 digitizes the signals by AD conversion and stores the digitized waveforms in the storage unit 141.

Steps S203 to S208 are the same as steps S103 to S108 in the signal processing device 100 according to the first embodiment, and their description is omitted.

The signal processing in step S208 yields the processing result (for example, the spectrum of the enhanced speech). The above procedure is repeated until the end of operation is instructed. For example, the reception unit 131-2 determines whether the end of operation has been instructed via the operation device 65 or the like (step S209). When the end of operation has not been instructed (step S209: No), the processing is repeated from step S202 for the next input signal. When the end of operation has been instructed (step S209: Yes), the signal processing ends.

At the end of processing, the feature amounts stored in the storage areas of the storage unit 141 may be saved to another, nonvolatile storage medium (for example, the storage device 64). The feature amounts saved in this storage medium may then be read at the next startup as initial values and set in the storage unit 141. This makes it possible to omit the initialization of the storage areas of the storage unit 141.

As described above, the signal processing device according to the second embodiment makes it possible to apply the same technique as in the first embodiment during signal processing such as speech enhancement.

As described above, according to the first and second embodiments, the information (feature amounts) used for signal processing can be calculated with higher accuracy.

The program executed by the signal processing devices of the above embodiments (the signal processing device 100 and the signal processing device 100-2) is provided by being incorporated in advance in the ROM 52 or the like.

The program executed by the signal processing device may be recorded, as a file in an installable or executable format, on a computer-readable recording medium such as a CD-ROM (Compact Disk Read Only Memory), a flexible disk (FD), a CD-R (Compact Disk Recordable), or a DVD (Digital Versatile Disk), and provided as a computer program product.

Furthermore, the program executed by the signal processing device may be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. The program executed by the signal processing device may also be provided or distributed via a network such as the Internet.

The program executed by the signal processing device can cause a computer to function as each unit of the signal processing device described above. In this computer, the CPU can read the program from a computer-readable storage medium onto the main storage device and execute it.

Although several embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and are included in the invention described in the claims and its equivalents.

100, 100-2 Signal processing apparatus
101 Generation unit
111 Analysis unit
112 Feature amount calculation unit
113 Similarity calculation unit
114 Weight calculation unit
115 Update unit
121 Signal processing unit
122 Learning unit
131-2 Reception unit
141 Storage unit

Claims

A signal processing apparatus comprising:
a storage unit that stores a first feature amount representing a feature of a first input signal;
a similarity calculation unit that calculates a similarity between the first feature amount and a second feature amount representing a feature of a second input signal;
a weight calculation unit that calculates a first weight for the first feature amount based on the similarity and the second feature amount;
an update unit that calculates a third feature amount based on the first feature amount multiplied by the first weight and on the second feature amount, and updates the first feature amount stored in the storage unit with the third feature amount; and
a signal processing unit that executes signal processing using the updated first feature amount.
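By way of a non-limiting illustration, the flow recited above (store, compare, weight, update, process) can be sketched as follows. The claims do not fix any formula, so everything concrete here is an assumption made for illustration: the cosine similarity, the sigmoid form of `first_weight` and its constants, the additive update, and the inner-product `process` placeholder. The names `FeatureStore`, `update`, and `process` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Assumed similarity measure between two real feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def first_weight(similarity, second):
    """Hypothetical weight rule depending on the similarity and the second feature."""
    energy = sum(y * y for y in second) / len(second)
    return 1.0 / (1.0 + math.exp(-(4.0 * similarity - energy)))


class FeatureStore:
    """Plays the role of the storage unit holding the first feature amount."""

    def __init__(self, first):
        self.first = list(first)


def update(store, second):
    """Compute the third feature amount and overwrite the stored first one."""
    s = cosine_similarity(store.first, second)
    w = first_weight(s, second)
    # Assumed combination: weighted first feature plus the second feature.
    third = [w * x + y for x, y in zip(store.first, second)]
    store.first = third
    return s, w


def process(store, signal):
    """Placeholder signal processing: inner product with the stored feature."""
    return sum(f * x for f, x in zip(store.first, signal))
```

A caller would invoke `update(store, second)` once per newly observed feature amount and then `process(store, signal)`, so that the signal processing always uses the most recently updated first feature amount.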

The signal processing apparatus according to claim 1, wherein the weight calculation unit calculates the first weight using a model that receives the similarity and the second feature amount as input and outputs the first weight.

The signal processing apparatus according to claim 2, wherein the model is a neural network.
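As a non-limiting sketch of a model that receives the similarity and the second feature amount and outputs the first weight, a small feed-forward network might look like the following. The architecture (one tanh hidden layer, a sigmoid output that keeps the weight between 0 and 1), the function name `mlp_weight`, and all parameter names are assumptions for illustration; the claims do not specify a network structure.

```python
import math


def mlp_weight(similarity, second, w_hidden, b_hidden, w_out, b_out):
    """Hypothetical one-hidden-layer network mapping (similarity, second) to a weight.

    w_hidden: one row of input weights per hidden unit
    b_hidden: one bias per hidden unit
    w_out:    one output weight per hidden unit
    b_out:    scalar output bias
    """
    # Input vector: the similarity followed by the second feature amount.
    x = [similarity] + list(second)
    hidden = [
        math.tanh(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    # The sigmoid keeps the output weight strictly between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))
```

In a trained system the parameters would come from a learning process rather than being set by hand.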

The signal processing apparatus according to claim 2, further comprising a learning unit that evaluates the processing result of the signal processing using learning data and updates parameters of the model.

The signal processing apparatus according to claim 4, further comprising a generation unit that generates learning data including a third input signal and reference data representing the processing result of the signal processing, wherein the learning unit executes a learning process using the generated learning data.

The signal processing apparatus according to claim 1, wherein the first feature amount, the second feature amount, and the third feature amount are spatial correlations based on a plurality of input signals input from different positions in space.
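For illustration only, a spatial correlation of multichannel input signals is commonly estimated as the time average of the outer product of the channel vector with its conjugate. The sketch below assumes that common definition and assumes the input is a list of complex channel vectors for a single frequency bin; the function name `spatial_correlation` is hypothetical.

```python
def spatial_correlation(frames):
    """Average of x * x^H over frames; each frame is a list of complex channel values."""
    n = len(frames[0])
    acc = [[0j] * n for _ in range(n)]
    for x in frames:
        for i in range(n):
            for j in range(n):
                # Element (i, j) correlates channel i with the conjugate of channel j.
                acc[i][j] += x[i] * x[j].conjugate()
    return [[v / len(frames) for v in row] for row in acc]
```

The resulting matrix is Hermitian, and its off-diagonal phases carry the inter-channel delay information that direction-dependent processing such as beamforming relies on.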

The signal processing apparatus according to claim 1, wherein the weight calculation unit further calculates a second weight for the second feature amount based on the similarity and the second feature amount, and the update unit calculates the third feature amount based on the first feature amount multiplied by the first weight and the second feature amount multiplied by the second weight.
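As a non-limiting sketch of an update that applies a weight to each feature amount, the third feature amount could be formed as a weighted sum. The sigmoid rule in `two_weights`, its `gain` parameter, and the choice to make the two weights sum to one are assumptions for illustration only; the claims require only that both weights be based on the similarity and the second feature amount.

```python
import math


def two_weights(similarity, second, gain=4.0):
    """Hypothetical rule returning (first weight, second weight)."""
    energy = sum(v * v for v in second) / len(second)
    w1 = 1.0 / (1.0 + math.exp(-(gain * similarity - energy)))
    # Assumed constraint: the two weights are complementary.
    return w1, 1.0 - w1


def blend(first, second, w1, w2):
    """Third feature amount as the weighted sum of the two feature amounts."""
    return [w1 * a + w2 * b for a, b in zip(first, second)]
```

With complementary weights the update behaves like an exponential moving average whose forgetting factor varies with how similar the new observation is to the stored one.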

The signal processing apparatus according to claim 1, wherein the weight calculation unit further calculates a third weight for the first feature amount read from the storage unit based on the similarity and the second feature amount, and the signal processing unit executes the signal processing using the first feature amount multiplied by the third weight.

A speech enhancement device comprising:
a storage unit that stores a first feature amount representing a feature of a first input signal including audio signals of a plurality of channels;
a similarity calculation unit that calculates a similarity between the first feature amount and a second feature amount representing a feature of a second input signal including audio signals of a plurality of channels;
a weight calculation unit that calculates a first weight for the first feature amount based on the similarity and the second feature amount;
an update unit that calculates a third feature amount based on the first feature amount multiplied by the first weight and on the second feature amount, and updates the first feature amount stored in the storage unit with the third feature amount; and
a signal processing unit that executes, using the updated first feature amount, signal processing for enhancing a part of the audio signals of the plurality of channels.

A signal processing method comprising:
a storage step of storing, in a storage unit, a first feature amount representing a feature of a first input signal;
a similarity calculation step of calculating a similarity between the first feature amount and a second feature amount representing a feature of a second input signal;
a weight calculation step of calculating a first weight for the first feature amount based on the similarity and the second feature amount;
an update step of calculating a third feature amount based on the first feature amount multiplied by the first weight and on the second feature amount, and updating the first feature amount stored in the storage unit with the third feature amount; and
a signal processing step of executing signal processing using the updated first feature amount.

A program for causing a computer to execute:
a storage step of storing, in a storage unit, a first feature amount representing a feature of a first input signal;
a similarity calculation step of calculating a similarity between the first feature amount and a second feature amount representing a feature of a second input signal;
a weight calculation step of calculating a first weight for the first feature amount based on the similarity and the second feature amount;
an update step of calculating a third feature amount based on the first feature amount multiplied by the first weight and on the second feature amount, and updating the first feature amount stored in the storage unit with the third feature amount; and
a signal processing step of executing signal processing using the updated first feature amount.