JP2020092381A

JP2020092381A - Sound acquisition device, sound acquisition method, and sound acquisition program

Info

Publication number: JP2020092381A
Application number: JP2018230153A
Authority: JP
Inventors: 龍一宇治橋; Ryuichi Ujihashi; 純一内田; Junichi Uchida; 貴大中代; Takahiro Nakadai
Original assignee: Olympus Corp
Current assignee: Olympus Corp
Priority date: 2018-12-07
Filing date: 2018-12-07
Publication date: 2020-06-11

Abstract

To emphasize a sound emitted by a sound collection target located at a predetermined distance by using first and second microphones. SOLUTION: A sound acquisition device includes a first microphone that is arranged at a first distance from a sound collection target and collects a sound emitted from the sound collection target to acquire a first sound, a second microphone that is arranged at a second distance, different from the first distance, from the sound collection target and collects the sound emitted from the sound collection target to acquire a second sound, and a sound enhancement unit that performs enhancement processing to emphasize, of the first and second sounds, a component based on the distance difference between the first distance and the second distance. SELECTED DRAWING: Figure 1

Description

The present invention relates to a sound acquisition device, a sound acquisition method, and a sound acquisition program that acquire sound by using first and second microphones.

Conventionally, many mobile devices such as digital cameras and smartphones have an audio recording function and a video recording function. A mobile device of this type includes an imaging unit that images a subject and a built-in microphone that picks up surrounding sound, and some can record AV data including video and audio when shooting a moving image.

Furthermore, some mobile devices can acquire, via a cable or wireless communication, audio data picked up by an external microphone, and devices that can record the sound acquired by the external microphone together with images from a built-in imaging unit have been commercialized. For example, when the external microphone can be placed at a position away from the mobile device that images the subject, placing the external microphone near the subject that is the sound collection target makes it possible to obtain sound with a good S/N ratio from the external microphone.

Patent Document 1: JP 2005-151471 A

However, when collecting the call of a wild bird, for example, it is not always possible to install a microphone in the immediate vicinity of the bird that is the sound collection target, and the microphone may have to be installed at a position relatively far from the target. In that case, the microphone picks up not only the sound from the sound collection target but also, as noise, sounds near the microphone and sounds emitted from sources other than the target.

Patent Document 1 discloses a video conference system using a plurality of microphones and cameras that accurately selects the speaker. However, this system includes a voiceprint authentication unit that authenticates conference participants whose voiceprints are registered, an imaging adjustment unit that controls a television camera device so as to optimally capture the speaker, and so on. The system is therefore large in scale, and it cannot select and record the sound of an arbitrary sound collection target.

An object of the present invention is to provide a sound acquisition device, a sound acquisition method, and a sound acquisition program that can emphasize a sound emitted by a sound collection target located at a predetermined distance by using first and second microphones.

A sound acquisition device according to one aspect of the present invention includes: a first microphone that is arranged at a first distance from a sound collection target and collects a sound emitted from the sound collection target to acquire a first sound; a second microphone that is arranged at a second distance, different from the first distance, from the sound collection target and collects the sound emitted from the sound collection target to acquire a second sound; and a sound enhancement unit that performs enhancement processing to emphasize, of the first and second sounds, a component based on the distance difference between the first distance and the second distance.

A sound acquisition method according to one aspect of the present invention includes: collecting, with a first microphone arranged at a first distance from a sound collection target, a sound emitted from the sound collection target to acquire a first sound; collecting, with a second microphone arranged at a second distance, different from the first distance, from the sound collection target, the sound emitted from the sound collection target to acquire a second sound; and performing enhancement processing to emphasize, of the first and second sounds, a component based on the distance difference between the first distance and the second distance.

A sound acquisition program according to one aspect of the present invention causes a computer to execute a procedure of: collecting, with a first microphone arranged at a first distance from a sound collection target, a sound emitted from the sound collection target to acquire a first sound; collecting, with a second microphone arranged at a second distance, different from the first distance, from the sound collection target, the sound emitted from the sound collection target to acquire a second sound; and performing enhancement processing to emphasize, of the first and second sounds, a component based on the distance difference between the first distance and the second distance.

According to the present invention, it is possible to provide a sound acquisition device, a sound acquisition method, a sound acquisition program, and a sound acquisition system that can emphasize a sound emitted by a sound collection target located at a predetermined distance by using first and second microphones.

A block diagram showing a sound acquisition device according to a first embodiment of the present invention.
An explanatory diagram showing how sound is collected.
An explanatory diagram, with time on the horizontal axis and amplitude on the vertical axis, for explaining the time lag of the collected sound.
An explanatory diagram, with time on the horizontal axis and amplitude on the vertical axis, for explaining the time lag of the collected sound.
An explanatory diagram for explaining an example of a method of selecting the target sound.
A flowchart for explaining the operation of the camera 1.
A flowchart for explaining the operation of the recorder 2.
A flowchart for explaining the operation of the playback device 3.
An explanatory diagram for explaining the state of the camera 1 during shooting.
An explanatory diagram showing the state of the playback device 3 during playback.
A block diagram showing a second embodiment of the present invention.
An explanatory diagram showing the positional relationship of the bird 41, the grass 42, the recorder 2, and the camera 1 on XY coordinates.
An explanatory diagram, with time on the horizontal axis and amplitude on the vertical axis, for explaining the time lag of the collected sound.
An explanatory diagram, with time on the horizontal axis and amplitude on the vertical axis, for explaining the time lag of the collected sound.
A flowchart showing control of the playback device.
A block diagram showing a sound acquisition device according to a third embodiment of the present invention.
An explanatory diagram showing the state during sound collection and recording.
A flowchart showing control of the camera.
A flowchart showing control of the recorder.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

(First embodiment)
FIG. 1 is a block diagram showing a sound acquisition device according to a first embodiment of the present invention. The present embodiment employs first and second microphones that collect the sound to be collected from a sound collection target (hereinafter referred to as the target sound). By delaying the collected sound based on the difference between the distance from the sound collection target to the first microphone and the distance to the second microphone, it realizes enhancement processing that emphasizes the sound of a sound collection target located at predetermined distances from the first and second microphones. It also makes it possible to separate the target sound, emitted from a sound collection target located at a predetermined distance from the first microphone, from noise generated at other positions, extract it, and collect it selectively.

The example of FIG. 1 describes a case in which one of the first and second microphones is built into a camera serving as an imaging device with a sound collection function, and the other is built into a recorder. The first and second microphones may be provided in devices of any configuration. The sound acquisition device of the present embodiment may be configured in the camera, in the recorder, distributed between the camera and the recorder, or as a device independent of these. FIG. 1 shows an example in which the sound acquisition device is distributed among a camera, a recorder, and a playback device.

First, how sound is collected in the first embodiment will be described with reference to FIGS. 2, 3A, 3B, and 4. FIG. 2 is an explanatory diagram showing how sound is collected, and FIGS. 3A and 3B are explanatory diagrams, with time on the horizontal axis and amplitude on the vertical axis, for explaining the time lag of the collected sound. FIG. 4 is an explanatory diagram for explaining an example of a method of selecting the target sound.

The camera 1 shown in FIG. 2 has a housing 1a that houses the circuits of FIG. 1. A shutter button 15a, which is part of an operation unit 15 described later, is provided on the upper surface of the housing 1a. The recorder 2 shown in FIG. 2 likewise has a housing 2a that houses the circuits of FIG. 1.

In the example of FIG. 2, a bird 41 is perched on a branch of a tree 43a. When shooting and collecting the sound of wild birds, for example, the recorder 2 may be installed in advance near a tree where wild birds are likely to perch. The user carrying the camera 1, on the other hand, shoots from a position relatively far from the bird so as not to scare it away and to avoid obstacles.

For video, a sufficiently high-quality image can be obtained even from a position relatively far from the subject by using a telephoto lens or the like. For audio, however, noise increases as the distance from the subject grows, and the collection quality of the target sound deteriorates. For this reason, the recorder 2 is installed at a position closer to the sound collection target.

In the example of FIG. 2, the user holding the camera 1 stands at a position relatively far from the bird 41, which is the subject, and uses the gap between the tree 43a and a tree 43b to shoot the bird 41 and collect its sound. The recorder 2, meanwhile, is mounted on a tripod installed relatively close to the tree 43a and collects the call of the bird 41 as the target sound at a position relatively close to the bird 41. The frame surrounding the bird 41 in FIG. 2 indicates the shooting range of the camera 1, and an image 41a of the captured bird 41 is displayed on a display screen 16a of a display unit 16 of the camera 1, described later.

Grass 42 grows on the ground between the camera 1 and the bird 41. Therefore, an ST sound collection unit 12 of the camera 1, described later, picks up not only the call of the bird 41, which is the target sound, but also the rustling of the grass 42 in the wind and other sounds (hereinafter, sounds other than the target sound are referred to as noise). The recorder 2 likewise picks up not only the call of the bird 41 but also other ambient noise. To simplify the description below, it is assumed that, within the range in which the camera 1 and the recorder 2 can collect sound, the only sounds generated are the target sound and the noise from the grass 42.

The target sound emitted from the bird 41 is collected by the camera 1 and the recorder 2. Consider mainly the target sound A1, which is the direct sound reaching the camera 1, and the target sound A2, which is the direct sound reaching the recorder 2. The target sounds A1 and A2 differ in amplitude, but they can be regarded as having the same frequency and substantially the same frequency characteristics. That is, if amplitude is ignored, the target sounds A1 and A2 can be regarded as differing only in arrival time, owing to the difference in propagation distance.

FIGS. 3A and 3B assume that the call of the bird 41, which is the target sound, and noise such as the rustling of the grass 42 have been separated, and show the waveforms of these sounds when they reach the camera 1 and the recorder 2. FIG. 3A shows the target sound. Relative to the target sound at the position of the recorder 2, the target sound at the position of the camera 1 differs only by an arrival time delay and a reduction in amplitude due to the difference in distance. In the example of FIG. 3A, the arrival time difference is Ta, and the target sounds reaching the recorder 2 and the camera 1 are shifted in phase by an amount corresponding to Ta.

FIG. 3B shows the noise from the grass 42. The noise at the position of the recorder 2 and the noise at the position of the camera 1 differ by an arrival time delay and a change in amplitude due to the difference in distance. In the example of FIG. 3B, the arrival time difference is Tb, and the waveforms of the noise reaching the recorder 2 and the camera 1 are shifted by an amount corresponding to Tb.

Therefore, in the present embodiment, the sound collected by the recorder 2 is delayed by the arrival time difference Ta and added to the sound collected by the camera 1. As a result, among the components contained in the sounds collected by the camera 1 and the recorder 2, the target sound, whose arrival time difference is Ta, is added with its phases aligned and is reinforced (its amplitude increases). When the sounds collected by the camera 1 and the recorder 2 are recorded, playing back the recorded sounds with their playback timings shifted by the arrival time difference Ta also allows the target sound to be reproduced with its phases aligned. In the following description, the processing of delaying at least one of the sound collected by the camera 1 and the sound collected by the recorder 2 to cancel the arrival time difference and align the phases of the target sound is referred to as processing that shifts the sound on the time axis (time-axis shift processing).
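The time-axis shift followed by addition can be illustrated with a minimal sketch. It assumes both signals are sampled at the same rate, that the arrival time difference Ta has already been converted to a whole number of samples, and that the camera receives the same waveform at half the amplitude. The function name `shift_and_add` and the synthetic signals are hypothetical and are not taken from the patent.

```python
import math


def shift_and_add(camera_signal, recorder_signal, delay_samples):
    """Delay the recorder signal by delay_samples, then add it to the camera signal."""
    if delay_samples < 0:
        raise ValueError("delay_samples must be non-negative")
    delayed = [0.0] * delay_samples + list(recorder_signal)
    length = min(len(camera_signal), len(delayed))
    return [camera_signal[i] + delayed[i] for i in range(length)]


def peak(signal):
    """Largest absolute sample value."""
    return max(abs(x) for x in signal)


# Synthetic target sound: the camera hears it DELAY samples later and weaker.
DELAY = 5  # assumed arrival time difference Ta, in samples
source = [math.sin(2 * math.pi * n / 8) for n in range(32)]
recorder_signal = source + [0.0] * DELAY
camera_signal = [0.0] * DELAY + [0.5 * x for x in source]

aligned = shift_and_add(camera_signal, recorder_signal, DELAY)
unaligned = shift_and_add(camera_signal, recorder_signal, 0)
```

With the correct delay the two copies of the target sound add in phase, so `peak(aligned)` exceeds `peak(unaligned)`. A component with a different arrival time difference, such as the noise with difference Tb, would not be aligned by the same shift.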

FIG. 4 illustrates this processing. It shows the waveform of the sound collected by the recorder 2, the waveform of the sound collected by the camera 1 with the arrival time difference Ta relative to that sound, and the waveform obtained by shifting these waveforms by the arrival time difference Ta on the time axis and adding them. For the target sound collected by the camera 1 and the recorder 2 with the arrival time difference Ta, shifting on the time axis and then synthesizing (adding) yields a synthesized sound with a relatively large peak value.

The arrival time difference can be calculated from the distance and the speed of sound. For example, if the sound collection target, the recorder 2, and the camera 1 are assumed to lie on a straight line, the arrival time difference can be calculated from the distance between the camera 1 and the recorder 2 and the speed of sound. When the camera 1 is a digital camera or the like, it is often capable of distance measurement, so the arrival time difference can be calculated by having the camera 1 measure the distance to the recorder 2. The example of FIG. 2 shows that the distance between the camera 1 and the recorder 2 obtained in this way is 8 m.
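As a worked example of this calculation, the sketch below converts the 8 m spacing of FIG. 2 into an arrival time difference and a sample delay. The speed of sound (343 m/s, roughly dry air at 20 °C) and the 48 kHz sample rate are assumptions for illustration, and the function names are hypothetical.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value; the patent does not state one


def arrival_time_difference(mic_spacing_m, speed_m_s=SPEED_OF_SOUND_M_S):
    """Arrival time difference in seconds when target, recorder and camera are collinear."""
    if mic_spacing_m < 0 or speed_m_s <= 0:
        raise ValueError("spacing must be >= 0 and speed must be > 0")
    return mic_spacing_m / speed_m_s


def delay_in_samples(mic_spacing_m, sample_rate_hz, speed_m_s=SPEED_OF_SOUND_M_S):
    """Arrival time difference rounded to the nearest whole sample."""
    return round(arrival_time_difference(mic_spacing_m, speed_m_s) * sample_rate_hz)


ta_seconds = arrival_time_difference(8.0)   # about 0.0233 s
ta_samples = delay_in_samples(8.0, 48_000)  # 1120 samples at an assumed 48 kHz
```

Under these assumptions the 8 m spacing corresponds to roughly 23 ms, which is the amount by which the recorder's sound would be delayed before addition.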

Note that when, for example, the recorder 2 is placed a few meters from the sound collection target and the camera 1 is about 100 m away from the target, the error in calculating the arrival time difference is considered to be relatively small as long as the sound collection target, the recorder 2, and the camera 1 lie on a substantially straight line.
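The size of this error can be estimated with simple geometry. The sketch below assumes a flat 2D layout with the target at the origin and the recorder displaced from the target-to-camera line by a given angle, then compares the delay estimated from the camera-to-recorder distance with the true delay. The layout, the 343 m/s speed of sound, and the function name are illustrative assumptions.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value


def delay_estimate_error(target_to_camera_m, target_to_recorder_m, off_axis_deg,
                         speed_m_s=SPEED_OF_SOUND_M_S):
    """Estimated minus true arrival time difference, in seconds.

    The estimate uses the camera-to-recorder distance (law of cosines);
    the true value uses the difference of the two target distances.
    """
    theta = math.radians(off_axis_deg)
    camera_to_recorder = math.sqrt(
        target_to_camera_m ** 2 + target_to_recorder_m ** 2
        - 2 * target_to_camera_m * target_to_recorder_m * math.cos(theta)
    )
    true_delay = (target_to_camera_m - target_to_recorder_m) / speed_m_s
    estimated_delay = camera_to_recorder / speed_m_s
    return estimated_delay - true_delay


error_collinear = delay_estimate_error(100.0, 3.0, 0.0)
error_off_axis = delay_estimate_error(100.0, 3.0, 20.0)
```

For a camera at 100 m and a recorder at 3 m, a 20 degree offset gives an error well under one millisecond in this model, consistent with the statement that the error stays small when the arrangement is nearly collinear.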

In the present embodiment, processing that increases the amplitude of the target sound (enhancement processing) may be performed before output by carrying out time-axis shift processing, which shifts at least one of the sounds collected by the camera 1 and the recorder 2 on the time axis to cancel the arrival time difference, and then adding the sounds (hereinafter referred to as time-axis shift addition processing).

Furthermore, in the present embodiment, the enhancement processing of the target sound may be performed by extracting only the components that are reinforced by the time-axis shift addition processing, thereby extracting the target sound components from the synthesized sound obtained by that processing.

For example, if the frequency of the target sound is assumed not to change during a predetermined short period, the frequency components that are reinforced at a predetermined cycle by the time-axis shift addition processing during this period can be regarded as components of the target sound. By extracting these frequency components, a synthesized sound containing only the target sound can be extracted from the collected sound.
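The patent text does not specify how the reinforced frequency components are identified. One possible criterion, shown here purely as an assumption, is to compare the spectra of the two already time-aligned signals over a short frame and keep the bins where the magnitude of the sum is close to the sum of the magnitudes, meaning the two components are in phase. The naive DFT, the threshold of 0.9, and the function names are all hypothetical.

```python
import cmath
import math


def dft(frame):
    """Naive DFT for bins 0..N/2 (adequate for short illustrative frames)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]


def reinforced_bins(aligned_a, aligned_b, threshold=0.9):
    """Return bins where the two aligned signals add nearly in phase."""
    spec_a, spec_b = dft(aligned_a), dft(aligned_b)
    totals = [abs(a) + abs(b) for a, b in zip(spec_a, spec_b)]
    floor = 1e-6 * max(totals)  # ignore bins that carry no real energy
    return [k for k, (a, b) in enumerate(zip(spec_a, spec_b))
            if totals[k] > floor and abs(a + b) / totals[k] >= threshold]


# Target tone in bin 4 is in phase in both signals; the "noise" tone in bin 7
# arrives with opposite phase in the second signal, so it is not reinforced.
N = 32
signal_a = [math.sin(2 * math.pi * 4 * t / N) + 0.5 * math.sin(2 * math.pi * 7 * t / N)
            for t in range(N)]
signal_b = [math.sin(2 * math.pi * 4 * t / N) - 0.5 * math.sin(2 * math.pi * 7 * t / N)
            for t in range(N)]

kept = reinforced_bins(signal_a, signal_b)
```

In this synthetic case only bin 4 is kept. Real recordings would need windowing, a faster transform, and a threshold tuned to the noise conditions.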

Also, when the target sound is, for example, a bird's intermittent chirping, the enhancement processing may extract a synthesized sound that mainly contains the target sound by taking out only a predetermined period that includes the peak period of the components reinforced by the time-axis shift addition processing. For example, a predetermined period including the peak position of the synthesized waveform in FIG. 4 is output as the extraction result of the target sound.
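A minimal sketch of this peak-period extraction is given below. It assumes the synthesized waveform is available as a list of samples and that the predetermined period is a fixed number of samples on each side of the largest peak. The function name `extract_peak_window` and the sample values are hypothetical.

```python
def extract_peak_window(synthesized, half_width):
    """Return (start, end, segment) for a window centered on the largest peak.

    The window covers half_width samples on each side of the peak and is
    clipped at the signal boundaries; end is exclusive.
    """
    if not synthesized:
        raise ValueError("synthesized signal must not be empty")
    peak_index = max(range(len(synthesized)), key=lambda i: abs(synthesized[i]))
    start = max(0, peak_index - half_width)
    end = min(len(synthesized), peak_index + half_width + 1)
    return start, end, synthesized[start:end]


# Illustrative synthesized waveform with one reinforced chirp around index 3.
synthesized = [0.0, 0.1, -0.2, 1.5, -1.2, 0.1, 0.0, 0.05]
start, end, segment = extract_peak_window(synthesized, half_width=2)
```

Here the window spans samples 1 through 5. A recording with several chirps would repeat the search for each peak above some level, which this sketch leaves out.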

In FIG. 1, a control unit 10 is provided in the camera 1, which constitutes the imaging device. The control unit 10 may be configured by a processor using a CPU, an FPGA, or the like. It may operate according to a program stored in a memory (not shown) to control each unit, or some or all of its functions may be realized by hardware electronic circuits.

The camera 1 includes an imaging unit 11 and an ST sound collection unit 12. The imaging unit 11, serving as an image acquisition unit, has an optical system 11a and an image sensor (not shown). The optical system 11a includes lenses for zooming and focusing, an aperture, and the like (not shown), as well as a zoom (variable magnification) mechanism, a focus mechanism, and an aperture mechanism (not shown) that drive them. The image sensor is a CCD, a CMOS sensor, or the like, and the optical system 11a guides an optical image of the subject onto the imaging surface of the image sensor. The image sensor photoelectrically converts the optical image of the subject to acquire a captured image (imaging signal) of the subject.

An imaging control unit 10a configured in the control unit 10 can drive and control the zoom mechanism, focus mechanism, and aperture mechanism of the optical system 11a to adjust zoom, aperture, and focus. A focus and angle-of-view information unit 10c acquires information on zoom, aperture, and focus from the optical system 11a and outputs it to the imaging control unit 10a. With this feedback, the imaging control unit 10a can set zoom, aperture, and focus to the desired values. The imaging unit 11 performs imaging under the control of the imaging control unit 10a and outputs imaging signals of captured images (moving images and still images) to the control unit 10.

A sound collection control and processing unit 10d is configured in the control unit 10 and controls the ST sound collection unit 12. The ST sound collection unit 12 is a stereo microphone or the like. Under the control of the sound collection control and processing unit 10d, it collects the sound around the camera 1 to acquire an audio signal and outputs the acquired sound (hereinafter also referred to as the first sound) to the sound collection control and processing unit 10d, which also functions as an input unit. The ST sound collection unit 12 preferably has its sensitivity peak in the shooting direction of the camera 1, that is, in the optical axis direction of the optical system 11a. A monaural microphone may be used in place of the ST sound collection unit 12.

The camera 1 is provided with an operation unit 15. The operation unit 15 includes a release button, function buttons, and various switches, dials, ring members, and the like (not shown) for shooting mode setting, parameter operation, and so on, and outputs operation signals based on user operations to the control unit 10. The control unit 10 controls each unit based on the operation signals from the operation unit 15.

The control unit 10 takes in the captured images (moving images and still images) from the imaging unit 11. An image processing unit 10b of the control unit 10 performs predetermined signal processing on the captured images, such as color adjustment processing, matrix conversion processing, noise removal processing, and various other kinds of signal processing. The sound collection control and processing unit 10d can take in the sound from the ST sound collection unit 12 and apply predetermined signal processing to it.

The camera 1 is provided with a display unit 16, which has a display screen configured by, for example, an LCD (liquid crystal display). This display screen is provided, for example, on the back of the housing of the camera 1. The control unit 10 causes the display unit 16 to display the captured image processed by the image processing unit 10b. The control unit 10 can also cause the display unit 16 to display various menus, warnings, and the like for the camera 1.

The camera 1 is provided with communication units 18a and 18b. The communication units 18a and 18b are communication devices that comply with predetermined communication standards and, under the control of the control unit 10, can exchange information with the recorder 2. The communication unit 18a is capable of short-range wireless communication such as Bluetooth (registered trademark), and the communication unit 18b is capable of wireless LAN communication such as Wi-Fi (registered trademark). The communication units 18a and 18b are not limited to Bluetooth and Wi-Fi and may use various other communication methods. The control unit 10 can transmit the information recorded in a recording unit 14 to the playback device 3 via the communication unit 18a or 18b.

The camera 1 is provided with a recording unit 14. The recording unit 14 is configured by a predetermined recording medium such as a hard disk or a memory medium. It records information supplied from the control unit 10 and can output the recorded information to the control unit 10. A card interface, for example, can be used as the recording unit 14, in which case the recording unit 14 can record image data on a recording medium such as a memory card.

The control unit 10 can compress the captured image after signal processing and supply the compressed image to the recording unit 14 for recording. The control unit 10 can also compress the sound after signal processing by the sound collection control and processing unit 10d, and supply either the compressed or the uncompressed sound to the recording unit 14 for recording. The camera 1 is provided with a clock unit 19. Using the time information from the clock unit 19, the control unit 10 can add time information to the sound acquired by the ST sound collecting unit 12 and record that sound in the recording unit 14 in association with the image to which time information has been added.

The camera 1 is also provided with an image feature extraction unit 13. The image feature extraction unit 13 may be configured by a processor using a CPU or the like. It performs image analysis on the captured image of the subject obtained by the imaging unit 11, extracts the image features, and outputs the extraction result to the control unit 10. The camera 1 is further provided with a position and orientation sensor unit 17. The position and orientation sensor unit 17 is composed of a position sensor, a gyro sensor, a magnetic sensor, and the like, and detects the position and orientation of the camera 1 and outputs the detection result to the control unit 10.

本実施の形態においては、記録部１４は、連携情報部１４ａを有している。連携情報部１４ａには、レコーダ２及び再生装置３との間の通信に関する情報が記録されており、制御部１０は、連携情報部１４ａから読み出した情報に基づいて通信部１８ａ，１８ｂを制御することで、レコーダ２及び再生装置３との間で通信により情報の授受が可能である。制御部１０は、ＳＴ収音部１２が収音して得た第１音声及び撮像部１１により撮像して得た画像を再生装置３に送信することができるようになっている。 In the present embodiment, the recording unit 14 has a cooperation information unit 14a. Information related to communication between the recorder 2 and the reproduction device 3 is recorded in the cooperation information unit 14a, and the control unit 10 controls the communication units 18a and 18b based on the information read from the cooperation information unit 14a. As a result, information can be exchanged between the recorder 2 and the playback device 3 by communication. The control unit 10 can transmit the first sound obtained by the ST sound collection unit 12 and the image obtained by the image pickup unit 11 to the reproduction device 3.

In the present embodiment, the control unit 10 includes a distance and arrival time difference determination unit 10e. The distance and arrival time difference determination unit 10e obtains the distance from the camera 1 to the recorder 2 by any of various distance measuring methods, and determines the arrival time difference from the obtained distance and the speed of sound. For example, the control unit 10 may image the recorder 2 and measure the distance by the image plane phase difference method, or may measure the distance using hill-climbing focus processing that evaluates contrast in the captured image. The control unit 10 may also obtain the distance to the recorder 2 using the detection result of the position and orientation sensor unit 17.
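The relationship between the measured distance and the arrival time difference can be illustrated with a minimal Python sketch. The function names and the speed-of-sound constant (343 m/s, roughly dry air at 20 °C) are assumptions made for illustration and do not appear in the specification; the sketch also assumes the collinear arrangement in which the camera-to-recorder distance equals the path-length difference.

```python
# Assumed value: speed of sound in dry air at about 20 degrees Celsius.
SPEED_OF_SOUND_M_PER_S = 343.0


def arrival_time_difference(distance_m, speed_m_per_s=SPEED_OF_SOUND_M_PER_S):
    """Return the time (seconds) sound needs to travel distance_m."""
    if distance_m < 0 or speed_m_per_s <= 0:
        raise ValueError("distance must be >= 0 and speed must be > 0")
    return distance_m / speed_m_per_s


def delay_in_samples(distance_m, sample_rate_hz,
                     speed_m_per_s=SPEED_OF_SOUND_M_PER_S):
    """Convert the arrival time difference into a whole number of samples."""
    return round(arrival_time_difference(distance_m, speed_m_per_s)
                 * sample_rate_hz)
```

For a camera-to-recorder distance of 8 m and a hypothetical 48 kHz sampling rate, `delay_in_samples(8.0, 48000)` gives the shift that a later time-axis shift step would apply.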

The control unit 10 may set, as the sound collection target, the subject that the focus and angle-of-view information unit 10c indicates is in focus, the subject located at the center of the captured image, or a subject designated by a user operation. When the sound collection target, the recorder 2, and the camera 1 are not located on a substantially straight line, the control unit 10 may obtain the position of the camera 1 and the distances and directions to the recorder 2 and to the sound collection target, and from these obtain the distance from the sound collection target to the camera 1, the distance from the sound collection target to the recorder 2, and the distance from the camera 1 to the recorder 2.
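For the non-collinear case, one way to obtain the target-to-recorder distance from the two camera-side distances and the angle between the two directions is the law of cosines. The following Python sketch is an illustration under that assumption; the specification does not state which geometric computation is used, and the function names are hypothetical.

```python
import math


def target_to_recorder_distance(d_camera_target_m, d_camera_recorder_m,
                                angle_deg):
    """Law of cosines. angle_deg is the angle at the camera between the
    direction to the sound collection target and the direction to the recorder."""
    angle = math.radians(angle_deg)
    squared = (d_camera_target_m ** 2 + d_camera_recorder_m ** 2
               - 2.0 * d_camera_target_m * d_camera_recorder_m * math.cos(angle))
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(squared, 0.0))


def path_length_difference(d_camera_target_m, d_camera_recorder_m, angle_deg):
    """Target-to-camera distance minus target-to-recorder distance."""
    return d_camera_target_m - target_to_recorder_distance(
        d_camera_target_m, d_camera_recorder_m, angle_deg)
```

When the angle is 0 degrees the three points are collinear, and the path-length difference reduces to the camera-to-recorder distance.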

制御部１０は、ＳＴ収音部１２からの音声に、距離及び到達時間差判定部１０ｅが求めた到達時間差の情報（到達時間差情報）を関連付けて記録部１４に記録するようになっている。 The control unit 10 records the sound from the ST sound collecting unit 12 in the recording unit 14 in association with the information on the arrival time difference (arrival time difference information) obtained by the distance and arrival time difference determination unit 10e.

The control unit 10 is also provided with a generation timing estimation unit 10f. The generation timing estimation unit 10f receives the captured image from the imaging unit 11 and the image feature information from the image feature extraction unit 13, and estimates by image analysis the timing at which the sound collection target emits sound. For example, when the sound collection target is a bird and the bird's call is collected as the target sound, the generation timing estimation unit 10f estimates the timing at which the bird calls from the timing and the degree of opening and closing of the bird's beak, and outputs the estimation result.
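As a rough illustration of estimating the call timing from the beak, the Python sketch below takes a per-frame beak-opening measurement and reports the times at which the opening rises above a threshold. The per-frame measurement, the threshold, and the function name are all hypothetical; the specification does not describe how the image features are turned into a timing estimate.

```python
def estimate_call_onsets(beak_opening, frame_rate_hz, threshold):
    """Return the times (seconds) at which the beak opening first reaches
    the threshold after having been below it.

    beak_opening: one hypothetical opening measurement per video frame.
    """
    onsets = []
    previously_open = False
    for frame_index, opening in enumerate(beak_opening):
        is_open = opening >= threshold
        if is_open and not previously_open:
            onsets.append(frame_index / frame_rate_hz)
        previously_open = is_open
    return onsets
```

Each returned time could then be stored with the first sound as the time information of the generation timing.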

The control unit 10 records the sound from the ST sound collecting unit 12 in the recording unit 14 in association with the information on the estimated sound generation timing obtained by the generation timing estimation unit 10f (time information of the generation timing).

The recorder 2 is provided with a control unit 20. The control unit 20 may be configured by a processor using a CPU or the like and may operate according to a program stored in a memory (not shown) to control each unit, or some or all of its functions may be realized by hardware electronic circuits. The recorder 2 has communication units 21a and 21b. The communication units 21a and 21b are communication devices that comply with a predetermined communication standard, and are controlled by the control unit 20 so that information can be transmitted to and received from the camera 1 and the playback device 3. The communication unit 21a is capable of short-range wireless communication such as Bluetooth (registered trademark), and the communication unit 21b is capable of wireless LAN communication such as Wi-Fi (registered trademark). The communication units 21a and 21b are not limited to Bluetooth and Wi-Fi and can employ various other communication systems.

レコーダ２にはＳＴ収音部２２が設けられており、ＳＴ収音部２２は、例えば図示しないステレオマイクロホンにより構成することができる。なお、ＳＴ収音部２２に代えて、モノラルマイクロホンを採用してもよい。ＳＴ収音部２２は、ステレオマイクロホンによって音声（以下、第２音声ともいう）を取得するようになっている。制御部２０には、収音制御部２０ａが構成されており、収音制御部２０ａは、ＳＴ収音部２２の収音を制御する。入力部として機能する収音制御部２０ａは、ＳＴ収音部２２からの第２音声を取り込むようになっている。 The recorder 2 is provided with the ST sound collecting unit 22, and the ST sound collecting unit 22 can be configured by, for example, a stereo microphone (not shown). A monaural microphone may be adopted instead of the ST sound collecting unit 22. The ST sound collecting unit 22 is adapted to acquire a voice (hereinafter, also referred to as a second voice) with a stereo microphone. The control unit 20 is configured with a sound collection control unit 20a, and the sound collection control unit 20a controls the sound collection of the ST sound collection unit 22. The sound collection control unit 20a that functions as an input unit is configured to take in the second sound from the ST sound collection unit 22.

The recorder 2 is provided with an operation unit 23. The operation unit 23 includes various switches, dials, ring members, and the like (not shown) for recording mode setting, parameter operation, and so on, and outputs an operation signal based on a user operation to the control unit 20. The control unit 20 controls each unit based on the operation signal from the operation unit 23. When control information is supplied from the control unit 10 of the camera 1 via the communication units 21a and 21b, the control unit 20 may control each unit based on this control information. In this case, recording by the recorder 2 can be controlled by the control unit 10 of the camera 1.

The recorder 2 is provided with a recording unit 25. The recording unit 25 is configured by a predetermined recording medium such as a hard disk or a memory medium, and can record information supplied from the control unit 20 and output recorded information to the control unit 20. A card interface, for example, can be used as the recording unit 25, and the recording unit 25 can record audio data on a recording medium such as a memory card. The control unit 20 can supply the second sound after signal processing to the recording unit 25 for recording. The recorder 2 is provided with a clock unit 24. Using the time information from the clock unit 24, the control unit 20 can add time information to the second sound acquired by the ST sound collecting unit 22 and record it in the recording unit 25.

また、記録部２５は、連携情報部２５ａを有している。連携情報部２５ａには、カメラ１及び再生装置３との間の通信に関する情報が記録されており、制御部２０は、連携情報部２５ａから読み出した情報に基づいて通信部２１ａ，２１ｂを制御することで、カメラ１及び再生装置３との間で通信により情報の授受が可能である。制御部２０は、ＳＴ収音部２２が収音して得た第２音声を再生装置３に送信することができるようになっている。 The recording unit 25 also has a cooperation information unit 25a. Information related to communication between the camera 1 and the playback device 3 is recorded in the cooperation information unit 25a, and the control unit 20 controls the communication units 21a and 21b based on the information read from the cooperation information unit 25a. As a result, information can be exchanged between the camera 1 and the reproduction device 3 by communication. The control unit 20 can transmit the second sound acquired by the ST sound collecting unit 22 to the reproducing device 3.

再生装置３は、コンピュータや、スマートフォンやタブレット端末等によって構成されていてもよい。再生装置３には、制御部３０が構成されている。音声強調部として機能する制御部３０は、ＣＰＵやＦＰＧＡ等を用いたプロセッサによって構成されていてもよく、図示しないメモリに記憶されたプログラムに従って動作して各部を制御するものであってもよいし、ハードウェアの電子回路で機能の一部又は全部を実現するものであってもよい。 The playback device 3 may be configured by a computer, a smartphone, a tablet terminal, or the like. The playback device 3 includes a control unit 30. The control unit 30 that functions as a voice emphasizing unit may be configured by a processor that uses a CPU, an FPGA, or the like, and may operate according to a program stored in a memory (not shown) to control each unit. Alternatively, some or all of the functions may be realized by an electronic circuit of hardware.

The playback device 3 is provided with an operation unit 32. The operation unit 32 includes various switches, dials, ring members, and the like (not shown) for playback mode setting, parameter operation, and so on, and outputs an operation signal based on a user operation to the control unit 30. The control unit 30 controls each unit based on the operation signal from the operation unit 32. The communication unit 31 is a communication device that complies with a predetermined communication standard, and is controlled by the control unit 30 so that it can communicate with the camera 1 and the recorder 2 to exchange information. The playback device 3 is provided with a recording unit 34. The recording unit 34 is configured by a predetermined recording medium such as a hard disk or a memory medium, and records the information supplied from the control unit 30.

制御部３０は、通信部３１を介して、カメラ１からの画像及び第１音声を受信すると共に、レコーダ２からの第２音声を受信する。なお、上述したように、画像、第１音声及び第２音声には時間情報が付加されている。また、第１音声に対応付けられた到達時間差情報及び音の発生タイミングの推測結果の情報も通信部３１によって受信されるようになっている。第１及び第２音声に付加されたこれらの情報を連携情報というものとする。制御部３０は、カメラ１及びレコーダ２から受信した情報を記録部３４に与えて記録するようになっている。 The control unit 30 receives the image and the first sound from the camera 1 and the second sound from the recorder 2 via the communication unit 31. As described above, the time information is added to the image, the first sound, and the second sound. The communication unit 31 also receives the arrival time difference information associated with the first voice and the information about the estimation result of the sound generation timing. These pieces of information added to the first and second voices are called cooperation information. The control unit 30 supplies the information received from the camera 1 and the recorder 2 to the recording unit 34 to record the information.

The control unit 30 is provided with a time-axis shift addition processing unit 30a. The time-axis shift addition processing unit 30a reads out the information recorded in the recording unit 34 and applies time-axis shift processing to at least one of the received first sound and second sound. This cancels the difference between the time the target sound takes to travel from the sound collection target to the recorder 2 and the time it takes to reach the camera 1, so that the phases of the target sound coincide. The time-axis shift addition processing unit 30a then performs time-axis shift addition processing, which combines the first and second sounds after the time-axis shift processing to obtain a synthesized sound.
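The time-axis shift addition can be sketched as a simple delay-and-sum operation. The Python below is an illustrative sketch, not the implementation of the time-axis shift addition processing unit 30a: it assumes the second sound (recorder side, nearer the target) is the one delayed, that the delay is already known as a whole number of samples, and that both signals share one sampling rate. The function name is hypothetical.

```python
def shift_and_add(first_sound, second_sound, delay_samples):
    """Delay second_sound by delay_samples and add it to first_sound.

    first_sound:  samples from the far microphone (camera side).
    second_sound: samples from the near microphone (recorder side).
    The target sound reaches the near microphone earlier, so delaying
    second_sound brings the target sound into phase with first_sound.
    """
    if delay_samples < 0:
        raise ValueError("delay_samples must be non-negative")
    delayed = [0.0] * delay_samples + list(second_sound)
    length = min(len(first_sound), len(delayed))
    return [first_sound[i] + delayed[i] for i in range(length)]
```

With the shift applied, the target sound adds coherently, while sounds arriving with a different time difference, such as noise from other positions, do not line up and are reinforced less.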

時間軸シフト加算処理部３０ａは、時間軸シフト処理における音声の遅延時間を、画像に基づく音の発生タイミングの推測結果に応じて調整してもよい。カメラ１によって取得された画像と音声は同期して記録されており、画像に基づく音の発生タイミングの推測結果と、第１音声の波形の立ち上がりタイミングとは一致していると考えられる。従って、時間軸シフト加算処理部３０ａは、音の発生タイミングの推測結果に応じて、音声の遅延時間を調整することで、到達時間差を確実に相殺した時間軸シフト処理が可能となる。 The time axis shift addition processing unit 30a may adjust the sound delay time in the time axis shift processing according to the estimation result of the sound generation timing based on the image. The image and the sound acquired by the camera 1 are recorded in synchronization with each other, and it is considered that the estimation result of the sound generation timing based on the image matches the rising timing of the waveform of the first sound. Therefore, the time-axis shift addition processing unit 30a can perform the time-axis shift processing that reliably cancels the arrival time difference by adjusting the audio delay time according to the estimation result of the sound generation timing.

The control unit 30 is provided with a target voice extraction unit 30b. Among the frequency components of the synthesized sound obtained by the time-axis shift addition processing, the target voice extraction unit 30b detects the frequency components that reinforce each other at a predetermined period (the components of the target sound) and performs filter processing to extract those frequency components. The target voice extraction unit 30b outputs the filtered sound (synthesized sound) to the reproduction unit 33 and the recording unit 34.

Alternatively, the target voice extraction unit 30b may perform filter processing that extracts the synthesized sound only for a predetermined period including the peak period of the components that reinforce each other at a predetermined period as a result of the time-axis shift addition processing, and may output the filtered sound (synthesized sound) to the reproduction unit 33 and the recording unit 34.
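Extracting only a period around the peak can be pictured as a time gate. The Python sketch below keeps the samples within a fixed window around the largest-magnitude sample of the synthesized sound and zeroes the rest. Using a single global peak and a symmetric window is a simplifying assumption for illustration; the specification does not say how the peak period or the predetermined period is chosen, and the function name is hypothetical.

```python
def gate_around_peak(samples, half_window):
    """Keep samples within half_window of the largest-magnitude sample;
    replace all other samples with 0.0."""
    if not samples:
        return []
    peak_index = max(range(len(samples)), key=lambda i: abs(samples[i]))
    start = max(0, peak_index - half_window)
    stop = min(len(samples), peak_index + half_window + 1)
    return [value if start <= index < stop else 0.0
            for index, value in enumerate(samples)]
```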

記録部３４は、対象音声抽出部３０ｂからの音声が与えられて、記録するようになっている。再生部３３は、図示しないスピーカを備えており、制御部３０に制御されて、対象音声抽出部３０ｂからの音声を再生出力するようになっている。 The recording unit 34 receives the voice from the target voice extracting unit 30b and records the voice. The reproduction unit 33 includes a speaker (not shown), and is controlled by the control unit 30 to reproduce and output the sound from the target sound extraction unit 30b.

Note that the playback device 3 may omit the target voice extraction unit 30b and output the synthesized sound obtained by the time-axis shift addition processing of the time-axis shift addition processing unit 30a to the reproduction unit 33 and the recording unit 34.

Next, the operation of the embodiment configured as described above will be described with reference to FIGS. 5 to 9. FIG. 5 is a flowchart for explaining the operation of the camera 1, FIG. 6 is a flowchart for explaining the operation of the recorder 2, and FIG. 7 is a flowchart for explaining the operation of the playback device 3. FIG. 8 is an explanatory diagram for explaining the state of the camera 1 during shooting, and FIG. 9 is an explanatory diagram showing the state of the playback device 3 during playback.

(Recording)
First, the sound collection and recording operations will be described. As shown in the example of FIG. 2, assume that a bird 41 is perched on a branch of a tree 43a and that the recorder 2 is installed relatively close to the tree 43a. The user carries the camera 1 and photographs the bird 41 from a position relatively far from the tree 43a. Both the ST sound collecting unit 12 of the camera 1 and the ST sound collecting unit 22 of the recorder 2 pick up the call of the bird 41, which is the sound collection target (the target sound), together with other ambient noise (sound from the grass 42).

カメラ１の筐体１ａの背面には、表示部１６の表示画面１６ａが設けられている。ユーザは、例えば、筐体１ａを手で把持して、表示画面１６ａを見ながら被写体である鳥４１を視野範囲に捉えた状態で、シャッタボタン１５ａを押下操作することで撮影を行う。 A display screen 16a of the display unit 16 is provided on the back surface of the housing 1a of the camera 1. The user, for example, grips the casing 1a with his/her hand and, while watching the display screen 16a, captures the bird 41, which is the subject, in the visual field range, and presses the shutter button 15a to perform photographing.

When the power is turned on, the control unit 10 of the camera 1 determines in step S1 of FIG. 5 whether the imaging mode has been instructed. When the imaging mode has not been instructed, the control unit 10 shifts to the designated mode, for example, a cooperation mode for making settings and performing transmission and reception for cooperation with the recorder 2, or a playback mode for reproducing recorded images.

例えば、連携モードでは、制御部１０は、カメラ１とレコーダ２との間の距離の測定を行う。これにより、制御部１０は、収音対象である鳥４１とレコーダ２との間の距離と鳥４１とカメラ１との間の距離との距離差を求める。例えば、カメラ１により鳥４１だけでなくレコーダ２も撮影可能な場合には、制御部１０は撮影時のピント合わせ操作において、測距が可能であり距離差を求めることができる。また、制御部１０は、位置及び方位センサ部１７からの位置及び撮影方位の情報を用いることで、距離差を求めることが可能である。また、レコーダ２が測位機能を有している場合には、制御部３０はレコーダ２から位置情報を取得することで、距離差を算出してもよい。 For example, in the cooperation mode, the control unit 10 measures the distance between the camera 1 and the recorder 2. As a result, the control unit 10 obtains the distance difference between the distance between the bird 41, which is the sound collection target, and the recorder 2 and the distance between the bird 41 and the camera 1. For example, when not only the bird 41 but also the recorder 2 can be photographed by the camera 1, the control unit 10 can measure the distance and can obtain the distance difference in the focusing operation at the time of photographing. Further, the control unit 10 can obtain the distance difference by using the information on the position and the shooting direction from the position and direction sensor unit 17. If the recorder 2 has a positioning function, the control unit 30 may calculate the distance difference by acquiring the position information from the recorder 2.

制御部１０は、表示部１６に距離差の情報を表示するようにしてもよい。例えば、制御部１０は、鳥４１とレコーダ２とカメラ１とが略直線状に配置されているものとして、カメラ１からレコーダ２までの距離を、第１音声と第２音声との伝達距離の距離差として表示してもよい。図８はカメラ１からレコーダ２までの距離が８ｍであることを示す表示４１ｂが表示画面１６ａ上に表示されていることを示している。 The control unit 10 may display information on the distance difference on the display unit 16. For example, assuming that the bird 41, the recorder 2 and the camera 1 are arranged substantially linearly, the control unit 10 determines the distance from the camera 1 to the recorder 2 as the transmission distance between the first voice and the second voice. You may display as a distance difference. FIG. 8 shows that the display 41b indicating that the distance from the camera 1 to the recorder 2 is 8 m is displayed on the display screen 16a.

撮像モードが指示されると、制御部１０は、次のステップＳ２において動画記録を開始し、ステップＳ３においてＳＴ収音部１２による収音を開始する。なお、動画及び音声の取得時においては、時計部１９は計時を開始し、制御部１０は時計部１９からの時間情報も同時に記録する。 When the imaging mode is instructed, the control unit 10 starts moving image recording in the next step S2, and starts sound collection by the ST sound collection unit 12 in step S3. When the moving image and the sound are acquired, the clock unit 19 starts clocking, and the control unit 10 records the time information from the clock unit 19 at the same time.

In the next step S4, the control unit 10 determines whether cooperation with the recorder 2 is designated. When cooperation is not designated, the control unit 10 moves the process to step S8 and determines whether a recording end operation has been performed. When cooperation is designated, the control unit 10 transmits, in step S5, information requesting cooperation to the recorder 2 so that sound is collected in cooperation with the recorder 2, and proceeds to step S6.

When the power is turned on, the control unit 20 of the recorder 2 determines in step S21 of FIG. 6 whether the recording mode is designated. When the recording mode is not designated, the control unit 20 shifts to the designated mode, for example, a cooperation mode for making settings and performing transmission and reception for cooperation with the camera 1, or a playback mode for reproducing recorded sound. The control unit 20 may also determine that recording has been instructed when information indicating the start of recording is transmitted from the camera 1.

録音モードが指定されると、制御部２０は、次のステップＳ２２において、音声の記録を開始する。即ち、制御部２０は、ＳＴ収音部２２に収音を開始させ、収音された音声を第２音声として記録部２５に与えて記録を開始する。なお、制御部２０は、音声の取得時には、時計部２４に計時を開始させ、時計部２４からの時間情報も同時に記録する。 When the recording mode is designated, the control unit 20 starts recording the voice in the next step S22. That is, the control unit 20 causes the ST sound collecting unit 22 to start collecting sound, gives the collected sound as the second sound to the recording unit 25, and starts recording. It should be noted that the control unit 20 causes the clock unit 24 to start timekeeping and records the time information from the clock unit 24 at the same time when the voice is acquired.

制御部２０は、次のステップＳ２３において、カメラ１との連携が指定されている否かを判定する。連携が指定されていない場合には、制御部２０は、処理をステップＳ２７に移行して録音の終了が指定された否かを判定する。 In the next step S23, the control unit 20 determines whether or not the cooperation with the camera 1 is designated. When the cooperation is not designated, the control unit 20 shifts the processing to step S27 and determines whether or not the end of recording is designated.

When cooperation is designated, the control unit 20 adds the cooperation information to the second sound and records it in the recording unit 25 in step S24. The control unit 20 also controls the sound collection control unit 20a so that recording is performed according to the sensitivity setting designated by the control unit 10 of the camera 1. The sound collection control unit 20a thereby performs sensitivity settings such as automatically setting the microphone sensitivity so that the recording level is appropriate, or raising the sensitivity as the distance from the sound collection target to the recorder 2 increases.
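One possible way to raise sensitivity with distance is to compensate for the inverse-distance drop in sound pressure, which is about 6 dB per doubling of distance for a point source in a free field. The Python sketch below illustrates that idea only. The specification does not state a formula, so the free-field assumption, the reference distance, and the function name are all assumptions.

```python
import math


def distance_gain_db(distance_m, reference_distance_m=1.0):
    """Gain (dB) that offsets inverse-distance attenuation relative to a
    reference distance, assuming a point source in a free field."""
    if distance_m <= 0 or reference_distance_m <= 0:
        raise ValueError("distances must be positive")
    return 20.0 * math.log10(distance_m / reference_distance_m)
```

In practice a sensitivity setting would also be limited by the microphone's noise floor and clipping level, which this sketch does not model.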

次に、制御部２０は、カメラ１の制御部１０に対して連携応答の送信を行う。カメラ１の制御部１０は、ステップＳ６において、レコーダ２の制御部２０から連携応答を受信し、連携情報を第１音声に付加して記録部１４に記録する。なお、ステップＳ６，Ｓ２６においては、カメラ１の制御部１０とレコーダ２の制御部２０との間で互いに連携応答の送信及び受信が行われて、記録する音声の同期を確立させる。例えば、カメラ１の時間情報とレコーダ２の時間情報とを所定の時間基準に一致させる処理を行う。これにより、時間軸シフト処理によって到達時間差を確実に相殺することが可能となる。 Next, the control unit 20 transmits a cooperation response to the control unit 10 of the camera 1. In step S6, the control unit 10 of the camera 1 receives the cooperation response from the control unit 20 of the recorder 2, adds the cooperation information to the first voice, and records it in the recording unit 14. In steps S6 and S26, the control unit 10 of the camera 1 and the control unit 20 of the recorder 2 transmit and receive the cooperation response to each other to establish the synchronization of the voices to be recorded. For example, a process of matching the time information of the camera 1 and the time information of the recorder 2 with a predetermined time reference is performed. This makes it possible to reliably cancel the arrival time difference by the time axis shift processing.
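The specification does not say how the two clocks are brought to a common time reference. As one illustration, the Python sketch below estimates the recorder's clock offset from a single request and response exchange, using the well-known NTP-style calculation that assumes the outbound and return transmission delays are equal. The function names and the four-timestamp exchange are assumptions, not part of the specification.

```python
def estimate_clock_offset(t_request_sent, t_request_received,
                          t_response_sent, t_response_received):
    """Estimate how far the recorder clock is ahead of the camera clock.

    t_request_sent, t_response_received: read from the camera clock.
    t_request_received, t_response_sent: read from the recorder clock.
    Assumes the transmission delay is the same in both directions.
    """
    return ((t_request_received - t_request_sent)
            + (t_response_sent - t_response_received)) / 2.0


def to_camera_time(recorder_timestamp, offset):
    """Convert a recorder timestamp to the camera's time base."""
    return recorder_timestamp - offset
```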

The examples of FIGS. 5 to 7 show the case in which the camera 1 and the recorder 2 record independently of each other and the recorded audio data is transmitted to the playback device 3 for reproduction. If instead, for example, the second sound collected by the recorder 2 is transmitted as-is to the camera 1, then, ignoring the communication delay, the first sound and the second sound are already synchronized, and the process of matching the time information of the camera 1 and the time information of the recorder 2 to a predetermined time reference can be omitted. In addition, if the playback device 3 performs time-axis shift processing that uses the waveform shapes of the first sound and the second sound, the arrival time difference can be reliably canceled even if there is some error between the time information of the camera 1 and that of the recorder 2.
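A common way to align two recordings by waveform shape is to search for the lag that maximizes their cross-correlation. The Python sketch below illustrates this approach; cross-correlation is an assumption here, since the specification only says that the waveform shapes are used, and the function name is hypothetical.

```python
def best_lag(first_sound, second_sound, max_lag):
    """Return the lag in samples (0..max_lag) by which second_sound should
    be delayed so that it correlates best with first_sound."""
    best, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Correlation of first_sound with second_sound delayed by `lag`.
        score = sum(first_sound[i + lag] * second_sound[i]
                    for i in range(len(second_sound))
                    if i + lag < len(first_sound))
        if score > best_score:
            best, best_score = lag, score
    return best
```

The search range `max_lag` could be bounded using the arrival time difference computed from the measured distance, so that the correlation only refines the timestamp-based estimate.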

In step S7 following step S6, the distance and arrival time difference determination unit 10e of the control unit 10 determines the arrival time difference based on information such as the distance to the sound collection target and the distance between the microphones. The control unit 10 supplies the arrival time difference information to the recording unit 14 and records it in association with the first sound. Because the arrival time difference is thus obtained successively even when the camera 1 is moved during shooting and the distance between the camera 1 and the recorder 2 changes, the time-axis shift processing can reliably cancel the arrival time difference. Also in step S7, the generation timing estimation unit 10f estimates the generation timing of the target sound, and the estimation result is recorded in association with the first sound.

In step S8, the control unit 10 determines whether an operation to end video and audio recording has been performed. If not, the control unit 10 returns the process to step S1. If the end operation has been performed, the control unit 10 determines in the next step S9 whether recorder cooperation is in effect. When cooperation with the recorder 2 is in effect, the control unit 10 proceeds from step S9 to step S10, transmits a notice of the end of cooperation to the recorder 2, and then proceeds to step S11.

In step S27 following step S26, the control unit 20 of the recorder 2 determines whether the end of recording has been instructed. When the control unit 10 notifies it of the end of cooperation, or when the end of recording is instructed by a user operation, the control unit 20 determines that the end of recording has been instructed and proceeds to the next step S28 to convert the second sound into a file. Time information is either added to the second sound as metadata or stored as a separate file associated with the second sound. When the control unit 20 determines that the end of recording has not been instructed, it returns the process to step S21.

When the control unit 10 of the camera 1 determines in step S9 that cooperation with the recorder 2 is not active, or after step S9 ends, it proceeds to step S11, converts the recorded image and first sound into a file, and returns the process to step S1. In this case, the time information, the arrival time difference information, and the information on the generation timing estimation result are also associated with the first sound and filed either as metadata of the first sound or as an independent file.

(Playback)
Next, assume that the first sound and second sound collected by the camera 1 and the recorder 2 are to be played back. For example, assume that the playback device 3 plays back a moving image acquired by the camera 1. In step S31, the control unit 30 of the playback device 3 determines whether the linked moving image playback mode is designated. The linked moving image playback mode is a mode in which, when a moving image is played back, the second sound is used together with the first sound corresponding to that moving image, so that the sound is reproduced from the first and second sounds in cooperation.

When the linked moving image playback mode is not designated, the control unit 30 shifts to whichever mode is designated, such as a normal playback mode. When the linked moving image playback mode is designated, the control unit 30 of the playback device 3 receives the image and the first sound from the camera 1 via the communication unit 31 and receives the second sound from the recorder 2. Time information is added to the image, the first sound, and the second sound, and the arrival time difference information and the information on the estimation result of the sound generation timing are added to the first sound; the control unit 30 also receives this cooperation information (step S31). The control unit 30 supplies the received information to the recording unit 34 for recording.

The time-axis shift addition processing unit 30a of the control unit 30 reads the information recorded in the recording unit 34 and, based on that information, calculates the arrival time difference (the time-axis shift amount) (step S33). Using the calculated shift amount, the time-axis shift addition processing unit 30a applies time-axis shift processing to at least one of the received first and second sounds. This cancels the difference between the time the target sound takes to reach the recorder 2 from the sound collection target and the time it takes to reach the camera 1, so that the phases of the target sound coincide. The time-axis shift addition processing unit 30a then combines the first and second sounds after the time-axis shift processing to obtain a synthesized sound (step S34). In this synthesized sound, the target sound components of the first and second sounds are in phase and reinforce each other (the amplitude of the synthesized sound increases). The time-axis shift addition processing unit 30a synchronizes the synthesized sound with the moving image using the time information.
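The time-axis shift addition of steps S33 and S34 can be sketched as a simple delay-and-sum over sample lists. This is only an illustration under assumptions: the function name, the list-based signal representation, and the choice to delay the second sound (rather than advance the first) are not taken from the description.

```python
def shift_and_add(first, second, shift):
    """Delay `second` by `shift` samples and add it to `first`.

    A positive shift delays the second (recorder) sound; samples shifted in
    from outside the recording are treated as silence. The output has the
    same length as `first`.
    """
    out = []
    for i, a in enumerate(first):
        j = i - shift  # index into the undelayed second sound
        b = second[j] if 0 <= j < len(second) else 0.0
        out.append(a + b)
    return out


# Toy signals: one pulse reaches the recorder at sample 1 and the camera,
# attenuated, at sample 4, so the arrival difference is 3 samples.
first_sound = [0.0, 0.0, 0.0, 0.0, 0.5, 0.0]
second_sound = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]

aligned = shift_and_add(first_sound, second_sound, 3)
unaligned = shift_and_add(first_sound, second_sound, 0)
```

With the matching shift the two pulses land on the same sample and add to 1.5; without it they stay separate, which is the difference between reinforcement of the target sound and plain mixing.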

The control unit 30 may supply the moving image combined by the time-axis shift addition processing unit 30a to the playback unit 33 for playback. In this case the target sound is emphasized on playback, and the user can clearly hear the call of the bird 41 in the reproduced sound.

The time-axis shift addition processing unit 30a may adjust the time by which the sound is delayed in the time-axis shift processing according to the image-based estimation result of the sound generation timing. The image and sound acquired by the camera 1 are recorded in synchronization, so adjusting the delay time of the sound according to the estimated sound generation timing enables time-axis shift processing that reliably cancels the arrival time difference.

FIG. 9 shows a captured image displayed on the display screen 16a. In the example of FIG. 9, the playback device 3 is implemented as a tablet terminal 3a. The display screen 33a of the playback unit 33 has an area 35 for displaying the image captured by the camera 1, and an image 35a of the bird 41 is displayed in the area 35. The display screen 33a also shows, by a microphone distance display 36, that the distance to the microphone is 8 m and, by a sound collection target distance display 37, that the distance to the bird 41, the sound collection target, is 10 m.

The control unit 30 may supply the synthesized sound from the time-axis shift addition processing unit 30a to the speaker of the playback unit 33 and output the sound in synchronization with the image on the display screen 33a.

Further, in step S34, the control unit 30 may have the target sound extraction unit 30b extract the target sound from the synthesized sound supplied by the time-axis shift addition processing unit 30a and then output it to the playback unit 33 in synchronization with the captured image. The target sound extraction unit 30b, for example, detects, among the frequency components of the synthesized sound obtained by the time-axis shift addition processing, those that reinforce each other at a predetermined period, and performs filter processing that extracts those frequency components. This filter processing extracts only the target sound, so the target sound is emphasized still further. As a result, the user can hear the call of the bird 41 very clearly in the reproduced sound output acoustically from the speaker of the playback unit 33.

The control unit 30 records the synthesized sound from the time-axis shift addition processing unit 30a and the image in the recording unit 34 in association with each other. The control unit 30 may also record the synthesized sound from the target sound extraction unit 30b and the image in the recording unit 34 in association with each other.

As described above, the present embodiment provides a sound acquisition device comprising: an input unit that takes in a first sound from a first microphone, which is arranged at a first distance from a sound collection target and acquires the first sound by collecting the sound emitted from the target, and that takes in a second sound from a second microphone, which is arranged at a second distance, different from the first distance, from the target and acquires the second sound by collecting the sound emitted from the target; and a sound enhancement unit that performs enhancement processing for emphasizing, in the first and second sounds, a component based on the distance difference between the first distance and the second distance.

That is, in the present embodiment, the target sound is collected by the first and second microphones, the arrival time difference is obtained from the difference between the distance from the sound collection target to the first microphone and the distance to the second microphone, and at least one of the collected first and second sounds is shifted on the time axis by an amount corresponding to this arrival time difference before the two are combined. This makes it possible to emphasize the target sound. The target sound can thereby be emphasized relative to noise emitted from positions other than the sound collection target, so that the sound collection target can be heard clearly and with high sound quality. Furthermore, by detecting, among the frequency components of the synthesized sound after the time-axis shift addition processing, those that reinforce each other at a predetermined period, and performing filter processing that extracts them, only the target sound can be extracted and the target sound can be emphasized still further. When the sound collection target and the first and second microphones are arranged on a substantially straight line, the arrival time difference is calculated from the distance between the first and second microphones, which simplifies distance measurement and the like.

(Second Embodiment)
FIG. 10 is a block diagram showing the second embodiment of the present invention. In FIG. 10, components identical to those in FIG. 1 are given the same reference numerals and their description is omitted. This embodiment differs from the first embodiment in the enhancement processing performed in the playback device.

FIG. 10 also shows an example in which the sound acquisition device is distributed among a camera serving as an imaging device, a recorder, and a playback device. In the present embodiment as well, however, the sound acquisition device may be configured within the camera, within the recorder, distributed between the camera and the recorder, or as a device independent of these devices.

In the first embodiment, to simplify the description, it was assumed that the only sounds occurring within the range that the camera 1 and the recorder 2 can pick up are the target sound and the noise from the grass 42. However, the first embodiment performs the time-axis shift addition processing using an arrival time difference derived from the difference between the distance from the sound collection target to the camera 1 and the distance to the recorder 2, so every sound that arrives at the camera 1 and the recorder 2 offset by that arrival time difference is picked up as a result of their cooperation. For example, not only are sounds emitted from all positions on the straight line connecting the camera 1 and the recorder 2 emphasized as the target sound, but so are sounds from all positions for which the sound is collected with an offset equal to the arrival time difference.
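The ambiguity described here can be checked numerically. The sketch below uses hypothetical 2D coordinates (camera at the origin, recorder 8 m away on the x axis) and a helper name that is not from the description; it computes the path-length difference that determines the arrival time difference.

```python
import math


def path_difference(source, camera, recorder):
    """Distance from source to camera minus distance from source to recorder (m)."""
    return math.dist(source, camera) - math.dist(source, recorder)


# Hypothetical layout, in metres.
camera = (0.0, 0.0)
recorder = (8.0, 0.0)

# Two sources on the camera-recorder line, on the far side of the recorder.
near_on_line = path_difference((10.0, 0.0), camera, recorder)
far_on_line = path_difference((25.0, 0.0), camera, recorder)

# Two sources off the line that nevertheless share one path difference.
off_line_up = path_difference((8.0, 6.0), camera, recorder)
off_line_down = path_difference((8.0, -6.0), camera, recorder)
```

Both on-line sources give a difference of 8 m, and the two off-line sources both give 4 m, so a single shift amount cannot distinguish sources within each pair. This is the limitation that the second embodiment addresses.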

The present embodiment therefore makes it possible to acquire, as the target sound, only the sound emitted from a single predetermined position on the straight line connecting the camera 1 and the sound collection target. Furthermore, in the present embodiment, by prescribing the position of the recorder 2, it is also possible to acquire as the target sound only the sound emitted from one known position on the straight line connecting the camera 1 and the sound collection target.

In the first embodiment, a monaural microphone could be used as the ST sound collection unit 12 of the camera 1. In the present embodiment, the ST sound collection unit 12 must be a stereo microphone.

First, the manner of sound collection in the second embodiment is described with reference to FIGS. 11 to 13. FIG. 11 is an explanatory diagram showing the positional relationship among the bird 41, the grass 42, the recorder 2, and the camera 1 in XY coordinates. FIGS. 12 and 13 are explanatory diagrams, with time on the horizontal axis and amplitude on the vertical axis, for explaining the time offset of the collected sounds.

In the example of FIG. 11, the bird 41, the recorder 2, and the camera 1 are not arranged on a single straight line. The ST sound collection unit 12 is a stereo microphone having a right microphone 12R and a left microphone 12L. As with the stereo microphones used in typical digital cameras, the microphones 12R and 12L of the ST sound collection unit 12 have mutually identical directivity along the optical axis VC of the imaging unit 11 and the same sensitivity. Consequently, in the right first sound and the left first sound obtained by the microphones 12R and 12L respectively, the components of sound emitted from positions on the optical axis VC of the imaging unit 11 are identical to each other.

Therefore, when the camera 1 captures an image with the bird 41 positioned on the optical axis VC, that is, with the bird 41 at the center of the screen, the components A1R and A1L of the target sound from the bird 41 acquired by the microphones 12R and 12L have the same waveform. For example, if the frequency of the target sound does not change during a given short period, the frequency components whose waveforms are identical (in the same phase) during that period can be regarded as components of the target sound. By performing processing that extracts those frequency components (hereinafter, same-phase component extraction processing), only the target sound can be extracted from the collected left and right first sounds (the left first sound and the right first sound). Note that for sound emitted from any position on the optical axis VC of the imaging unit 11, the right first sound and the left first sound are in the same phase. In other words, the sound obtained by the same-phase component extraction processing (hereinafter, the third sound) can be regarded as sound emitted from somewhere on the optical axis VC of the imaging unit 11.
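The description does not specify how the same-phase component extraction is implemented. One possible reading, offered only as a sketch under assumptions, is to compare the left and right spectra bin by bin and keep the bins whose phases agree. The function names, the phase tolerance, the magnitude floor, and the use of a plain discrete Fourier transform over one short frame are all illustrative choices.

```python
import cmath
import math


def dft(x):
    """Plain O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def extract_same_phase(left, right, max_phase_diff=0.1, min_magnitude=1e-9):
    """Keep only frequency bins whose left and right phases agree.

    max_phase_diff (radians) and min_magnitude are assumed tuning values.
    """
    kept = []
    for l, r in zip(dft(left), dft(right)):
        if abs(l) < min_magnitude or abs(r) < min_magnitude:
            kept.append(0j)  # bin is effectively empty in one channel
            continue
        phase_diff = abs(cmath.phase(l * r.conjugate()))
        kept.append((l + r) / 2 if phase_diff <= max_phase_diff else 0j)
    return idft(kept)


# Toy frame: an on-axis tone identical in both channels, plus an off-axis
# tone that reaches the right channel with a 1 rad phase offset.
n = 64
target = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
left = [target[t] + math.sin(2 * math.pi * 9 * t / n) for t in range(n)]
right = [target[t] + math.sin(2 * math.pi * 9 * t / n + 1.0) for t in range(n)]
third = extract_same_phase(left, right)
```

In this toy frame the off-axis tone is rejected and `third` reproduces the on-axis tone. Real recordings would need framing, windowing, and a tolerance tuned to the microphone spacing, none of which the description specifies.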

FIGS. 12 and 13 illustrate the waveforms of the third sound, which is based on the sound collected by the camera 1, and of the sound collected by the recorder 2, treating the target sound (the call of the bird 41) and the noise (such as the rustling of the grass 42) separately. FIG. 12 shows the target sound. Relative to the target sound A2 at the position of the recorder 2, the target sound components A1R and A1L of the third sound at the position of the camera 1 differ only by an arrival delay and a reduction in amplitude due to the difference in distance. In the example of FIG. 12, the arrival time difference is To, and the target sound reaching the recorder 2 and the target sound component in the third sound are out of phase by an amount corresponding to the arrival time difference To.

FIG. 13 shows the noise from the grass 42. The noise AN2 at the position of the recorder 2 and the noise components ANL and ANR in the third sound, which is based on the sound collected by the camera 1, differ by an arrival delay and a change in amplitude due to the difference in distance. In the example of FIG. 13, the arrival time difference is Tb, and the noise AN2 reaching the recorder 2 and the noise components ANL and ANR in the third sound are out of phase by an amount corresponding to the arrival time difference Tb.

In the present embodiment as well, therefore, the sound collected by the recorder 2 is delayed by the arrival time difference To and added to the sound collected by the camera 1. Of the components contained in the sounds collected by the camera 1 and the recorder 2, the target sound, whose arrival time difference is To, is then added in phase and reinforced (its amplitude increases).

The recorder 2 may be placed at any position. In the present embodiment, the recorder 2 may or may not be located on the optical axis VC. Let LA be the distance between the bird 41, which is the sound collection target, and the recorder 2, and let LC be the distance between the bird 41 and the camera 1. The arrival time difference of the target sound can then be calculated from the distance difference (LC - LA). Assuming the recorder 2 is fixed at a given position, the position on the optical axis VC that lies at distance LA from the recorder 2, on the side away from the camera 1, is uniquely determined as the position of the bird 41.
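The uniqueness argument can be made concrete with a small geometric sketch. It intersects the optical axis, modeled as a ray from the camera, with a circle of radius LA around the recorder and keeps the intersection farther from the camera. The 2D coordinates, the function name, and the example values are assumptions for illustration.

```python
import math


def point_on_axis(camera, axis_dir, recorder, dist_from_recorder):
    """Point on the ray from `camera` along `axis_dir` that lies at
    `dist_from_recorder` from `recorder`, taking the solution farther from
    the camera. Returns None if no such point exists on the ray."""
    norm = math.hypot(axis_dir[0], axis_dir[1])
    dx, dy = axis_dir[0] / norm, axis_dir[1] / norm
    vx, vy = camera[0] - recorder[0], camera[1] - recorder[1]
    # Solve |camera + s * d - recorder|^2 = LA^2 for s (a quadratic in s).
    b = vx * dx + vy * dy
    disc = b * b - (vx * vx + vy * vy) + dist_from_recorder ** 2
    if disc < 0:
        return None  # the circle around the recorder misses the axis
    s = -b + math.sqrt(disc)  # larger root: the side away from the camera
    if s < 0:
        return None  # the intersection lies behind the camera
    return (camera[0] + s * dx, camera[1] + s * dy)


# Hypothetical layout in metres: camera at the origin looking along +x,
# recorder off the axis at (7, 3), and LA = 5.
target = point_on_axis((0.0, 0.0), (1.0, 0.0), (7.0, 3.0), 5.0)
LC = math.dist(target, (0.0, 0.0))
LA = math.dist(target, (7.0, 3.0))
```

For these values the circle meets the axis at x = 3 and x = 11, and the point away from the camera is (11, 0), so LC - LA = 6 m is the distance difference from which the arrival time difference would follow.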

Therefore, by performing time-axis shift addition processing, in which at least one of the second sound collected by the recorder 2 and the third sound obtained by the camera 1 is delayed by a time corresponding to the arrival time difference of the target sound between the camera 1 and the recorder 2 and the two are then added, it is possible to emphasize the sound emitted from the position of the bird 41, that is, the target sound emitted from the sound collection target.

In addition, by specifying the position at which the recorder 2 is installed, it is possible to emphasize as the target sound only the sound emitted from a position at a certain distance from the camera 1.

The sound acquisition device of FIG. 10 employs a playback device 50 in place of the playback device 3. In place of the control unit 30, the playback device 50 employs a control unit 51, which is the control unit 30 with a same-phase component extraction processing unit 30c added. The control unit 51 may be configured by a processor using a CPU, an FPGA, or the like, may operate according to a program stored in a memory (not shown) to control each unit, or may realize some or all of its functions with hardware electronic circuits.

As described above, the camera 1 captures images and collects sound with the sound collection target at the center of the screen. The captured image from the camera 1 acquired in this way, the first sound, and the time information, arrival time difference information, and sound generation timing estimation result added to the first sound are supplied to the playback device 50 and recorded in the recording unit 34. The second sound from the recorder 2 and the time information added to the second sound are likewise supplied to the playback device 50 and recorded in the recording unit 34.

During playback, the same-phase component extraction processing unit 30c of the control unit 51 reads the first sound from the recording unit 34 and performs same-phase component extraction processing on the left and right sounds contained in it to obtain the third sound. The same-phase component extraction processing unit 30c supplies the third sound to the time-axis shift addition processing unit 30a, which performs the time-axis shift addition processing using the third sound in place of the first sound.

Next, the operation of the embodiment configured in this way is described with reference to FIG. 14. FIG. 14 is a flowchart showing the control of the playback device. In FIG. 14, steps identical to those in FIG. 7 are given the same reference signs and their description is omitted.

The operations of the camera 1 and the recorder 2 are the same as in FIGS. 5 and 6. In the present embodiment, when capturing images during recorder cooperation, the user keeps the sound collection target at the center of the screen. As a result, among the sounds collected by the microphones 12R and 12L of the ST sound collection unit 12, the components emitted from positions on the optical axis of the imaging unit 11 have identical waveforms in the left and right sounds.

In step S41 of FIG. 14, the playback device 50 obtains the third sound by performing same-phase component extraction processing on the left and right first sounds recorded in the recording unit 34. This third sound corresponds to sound emitted from positions on the optical axis of the imaging unit 11.

The time-axis shift addition processing unit 30a emphasizes the target sound by applying time-axis shift processing to at least one of the third sound and the second sound, with a shift amount based on the arrival time difference, and then adding the two. The target sound extraction unit 30b further performs processing to extract the emphasized target sound.

The synthesized sound from the time-axis shift addition processing unit 30a or the target sound extraction unit 30b is supplied to the playback unit 33 and output acoustically.

The other operations are the same as in the first embodiment.

As described above, in the present embodiment the target sound is collected by a stereo microphone to obtain the first sound. The frequency components that are in the same phase in the left and right first sounds are then extracted to obtain the third sound, which corresponds to sound emitted from positions on a straight line. The arrival time difference is obtained from the difference between the distance from the sound collection target to the stereo microphone and the distance to the second microphone, and at least one of the third sound and the second sound is shifted on the time axis by an amount corresponding to this arrival time difference before the two are combined, thereby emphasizing the target sound. The target sound from the sound collection target located at a given distance from the stereo microphone can thus be reliably emphasized relative to noise emitted from other positions, and the sound collection target can be heard clearly and with high sound quality.

The above embodiment was described on the assumption that sound is collected with the sound collection target at the center of the screen. However, it is also possible to calculate the arrival time difference between the left and right first sounds based on, for example, the position of the sound collection target on the screen or the position of the sound collection target acquired by the position and orientation sensor unit 17. In that case, even when sound is not collected with the target at the center of the screen, the third sound based on sound emitted from a given position on the optical axis of the imaging unit 11 can be obtained by delaying at least one of the left and right first sounds according to this arrival time difference and then performing the same-phase component extraction processing. Conversely, by delaying at least one of the left and right first sounds by a given arrival time difference and then performing the same-phase component extraction processing, it is possible to obtain a third sound based on sound emitted from a given angular direction relative to the optical axis of the imaging unit 11.
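How the left-right arrival time difference relates to the angle from the optical axis is not spelled out in the description. A common far-field approximation, used here purely as an assumption, is that a source at angle theta from the axis produces an inter-microphone delay of d * sin(theta) / c, where d is the microphone spacing and c the speed of sound. The spacing, speed of sound, and function name below are hypothetical.

```python
import math

# Assumed speed of sound in air; not stated in the description.
SPEED_OF_SOUND_M_PER_S = 343.0


def inter_mic_delay(mic_spacing_m, angle_deg, speed=SPEED_OF_SOUND_M_PER_S):
    """Far-field delay (s) between the left and right microphones for a
    source at `angle_deg` from the optical axis (0 means on the axis)."""
    return mic_spacing_m * math.sin(math.radians(angle_deg)) / speed


# Hypothetical 10 cm spacing between microphones 12L and 12R.
on_axis = inter_mic_delay(0.10, 0.0)    # no delay for an on-axis source
off_axis = inter_mic_delay(0.10, 30.0)  # roughly 0.146 ms at 30 degrees
```

Delaying the earlier channel by this amount before the same-phase component extraction would, under the same approximation, make sound from that angle appear in phase in both channels.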

(Third Embodiment)
FIG. 15 is a block diagram showing a sound acquisition device according to the third embodiment of the present invention. In FIG. 15, components identical to those in FIG. 1 are given the same reference numerals and their description is omitted.

In each of the above embodiments, the camera 1 and the recorder 2 acquire the first and second sounds together with the cooperation information, and the playback device 3 automatically performs the enhancement processing on the target sound using the acquired information. In this case, however, arrival time difference information is required as cooperation information, so the distance between the camera 1 and the recorder 2 must be obtained and, depending on the positional relationship among the sound collection target, the camera 1, and the recorder 2, the position of the sound collection target must be acquired as well. For this reason, in each of the above embodiments, the positional relationship among the sound collection target, the camera 1, and the recorder 2 was obtained by having the camera 1 determine the distance to the recorder 2 and the distance and direction to the sound collection target, or by using the positioning function of the recorder 2.

In contrast, the present embodiment makes it possible to omit the calculation of the arrival time difference. In the present embodiment, the shift amount used in the time-axis shift processing for the enhancement processing is obtained by relying on human hearing. Although FIG. 15 shows an example applied to the first embodiment, the present embodiment is equally applicable to the second embodiment of FIG. 10.

The present embodiment employs a camera 60 in place of the camera 1 and omits the playback device 3; the camera 60 incorporates the functions of the playback device 3. The camera 60 includes a control unit 61. The control unit 61 may be configured by a processor using a CPU, an FPGA, or the like, may operate according to a program stored in a memory (not shown) to control each unit, or may realize some or all of its functions with hardware electronic circuits.

制御部６１は、図１の制御部１０から距離及び到達時間差判定部１０ｅと発生タイミング推測部１０ｆとを省略すると共に、図１の制御部３０の時間軸シフト加算処理部３０ａ及び対象音声抽出部３０ｂの構成を追加したものである。更に、制御部６１は、シフト量調整部６１ａを備える。なお、カメラ６０は、図１のカメラ１から画像特徴抽出部１３と位置及び方位センサ部１７とを省略すると共に、表示部１６に代えて再生部３３を採用し、記録部１４に代えて記録部６２を採用している。 The control unit 61 is obtained by omitting the distance and arrival time difference determination unit 10e and the occurrence timing estimation unit 10f from the control unit 10 of FIG. 1 and adding the time-axis shift addition processing unit 30a and the target voice extraction unit 30b of the control unit 30 of FIG. 1. The control unit 61 further includes a shift amount adjusting unit 61a. Note that the camera 60 omits the image feature extraction unit 13 and the position and orientation sensor unit 17 from the camera 1 of FIG. 1, uses the reproduction unit 33 in place of the display unit 16, and uses the recording unit 62 in place of the recording unit 14.

記録部６２は、ハードディスクやメモリ媒体等の図示しない記録媒体により構成されており、制御部６１から与えられた情報を記録すると共に、記録された情報を読み出して制御部６１に出力する。記録部６２にはレコーダ２との通信に関する情報が記録される連携情報部１４ａの他に、第２音声情報部１４ｂを有している。第２音声情報部１４ｂは、レコーダ２から与えられた第２音声が記録されるようになっている。 The recording unit 62 is configured by a recording medium (not shown) such as a hard disk or a memory medium, records the information given from the control unit 61, reads the recorded information and outputs it to the control unit 61. The recording unit 62 has a second audio information unit 14b in addition to the cooperation information unit 14a in which information about communication with the recorder 2 is recorded. The second voice information section 14b is adapted to record the second voice given from the recorder 2.

制御部６１のシフト量調整部６１ａは、操作部１５のユーザ操作に基づく操作信号が与えられる。シフト量調整部６１ａは、操作信号に応じて、時間軸シフト加算処理部３０ａにおける時間軸シフト処理のシフト量を変化させるようになっている。 The shift amount adjusting unit 61a of the control unit 61 is given an operation signal based on a user operation of the operation unit 15. The shift amount adjustment unit 61a is configured to change the shift amount of the time axis shift processing in the time axis shift addition processing unit 30a according to the operation signal.
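The behavior of the shift amount adjusting unit 61a can be sketched in a few lines of Python. This is a minimal illustration, not the implementation described in the source: the class name `ShiftAmountAdjuster`, the step size, the clamp range, and the encoding of the operation signal as +1 or -1 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ShiftAmountAdjuster:
    """Hypothetical stand-in for the shift amount adjusting unit 61a."""

    step_samples: int = 1    # assumed change in shift per button operation
    max_shift: int = 48000   # assumed clamp (1 s at an assumed 48 kHz rate)
    shift: int = 0           # the shift amount may start at 0

    def on_operation(self, direction: int) -> int:
        """Apply one operation signal: +1 increases the shift, -1 decreases it."""
        if direction not in (-1, 1):
            raise ValueError("direction must be +1 or -1")
        candidate = self.shift + direction * self.step_samples
        # Clamp so the shift stays inside the assumed adjustable range.
        self.shift = max(-self.max_shift, min(self.max_shift, candidate))
        return self.shift
```

Each value returned by `on_operation` would then be handed to the time-axis shift addition processing as the new shift amount.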

次に、このように構成された実施の形態の動作について図１６から図１８を参照して説明する。図１６は収音及び録音時の様子を示す説明図である。また、図１７はカメラの制御を示すフローチャートであり、図１８はレコーダの制御を示すフローチャートである。なお、図１７及び図１８においてそれぞれ図５又は図６と同一の手順には同一符号を付して説明を省略する。 Next, the operation of the embodiment configured in this way will be described with reference to FIGS. 16 to 18. FIG. 16 is an explanatory diagram showing the state during sound collection and recording. FIG. 17 is a flowchart showing the control of the camera, and FIG. 18 is a flowchart showing the control of the recorder. In FIGS. 17 and 18, steps identical to those in FIG. 5 or FIG. 6, respectively, are given the same reference numerals and their description is omitted.

本実施の形態においては、カメラ６０は、レコーダ２と連携して収音する場合には、レコーダ２が収音した音声をカメラ６０に転送させるようになっている。即ち、カメラ６０の制御部６１は、図１７のステップＳ４においてレコーダ連携が指定されていると判定した場合には、ステップＳ５において連携依頼通信を行う。レコーダ２の制御部２０は、図１８のステップＳ２３においてカメラ連携が指定されると、ステップＳ６１に処理を移行して音声信号の送信を行う。即ち、収音制御部２０ａに制御されてＳＴ収音部２２により収音された第２音声は、通信部２１ａ又は２１ｂを介してカメラ６０に送信される。なお、この第２音声は、記録部２５に供給されて記録されるようになっていてもよい。カメラ６０の制御部６１は、ステップＳ５の次のステップＳ５１において、レコーダ２からの第２音声を通信部１８ａ又は１８ｂを介して受信する。 In the present embodiment, when the camera 60 collects sound in cooperation with the recorder 2, it has the recorder 2 transfer the sound that the recorder 2 has collected to the camera 60. That is, when the control unit 61 of the camera 60 determines in step S4 of FIG. 17 that recorder cooperation is designated, it performs cooperation request communication in step S5. When camera cooperation is designated in step S23 of FIG. 18, the control unit 20 of the recorder 2 moves the processing to step S61 and transmits the audio signal. That is, the second sound collected by the ST sound collection unit 22 under the control of the sound collection control unit 20a is transmitted to the camera 60 via the communication unit 21a or 21b. The second sound may also be supplied to the recording unit 25 and recorded. In step S51, which follows step S5, the control unit 61 of the camera 60 receives the second sound from the recorder 2 via the communication unit 18a or 18b.

本実施の形態においては、制御部６１は、受信した第２音声を記録部６２の第２音声情報部１４ｂに与えて記録させると共に、記録部６２に記録中の第１音声と、第２音声情報部１４ｂに記録中の第２音声とを時間軸シフト加算処理部３０ａに与えて時間軸シフト加算処理を実行させる。例えば、初期状態では時間軸シフト処理のシフト量は０であるものとしてもよい。この場合には、記録時の遅延及び第２音声の伝送遅延を無視するものとすると、収音中の第１音声と第２音声とが時間軸シフト加算処理部３０ａにおいて時間軸シフト処理されることなくそのまま合成されることになる。時間軸シフト加算処理部３０ａからの合成音はそのまま再生部３３に供給されるか又は対象音声抽出部３０ｂを介して再生部３３に供給される。なお、時間軸シフト加算処理部３０ａ又は対象音声抽出部３０ｂからの合成音は記録部１４に供給されて記録される。 In the present embodiment, the control unit 61 gives the received second voice to the second voice information unit 14b of the recording unit 62 for recording, and also gives the first voice being recorded in the recording unit 62 and the second voice being recorded in the second voice information unit 14b to the time-axis shift addition processing unit 30a to execute the time-axis shift addition processing. For example, the shift amount of the time-axis shift processing may be 0 in the initial state. In this case, if the delay at the time of recording and the transmission delay of the second voice are ignored, the first voice and the second voice being collected are combined as they are in the time-axis shift addition processing unit 30a, without any time-axis shift being applied. The synthesized sound from the time-axis shift addition processing unit 30a is supplied to the reproducing unit 33 either directly or via the target voice extraction unit 30b. The synthesized sound from the time-axis shift addition processing unit 30a or the target voice extraction unit 30b is also supplied to the recording unit 14 and recorded.

図１６はこの収音及び録音時における様子を示している。カメラ６０の筐体６０ａの背面には、再生部３３を構成する表示部３３ｂが設けられている。ユーザは、例えば、筐体３０ａを手で把持して、表示部３３ｂ上の表示を見ながら被写体である鳥４１を視野範囲に捉えた状態で、シャッタボタン１５ａを押下操作することで撮影を行う。図１６では表示部３３ｂ上に鳥４１の画像４１ａが表示されていることを示している。また、筐体６０ａの上面には、操作部１５を構成するモード切り替え用のボタン１５ｃ及びシフト量調整用のボタン１５ｂが配設されている。 FIG. 16 shows the state during this sound collection and recording. A display unit 33b, which forms part of the reproduction unit 33, is provided on the back surface of the housing 60a of the camera 60. The user, for example, grips the housing 30a by hand and, with the bird 41 that is the subject captured within the field of view while watching the display on the display unit 33b, presses the shutter button 15a to shoot. FIG. 16 shows an image 41a of the bird 41 displayed on the display unit 33b. In addition, a mode switching button 15c and a shift amount adjusting button 15b, which form the operation unit 15, are arranged on the upper surface of the housing 60a.

本実施の形態においては、再生部３３は合成音を音響として出力するヘッドホン３３ｃを有している。なお、再生部３３は、ヘッドホン３３ｃに限らず筐体６０ａに内蔵された図示しないスピーカを有して、このスピーカから合成音を音響として出力するようになっていてもよい。ユーザはヘッドホン３３ｃからの合成音を聞きながらボタン１５ｂを操作する。ボタン１５ｂの操作に基づく操作信号はシフト量調整部６１ａに供給される。 In the present embodiment, the reproduction unit 33 has headphones 33c that output the synthesized sound as sound. The reproducing unit 33 may have a speaker (not shown) built in the housing 60a, not limited to the headphones 33c, and output the synthesized sound as sound from the speaker. The user operates the button 15b while listening to the synthetic sound from the headphones 33c. An operation signal based on the operation of the button 15b is supplied to the shift amount adjustment unit 61a.

シフト量調整部６１ａは、ユーザのボタン１５ｂの操作に応じて時間軸シフト処理のシフト量を変化させる指示を時間軸シフト加算処理部３０ａに出力する（ステップＳ５２）。時間軸シフト加算処理部３０ａは、シフト量調整部６１ａの指示に従ったシフト量だけ、記録部１４から読み出した第１音声及び第２音声のうちの少なくとも一方に対して時間軸シフト処理を行った後、第１音声及び第２音声を加算する。 The shift amount adjustment unit 61a outputs an instruction to change the shift amount of the time axis shift processing to the time axis shift addition processing unit 30a according to the user's operation of the button 15b (step S52). The time axis shift addition processing unit 30a performs the time axis shift processing on at least one of the first voice and the second voice read from the recording unit 14 by the shift amount according to the instruction of the shift amount adjusting unit 61a. Then, the first voice and the second voice are added.
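The time-axis shift addition described above amounts to delaying one signal by a number of samples and summing the two signals sample by sample. The following Python sketch illustrates this under stated assumptions: both sounds are equal-length lists of samples at a common sampling rate, the shift is applied to the second sound only (the text allows either sound to be shifted), and the function names `shift_signal` and `time_axis_shift_add` are hypothetical.

```python
from typing import List, Sequence


def shift_signal(signal: Sequence[float], shift: int) -> List[float]:
    """Delay `signal` by `shift` samples (advance it if `shift` is negative),
    padding with zeros and keeping the original length."""
    n = len(signal)
    out = [0.0] * n
    for i, value in enumerate(signal):
        j = i + shift
        if 0 <= j < n:
            out[j] = value
    return out


def time_axis_shift_add(
    first: Sequence[float], second: Sequence[float], shift: int
) -> List[float]:
    """Shift the second sound by `shift` samples, then add it to the first."""
    if len(first) != len(second):
        raise ValueError("both sounds must have the same length")
    shifted = shift_signal(second, shift)
    return [a + b for a, b in zip(first, shifted)]
```

With a shift of 0 the two sounds are simply summed, which matches the initial state described earlier; when the shift equals the arrival time difference in samples, the components from the sound collection target line up and reinforce each other.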

ユーザのボタン１５ｂの操作量に応じて、時間軸シフト処理のシフト量が変化する。ユーザは、ヘッドホン３３ｃから聞こえる音声が対象音声と考えられる音声として最もよく聞こえるように、ボタン１５ｂを操作する。ユーザがヘッドホン３３ｃから対象音声が最もよく聞こえる、即ち、対象音声が最も適切に強調処理されたと判断すると、ボタン１５ｂの操作を停止する。このボタン１５ｂの操作に対応するシフト量は、対象音声がレコーダ２に到達する時間とカメラ６０に到達する時間との到達時間差に相当する。即ち、本実施の形態においては、ユーザの聴覚を利用することで、測距等を行うことなく、到達時間差の情報を取得することができることになる。 The shift amount of the time axis shift process changes according to the operation amount of the user's button 15b. The user operates the button 15b so that the sound heard from the headphones 33c is best heard as the sound considered to be the target sound. When the user determines that the target voice is best heard from the headphones 33c, that is, the target voice is most appropriately emphasized, the operation of the button 15b is stopped. The shift amount corresponding to the operation of the button 15b corresponds to the arrival time difference between the time when the target voice reaches the recorder 2 and the time when the target voice reaches the camera 60. That is, in the present embodiment, by utilizing the hearing of the user, it is possible to acquire the information on the arrival time difference without performing distance measurement or the like.
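Why the shift at which the target is heard best corresponds to the arrival time difference can be illustrated numerically. In the sketch below, the listener's judgment is replaced by a simple stand-in, the energy of the summed signal, which is an assumption made purely for illustration; the source relies on the user's hearing and does not describe any such automatic criterion. The function names and the test signal are likewise hypothetical.

```python
from typing import Sequence


def shift_add_energy(
    first: Sequence[float], second: Sequence[float], shift: int
) -> float:
    """Energy of first + (second delayed by `shift` samples)."""
    n = len(first)
    total = 0.0
    for i in range(n):
        j = i - shift  # sample of the second sound that lands on index i
        s = first[i] + (second[j] if 0 <= j < n else 0.0)
        total += s * s
    return total


def best_shift(
    first: Sequence[float], second: Sequence[float], max_shift: int
) -> int:
    """Sweep the shift, as the user does with the button, and keep the
    value that maximizes the stand-in loudness measure."""
    candidates = range(-max_shift, max_shift + 1)
    return max(candidates, key=lambda s: shift_add_energy(first, second, s))


# Assumed test signal: the same short pattern reaches the recorder
# (second sound) 5 samples before it reaches the camera (first sound).
pattern = [1.0, -2.0, 3.0, -1.0]
second = [0.0] * 3 + pattern + [0.0] * 13
first = [0.0] * 8 + pattern + [0.0] * 8
found = best_shift(first, second, 10)
```

In this example the sweep settles on the 5-sample offset built into the test signal, because only at that shift do the two copies of the pattern add coherently.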

カメラ６０は、第１及び第２音声だけでなく、合成音についても記録部６２に与えて記録する（ステップＳ５３）。なお、この場合には、カメラ６０は、到達時間差に関する情報についても合成音に関連付けて記録するようになっていてもよい。 The camera 60 gives not only the first and second sounds but also the synthesized sound to the recording unit 62 and records it (step S53). In this case, the camera 60 may record the information regarding the arrival time difference in association with the synthetic sound.

他の作用は第１及び第２の実施の形態と同様である。 Other functions are similar to those of the first and second embodiments.

このように本実施の形態においては、第１及び第２の実施の形態と同様の効果が得られると共に、対象音声の強調処理のための時間軸シフト処理におけるシフト量を、人間の聴覚を利用して取得するようになっており、到達時間差の算出及びこの算出に必要な測距等の処理を省略することが可能である。 As described above, the present embodiment provides the same effects as the first and second embodiments. In addition, because the shift amount in the time-axis shift processing for emphasizing the target voice is obtained by relying on human hearing, the calculation of the arrival time difference and the processing required for that calculation, such as distance measurement, can be omitted.

なお、上記実施の形態においては、収音中の第２音声について伝送遅延を無視してカメラの記録部に記録されるものとして説明したが、ユーザの聴覚に従って、時間軸シフト処理のシフト量を調整するものであるので、伝送遅延があったとしても、特に問題はない。 In the above embodiment, the second sound being collected was described as being recorded in the recording unit of the camera with the transmission delay ignored. However, because the shift amount of the time-axis shift processing is adjusted according to the user's hearing, a transmission delay, if present, poses no particular problem.
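The reason a transmission delay causes no problem can be made explicit with a short derivation. This is a sketch under assumptions not stated in the source: the sound is emitted at time 0, d_1 and d_2 are the distances from the sound collection target to the camera and to the recorder, c is the speed of sound, \tau is the transmission delay of the second sound, and both sounds are placed on the camera's time axis.

```latex
t_1 = \frac{d_1}{c},
\qquad
t_2' = \frac{d_2}{c} + \tau,
\qquad
s^{*} = t_1 - t_2' = \frac{d_1 - d_2}{c} - \tau
```

Here s^{*} is the delay to apply to the second sound so that the target components coincide; if s^{*} is negative, the first sound is delayed by \lvert s^{*} \rvert instead. The transmission delay only moves the point at which the user stops adjusting, so it is absorbed into the shift amount chosen by ear.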

また、カメラによって収音した第１音声とレコーダによって収音した第２音声との同期がある程度とれるならば、カメラの記録部に記録された第１及び第２音声を、あるいはカメラの記録部に記録された第１音声とレコーダの記録部に記録された第２音声とを記録後に読み出し、再生時にユーザの聴覚を利用して時間軸シフト処理のシフト量を調整することにより、適切に強調された対象音声を聞くことも可能である。 In addition, if the first sound collected by the camera and the second sound collected by the recorder can be synchronized to some extent, the first and second sounds recorded in the recording unit of the camera, or the first sound recorded in the recording unit of the camera and the second sound recorded in the recording unit of the recorder, can be read out after recording, and the shift amount of the time-axis shift processing can be adjusted by the user's hearing during playback, so that an appropriately emphasized target sound can be heard.

上記実施の形態においては、撮像のための機器として、デジタルカメラを用いて説明したが、カメラとしては、デジタル一眼レフカメラでもコンパクトデジタルカメラでもよく、ビデオカメラ、ムービーカメラでもよく、さらに、携帯電話やスマートフォンなど携帯情報端末（ＰＤＡ：Personal Digital Assist）等に内蔵されるカメラでも勿論構わない。 In the above embodiment, a digital camera was described as the device for imaging, but the camera may be a digital single-lens reflex camera or a compact digital camera, may be a video camera or a movie camera, and may of course be a camera built into a personal digital assistant (PDA) such as a mobile phone or a smartphone.

本発明は、上記各実施形態にそのまま限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で構成要素を変形して具体化できる。また、上記各実施形態に開示されている複数の構成要素の適宜な組み合わせにより、種々の発明を形成できる。例えば、実施形態に示される全構成要素の幾つかの構成要素を削除してもよい。さらに、異なる実施形態にわたる構成要素を適宜組み合わせてもよい。 The present invention is not limited to the above embodiments as they are, and can be embodied by modifying the constituent elements within a range not departing from the gist of the invention in an implementation stage. Further, various inventions can be formed by appropriately combining a plurality of constituent elements disclosed in each of the above embodiments. For example, some of all the constituent elements shown in the embodiment may be deleted. Furthermore, the constituent elements of different embodiments may be combined appropriately.

なお、特許請求の範囲、明細書、および図面中の動作フローに関して、便宜上「まず、」、「次に、」等を用いて説明したとしても、この順で実施することが必須であることを意味するものではない。また、これらの動作フローを構成する各ステップは、発明の本質に影響しない部分については、適宜省略も可能であることは言うまでもない。 Even where the operation flows in the claims, the description, and the drawings are explained using terms such as "first" and "next" for convenience, this does not mean that the steps must be performed in that order. It also goes without saying that the steps constituting these operation flows may be omitted as appropriate where they do not affect the essence of the invention.

なお、ここで説明した技術のうち、主にフローチャートで説明した制御に関しては、プログラムで設定可能であることが多く、記録媒体や記録部に収められる場合もある。この記録媒体、記録部への記録の仕方は、製品出荷時に記録してもよく、配布された記録媒体を利用してもよく、インターネットを介してダウンロードしたものでもよい。 Of the techniques described here, the control described mainly with flowcharts can in many cases be set by a program, which may be stored in a recording medium or a recording unit. The program may be recorded on the recording medium or in the recording unit at the time of product shipment, may be supplied on a distributed recording medium, or may be downloaded via the Internet.

なお、実施例中で、「部」（セクションやユニット）として記載した部分は、専用の回路や、複数の汎用の回路を組み合わせて構成してもよく、必要に応じて、予めプログラムされたソフトウェアに従って動作を行うマイコン、ＣＰＵなどのプロセッサ、あるいはＦＰＧＡなどシーケンサを組み合わせて構成されてもよい。また、その制御の一部または全部を外部の装置が引き受けるような設計も可能で、この場合、有線や無線の通信回路が介在する。通信は、ブルートゥースやＷｉＦｉ、電話回線などで行えばよく、ＵＳＢなどで行っても良い。専用の回路、汎用の回路や制御部を一体としてＡＳＩＣとして構成してもよい。 In the embodiments, each portion described as a "unit" (section or unit) may be configured by a dedicated circuit or by a combination of general-purpose circuits and, as necessary, may be configured by combining a microcomputer or a processor such as a CPU that operates according to preprogrammed software, or a sequencer such as an FPGA. A design in which an external device takes over part or all of the control is also possible, in which case a wired or wireless communication circuit intervenes. The communication may be performed over Bluetooth, WiFi, a telephone line, or the like, or over USB or the like. A dedicated circuit, a general-purpose circuit, and a control unit may also be integrated and configured as an ASIC.

１…カメラ、１０…制御部、１０ａ…撮影制御部、１０ｂ…画像処理部、１０ｃ…画角情報部、１０ｄ…収音制御及び処理部、１０ｅ…距離及び到達時間差判定部、１０ｆ…発生タイミング推測部、１１…撮像部、１１ａ…光学系、１２…ＳＴ収音部、１３…画像特徴抽出部、１４…記録部、１４ａ…連携情報部、１６…表示部、１７…位置及び方位センサ部、１８…通信部、２…レコーダ、２０…制御部、２０ａ…収音制御部、２２…ＳＴ収音部、２５…記録部、２５ａ…連携情報部、３…再生装置、３０…制御部、３０ａ…時間軸シフト加算処理部、３０ｂ…対象音声抽出部、３０ｃ…同一位相成分抽出処理部。 DESCRIPTION OF SYMBOLS 1... Camera, 10... Control part, 10a... Shooting control part, 10b... Image processing part, 10c... View angle information part, 10d... Sound collection control and processing part, 10e... Distance and arrival time difference determination part, 10f... Occurrence timing Estimating unit, 11... Imaging unit, 11a... Optical system, 12... ST sound collecting unit, 13... Image feature extracting unit, 14... Recording unit, 14a... Linkage information unit, 16... Display unit, 17... Position and direction sensor unit , 18... communication unit, 2... recorder, 20... control unit, 20a... sound collection control unit, 22... ST sound collection unit, 25... recording unit, 25a... cooperation information unit, 3... playback device, 30... control unit, 30a... Time-axis shift addition processing section, 30b... Target speech extraction section, 30c... In-phase component extraction processing section.

Claims

A first microphone that is arranged at a first distance from the sound collection target and that collects the sound emitted from the sound collection target to obtain the first sound;
A second microphone arranged apart from the sound collecting target by a second distance different from the first distance to collect a sound emitted from the sound collecting target and obtain a second sound;
A voice acquisition device, comprising: a voice enhancement unit that performs enhancement processing for enhancing a component based on a distance difference between the first distance and the second distance of the first and second voices.

An image acquisition unit that acquires the image of the sound collection target,
By the image analysis of the acquired image, a generation timing estimation unit that estimates the generation timing of the sound emitted from the sound collection target,
The voice acquisition device according to claim 1, wherein the voice enhancement unit adjusts the enhancement processing based on a result of estimation of a timing at which the sound is generated.

The voice acquisition device according to claim 1, wherein the voice enhancement unit obtains an arrival time difference with which the sound emitted from the sound collection target reaches the first microphone and the second microphone according to the distance difference between the first distance and the second distance, and performs the enhancement processing by delaying and adding at least one of the first voice and the second voice with a delay time based on the obtained arrival time difference.

The voice acquisition device according to claim 3, wherein the voice enhancement unit obtains the arrival time difference based on a distance between the first microphone and the second microphone.

The voice acquisition device according to claim 3, further comprising a distance and arrival time difference determination unit that obtains the arrival time difference by obtaining the distance difference between the first distance and the second distance.

The voice acquisition device according to claim 3, wherein the voice enhancement unit includes a target voice extraction unit that extracts, from a synthesized voice obtained by time-axis shift addition processing in which at least one of the first voice and the second voice is delayed with a delay time based on the arrival time difference and the two are added, the component emphasized by the time-axis shift addition processing of the first and second voices.

The first microphone is a stereo microphone that picks up a left first sound and a right first sound,
It further comprises an in-phase component extraction processing unit for extracting, as a third voice, a component in which the left and right first voices picked up by the first microphone have the same phase.
The voice emphasizing unit performs an emphasizing process for emphasizing a component based on a distance difference between the first distance and the second distance in the third sound and the second sound instead of the first sound. The voice acquisition device according to claim 1.

An image acquisition unit for acquiring the captured image of the sound collection target,
The voice acquisition device according to claim 7, wherein the same-phase component extraction processing unit extracts the third voice in a state in which the image acquisition unit positions the sound collection target at the center of the captured image.

The voice enhancement unit includes a time-axis shift addition processing unit that delays at least one of the first voice and the second voice with a delay time based on the time difference with which the sound emitted from the sound collection target reaches the first microphone and the second microphone according to the distance difference between the first distance and the second distance, and adds them to obtain a synthesized sound;
A playback unit that plays back the synthesized sound,
The voice acquisition device according to claim 1, further comprising a shift amount adjustment unit that adjusts the delay time according to an operation signal based on a user operation.

A first microphone arranged at a first distance from the sound collecting target collects a sound emitted from the sound collecting target to obtain a first sound,
A second microphone arranged apart from the sound collecting target by a second distance different from the first distance collects a sound emitted from the sound collecting target to obtain a second sound,
A voice acquisition method characterized by performing an emphasis process for emphasizing a component based on a distance difference between the first distance and the second distance among the first and second voices.

The voice acquisition method according to claim 10, wherein an arrival time difference with which the sound emitted from the sound collection target reaches the first microphone and the second microphone is obtained according to the distance difference between the first distance and the second distance, and the enhancement processing is performed by delaying and adding at least one of the first voice and the second voice with a delay time based on the obtained arrival time difference.

The first microphone picks up a left first sound and a right first sound,
A component in which the phases of the left and right first sounds picked up by the first microphone are the same is extracted as the third sound,
11. The emphasizing process for emphasizing a component based on a distance difference between the first distance and the second distance of the third voice and the second voice instead of the first voice is performed. The voice acquisition method described.

At least one of the first voice and the second voice is delayed with a delay time based on the time difference with which the sound emitted from the sound collection target reaches the first microphone and the second microphone according to the distance difference between the first distance and the second distance, and the two are added to obtain a synthesized sound,
The synthesized sound is played back,
The voice acquisition method according to claim 10, wherein the delay time is adjusted by an operation signal based on a user operation.

On the computer,
A first microphone arranged at a first distance from the sound collecting target collects a sound emitted from the sound collecting target to obtain a first sound,
A second microphone arranged apart from the sound collecting target by a second distance different from the first distance collects a sound emitted from the sound collecting target to obtain a second sound,
A voice acquisition program for executing a procedure of performing an emphasis process for emphasizing a component based on a distance difference between the first distance and the second distance among the first and second voices.

The voice acquisition program according to claim 14, which executes a procedure of obtaining an arrival time difference with which the sound emitted from the sound collection target reaches the first microphone and the second microphone according to the distance difference between the first distance and the second distance, and performing the enhancement processing by delaying and adding at least one of the first voice and the second voice with a delay time based on the obtained arrival time difference.

The first microphone picks up a left first sound and a right first sound,
A component in which the phases of the left and right first sounds picked up by the first microphone are the same is extracted as the third sound,
15. A procedure for executing an emphasizing process for emphasizing a component based on a distance difference between the first distance and the second distance of the third sound and the second sound instead of the first sound. The voice acquisition program described in.

At least one of the first voice and the second voice is delayed with a delay time based on the time difference with which the sound emitted from the sound collection target reaches the first microphone and the second microphone according to the distance difference between the first distance and the second distance, and the two are added to obtain a synthesized sound,
The synthesized sound is played back,
The voice acquisition program according to claim 14, which executes a procedure of adjusting the delay time by an operation signal based on a user operation.