JP4861723B2

JP4861723B2 - Monitoring system

Info

Publication number: JP4861723B2
Application number: JP2006049595A
Authority: JP
Inventors: 弘美青柳
Original assignee: Ikegami Tsushinki Co Ltd
Current assignee: Ikegami Tsushinki Co Ltd
Priority date: 2006-02-27
Filing date: 2006-02-27
Publication date: 2012-01-25
Anticipated expiration: 2026-02-27
Also published as: JP2007228459A

Description

The present invention relates to a monitoring system in which a surveillance camera photographs a monitoring area and sound in and around that area is recorded, so that the area can be monitored over a communication cable, the Internet, a mobile phone network, or the like.

Conventionally, some monitoring systems watch a monitoring area for abnormalities using both video and audio, and activate the surveillance camera when the sound volume level reaches or exceeds a predetermined threshold, for example when there is a noise or a loud sound. Referring to FIG. 10, an optical unit 2 is mounted on a camera unit 1. The camera unit 1 contains a camera control circuit unit 4, which processes the image signal from a CCD 3, and an overall control circuit unit 5, which controls the entire camera. Information from a sound sensor 8 and an optical sensor 9 is input to the overall control circuit unit 5, which recognizes an abnormal state only when the sound sensor 8 detects a volume above a certain level and then controls the camera control circuit unit 4 to capture images. The images captured during the abnormality are displayed on a monitor 7 connected to the camera unit 1 by a cable 6, so that the state of the monitoring area can be observed (see, for example, Patent Document 1).

Japanese Patent Laying-Open No. 2003-333583 (paragraphs [0014] to [0017] of the specification; FIG. 1 of the drawings)

However, the conventional monitoring system starts photographing the monitoring area when the volume of audible sound in the area exceeds a predetermined threshold. To catch an abnormality the observer must watch the monitor constantly, yet the only way for the monitoring side to judge the content or cause of the sound is to check the monitor image. A mere impact sound from a falling object, or a sound from some other cause that is not an abnormality, can activate the system and bring up the video. A system that triggers video on volume level alone therefore cannot identify the cause of a sound generated in or around the monitoring area, and noise frequently brings up video when nothing is wrong, which is a nuisance for the observer. Such a system is undesirable, and it cannot distinguish an abnormal situation by sound in addition to video.

The present invention has been made in view of the above problems, and its object is to provide a monitoring system that, in addition to video of the monitoring area, can identify audible sounds in and around the area and issue a warning of an abnormality.

The present invention has been made to achieve the above object. The invention of claim 1 is a monitoring system that records, in a monitoring device, images captured by a camera photographing a monitoring area, wherein
the monitoring device is provided with a microphone that picks up sound in the monitoring area, and comprises image processing means that processes the video signal output from the camera into image data, and a monitor that displays the image data;
the monitoring device further comprises:
voice/non-voice recognition means based on a time-delay neural network, which has a learning function based on teacher signals from a feature database, compares and evaluates the sound picked up by the microphone against phoneme data, recognizes the phonemes of the sound, and outputs them as phoneme time-series data;
abnormal voice/abnormal sound identification means that recognizes predetermined voice or non-voice from the phoneme time-series data supplied by the voice/non-voice recognition means and determines whether it is an abnormal voice or abnormal sound to be guarded against;
image storage means that, when the abnormal voice/abnormal sound identification means detects an abnormality, stores the abnormality occurrence image data from the time the abnormal voice or abnormal sound was detected in the monitoring area, together with the image data before and after it;
abnormal voice character data output means that converts the abnormal voice or abnormal sound from the abnormal voice/abnormal sound identification means into warning character data;
abnormal voice output means that announces the abnormal voice or abnormal sound by speech synthesis;
first sending means that adds the warning character data to the abnormality occurrence image data obtained from the image storage means and sends it to a network; and
second sending means that uses image size conversion means to convert the abnormality occurrence image data with the warning character data added into image data receivable by a mobile phone terminal, and sends it to the network;
and wherein the second sending means notifies the mobile phone terminal, via the network, of the abnormality occurrence image data and the preceding and following image data when an abnormality due to sound occurs in the monitoring area.

The invention of claim 2 is the monitoring system according to claim 1, wherein the non-voice is an abnormal sound such as a scream or a noise.

The invention of claim 3 is the monitoring system according to claim 1 or 2, wherein the monitoring device comprises time data sending means, adds the warning character data to the abnormality occurrence image data, and, based on time information from the time data sending means, adds the date and time of the abnormality to the abnormality occurrence image data in association with the position information of the monitoring area, and outputs the result.

The voice/non-voice recognition means is not limited to a neural network. However, the neural network has a feature database unit that stores characteristic parameters of phonemes, and supplying teacher signals from the feature database unit to the neural network raises its discrimination efficiency, so that the discrimination of voice and non-voice to be guarded against can be optimized for the place where the surveillance camera is installed. A neural network is therefore preferable to other speech recognition means.

In the invention of claim 1, the monitoring system records, in a monitoring device, images captured by a camera photographing a monitoring area, and has the configuration recited in claim 1: the microphone, the image processing means, and the monitor; the voice/non-voice recognition means based on a time-delay neural network, with its learning function based on teacher signals from the feature database; the abnormal voice/abnormal sound identification means; the image storage means that stores the abnormality occurrence image data together with the image data before and after it; the abnormal voice character data output means; the abnormal voice output means; and the first and second sending means, the latter notifying the mobile phone terminal via the network of the abnormality occurrence image data and the preceding and following image data. Consequently, the voice/non-voice recognition means detects the phonemes of voice and of non-voice such as a security buzzer or a scream, and the abnormal voice/abnormal sound identification means can determine whether the phonemes output in time series represent voice or non-voice that should be regarded as abnormal for monitoring purposes. A warning message announcing the abnormality can be superimposed on the video from the time of the abnormality on the monitor screen, which makes the abnormality easier to recognize than video alone and makes monitoring work more efficient.

Also in the invention of claim 1, the monitoring device has voice output means that announces the abnormal voice or abnormal sound by speech synthesis. The warning message is not only shown on the video; warning information announcing the abnormality can also be given by voice, which makes the abnormality easier to recognize than video alone.

Also in the invention of claim 1, the monitoring device comprises first sending means that superimposes the warning character data on the image data and sends it to the network. When an abnormality is recognized from voice or non-voice, it can be distributed over the Internet, so a specific person such as an observer can monitor the abnormal state of a monitoring area at a remote location.

Also in the invention of claim 1, the monitoring device comprises second sending means that uses the image size conversion means to convert the image data with the warning character data superimposed into image data receivable by a mobile phone terminal and sends it to the network. An abnormality in the monitoring area can thus be distributed to the mobile phone network via the Internet, and a specific person such as an observer can check the abnormal state on a mobile phone terminal even from a remote location.

In the invention of claim 2, the monitoring system according to claim 1 is characterized in that the non-voice is an abnormal sound such as a scream or a noise. Because the neural network can identify abnormal sounds in addition to judging abnormalities in voice from phonemes, the system can be used for many kinds of monitoring. Furthermore, a monitoring device with a camera and a microphone can be installed in each monitoring area, and multiple monitoring areas can be monitored remotely over the network from a computer, a mobile phone terminal, or the like. When an abnormal situation is recognized from voice or non-voice, the warning message is distributed over the Internet together with the video, so a specific person such as an observer can confirm the abnormality in the monitoring area even from a remote location, and monitoring is efficient.

Because the voice/non-voice recognition means is a time-delay neural network, it can be trained on the phonemes of keywords that express matters to watch for in the monitoring area, such as "Fire!", "Thief!", "Help!", and "Kyaa!" (a scream), and on the phonemes of sounds such as a security buzzer. The system can therefore handle the monitoring items of various monitoring areas, which makes it a versatile monitoring system. In addition, because abnormality information for the monitoring area is distributed via the Internet, the area can be monitored from a remote location.

In the invention of claim 3, the monitoring system according to claim 1 or 2 is characterized in that the monitoring device comprises time data sending means, superimposes the warning character data on the image data, and, based on time information from the time data sending means, superimposes the date and time of the abnormality on the image data in association with the position information of the monitoring area and outputs the result. The date and time of the abnormality and the monitoring area (location) can therefore be identified together with the video.

Hereinafter, the monitoring system according to the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing one embodiment of the present invention, FIG. 2 is a functional block diagram of the monitoring device, and FIG. 3 is a block diagram of the monitoring system. FIG. 4 is a diagram of a neural network showing an example of the voice/non-voice recognition processing unit. FIG. 5 is an explanatory diagram of the method of selecting images when an abnormality occurs, and FIG. 6 shows the display screen of a mobile phone. FIG. 7 is a block diagram showing another embodiment of the present invention, FIG. 8 is a functional block diagram of its monitoring device, and FIG. 9 is a block diagram of its monitoring system.

(Embodiment 1)
A monitoring system according to one embodiment of the present invention will be described with reference to FIGS. 1 to 6. As shown in FIG. 1, in the monitoring system of this embodiment a microphone M and a camera C are connected to a monitoring device 10. The monitoring device 10 is connected to a client computer PC via the Internet N, and is also connected via the Internet N to a mobile phone network D and thereby to a mobile phone terminal T. The microphone M picks up sound in or around a monitoring area E1 and the camera C photographs the inside of the monitoring area E1. This audio and video information is sent to the monitoring device 10, and an observer can monitor the monitoring area E1 by video and audio information on the client computer PC or the mobile phone terminal T. A monitoring device 10 is installed for each of a plurality of monitoring areas E1 to En, and the microphone M and camera C of each area are connected to the monitoring device 10 via a communication cable. Of course, the monitoring device 10 may monitor a single monitoring area or several, and the microphone M may be integrated with the camera C or placed adjacent to or apart from it.

The monitoring device 10 determines whether the sound captured by the microphone M is voice or non-voice related to a monitoring item. If it is, the device judges that a situation requiring a warning has arisen (an alarm signal is generated), outputs warning character data, selects the images from the time of the alarm, superimposes the warning character data on those images, and also converts the warning character data into voice to give an audible warning. The monitoring device 10 also has an upload function (sending means) for distributing over the Internet the warning display video, in which the warning character data is superimposed on the picture.

Through the client computer PC, the observer can display the uploaded warning display video on a monitor via the Internet N. By connecting the Internet N to the mobile phone network D, the observer can also display the uploaded warning display video on the display screen of the mobile phone terminal T and monitor each of the monitoring areas E1 to En from a remote location. Naturally, the client computer PC and the mobile phone terminal T cannot view the warning display video on this network unless personal authentication succeeds, for example by a PIN and password or by biometric authentication. It is also desirable to install a router or set up a firewall so that the network cannot be intruded on from outside.

Next, the monitoring device 10 will be described with reference to the functional block diagram of FIG. 2, beginning with the audio signal processing system. The monitoring device 10 is equipped with a CPU (central processing unit). It has voice/non-voice recognition means 11, which digitally processes the sound picked up in the monitoring area by the microphone M and recognizes phonemes, and a control processing unit 12, which receives the phoneme time-series data output by the voice/non-voice recognition means 11. The control processing unit 12 provides the functions of abnormal voice/abnormal sound identification means 12a, which identifies the content of the phoneme time-series data obtained from the voice/non-voice recognition means 11; abnormal voice data sending means 12b, which outputs the abnormal voice or abnormal sound identified by the means 12a as voice data, that is, voice information for driving a speaker S; and time data sending means 12d, which specifies the date and time at which the abnormality occurred. The monitoring device 10 further has abnormal voice output means 13, which outputs the abnormal voice data from the abnormal voice data sending means 12b as a voice signal, and abnormal voice character data sending means 14, which outputs the abnormal voice or abnormal sound identified by the means 12a as a warning character data signal. The processing of the abnormal voice/abnormal sound identification means 12a may instead be performed in a stage preceding the control processing unit 12.
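The identification step performed by the means 12a can be illustrated with a minimal sketch. It assumes that the recognizer emits one phoneme label per frame, that consecutive repeats are merged and silence (Q) is dropped, and that keywords are stored as phoneme tuples. The `KEYWORDS` table and the function names are hypothetical; the patent does not specify how the matching is implemented.

```python
from itertools import groupby

# Hypothetical keyword table (an assumption, not taken from the patent):
# phoneme sequences after merging repeats, mapped to the word they spell.
KEYWORDS = {
    ("t", "a", "s", "u", "k", "e", "t", "e"): "tasukete (help)",
    ("d", "o", "r", "o", "b", "o"): "dorobo (thief)",
}

def collapse(phonemes):
    """Merge runs of the same per-frame phoneme, then drop silence (Q)."""
    return [p for p, _ in groupby(phonemes) if p != "Q"]

def find_keyword(phonemes):
    """Return the first registered keyword found in the phoneme series, or None."""
    seq = collapse(phonemes)
    for pattern, word in KEYWORDS.items():
        n = len(pattern)
        for i in range(len(seq) - n + 1):
            if tuple(seq[i:i + n]) == pattern:
                return word
    return None

# Per-frame output in which some phonemes span several frames.
frames = ["Q", "Q", "t", "t", "a", "a", "a", "s", "u", "k", "e", "e", "t", "e", "Q"]
print(find_keyword(frames))
```

Exact matching is the simplest possible choice here; a practical system would need to tolerate recognition errors in the phoneme series.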

The phonemes are described later. In the term voice/non-voice, "voice" means a human utterance whose meaning can be understood, for example words that indicate an abnormal state such as "Thief!", "Robber!", "Fire!", "Help!", or "Stop!". "Non-voice" means a human vocal sound or a noise that carries no meaning, such as a scream, crying (a baby crying, for example), the sound of a security buzzer, or a crash (glass breaking, vehicles colliding, or objects being smashed). When an abnormality is recognized from voice or non-voice, the abnormal voice/abnormal sound recognition information is announced to the observer by driving the speaker S. For voice, the message spoken through the speaker S or displayed from the warning character data signal is, for example: The voice "Thief! / Robber! / Fire! / Help! / Stop!" was recognized. Please respond immediately. For non-voice it is, for example: A "security buzzer sound / scream" was recognized. Please respond immediately.
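The two message templates described above could be produced by a small helper such as the following sketch. The function name, the `kind` values, and the exact English wording are assumptions made for illustration; the patent gives the messages only as examples.

```python
def warning_text(kind, label):
    """Build the warning message for a recognized voice or non-voice event.

    kind  -- "voice" for an intelligible utterance, "non-voice" otherwise
    label -- the recognized keyword or sound name
    """
    if kind == "voice":
        return f'The voice "{label}" was recognized. Please respond immediately.'
    if kind == "non-voice":
        return f'A "{label}" was recognized. Please respond immediately.'
    raise ValueError(f"unknown kind: {kind!r}")

print(warning_text("voice", "Help!"))
print(warning_text("non-voice", "security buzzer sound"))
```

The same string can feed both the speech synthesis path and the text overlay, which matches the description of a single recognition result driving the speaker S and the warning character data signal.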

Next, the video signal processing system will be described. The monitoring device 10 has: image processing means 15, which converts the video signal from the camera C into a digital signal and compresses it; image storage means 16, which repeatedly stores the compressed image data frame by frame or field by field; video output means 17, which decompresses the frame or field images stored in the image storage means 16 and converts them into an NTSC video signal; video composition means 18, which superimposes a warning message based on the warning character data signal from the abnormal voice character data sending means 14 on the video signal from the video output means 17 and outputs it to the monitor W; image size conversion means 19, which takes in the character signal from the abnormal voice character data sending means 14 and converts the frame or field image A stored in the image storage means 16 to fit within a predetermined image size displayable on the mobile phone terminal display shown in FIG. 6; and sending means 20, which uploads the warning display video, in which warning information (alarm date and time, alarm location, warning message) is superimposed on the video, to a Web server (not shown) for distribution to the Internet N and, via the Internet N, to the mobile phone network.
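One way to keep the image at the moment of an alarm together with the images before and after it is a fixed-length ring buffer, sketched below. The class name, the `pre` and `post` frame counts, and the rule that a second alarm during an open clip is ignored are all assumptions; the selection method of FIG. 5 is not reproduced here.

```python
from collections import deque

class EventFrameBuffer:
    """Keep the last `pre` frames; on an alarm, collect them plus the alarm
    frame and the next `post` frames into one clip."""

    def __init__(self, pre=3, post=3):
        self._recent = deque(maxlen=pre)  # frames continuously overwritten
        self._post = post
        self._clip = None                 # clip being collected, if any
        self._remaining = 0               # post-alarm frames still needed

    def push(self, frame, alarm=False):
        """Store one frame; return a finished clip or None."""
        if self._clip is None and alarm:
            self._clip = list(self._recent) + [frame]
            self._remaining = self._post
        elif self._clip is not None:
            self._clip.append(frame)
            self._remaining -= 1
        self._recent.append(frame)
        if self._clip is not None and self._remaining == 0:
            done, self._clip = self._clip, None
            return done
        return None

buf = EventFrameBuffer(pre=2, post=2)
clips = []
for n in range(10):                       # frame numbers stand in for images
    clip = buf.push(n, alarm=(n == 5))
    if clip is not None:
        clips.append(clip)
print(clips)
```

Because the buffer overwrites old frames continuously, the memory footprint stays constant no matter how long the system runs without an alarm.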

As shown in FIG. 6, the display of the mobile phone terminal consists of display screens A and B. Display screen A is the area that shows the monitoring image, and display screen B is the area that shows abnormality information as text when an abnormality occurs. Display screen B shows the alarm date and time, the alarm location, and the abnormal voice/abnormal sound recognition information. The screen size of the mobile phone is at most QVGA (Quarter Video Graphics Array), 240 pixels wide by 320 pixels high, of which the effective size is 240 pixels wide by 224 pixels high. The image size conversion means 19 converts the frame or field image (display screen) A to fit within this image size.
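The target dimensions for such a conversion can be computed as in the sketch below. It assumes that the aspect ratio is preserved and that smaller images are left unscaled; the patent only states that the image must fit within the effective 240 by 224 pixel area. The function name is hypothetical, and integer arithmetic is used to avoid rounding surprises.

```python
def fit_within(width, height, max_w=240, max_h=224):
    """Return (w, h) scaled down to fit max_w x max_h, keeping aspect ratio."""
    if width <= max_w and height <= max_h:
        return width, height                  # already fits; do not enlarge
    if width * max_h >= height * max_w:       # width is the limiting side
        return max_w, max(1, height * max_w // width)
    return max(1, width * max_h // height), max_h

print(fit_within(640, 480))   # a 4:3 source frame
print(fit_within(240, 320))   # a full QVGA portrait image
```

The actual pixel resampling would be done by whatever image library the device uses; only the size calculation is shown here.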

Next, this embodiment will be described in detail with reference to FIGS. 2 and 3. The microphone M and the camera C are connected to the monitoring device 10 by communication cables, and the sound captured by the microphone M and the video of the monitoring area captured by the camera C are recorded on a recorder RE such as a VTR, HDD recorder, or DVD recorder. A monitor W is connected to the monitoring device 10, and an input device such as a keyboard is connected through a serial signal processing unit 21 so that the monitoring device 10 can be controlled from the input device.

The voice/non-voice recognition means 11 consists of: a voice/non-voice capture unit 11a, which amplifies the voice or non-voice sound picked up by the microphone M in or around the monitoring area and converts it into a digital signal with an A/D converter; a voice analysis unit 11b, which performs short-time spectrum analysis, the preprocessing step of speech recognition, on the digitized voice/non-voice, for example with a bank of about 10 to 30 band-pass filter channels or by computing the spectrum directly with a fast Fourier transform (FFT), and decomposes the voice/non-voice into phonemes; a voice recognition processing unit 11c, which receives the phoneme time-series data from the voice analysis unit 11b and recognizes the voice/non-voice; and a feature database unit 11d, which stores characteristic parameters of phonemes and supplies them to the voice recognition processing unit 11c as teacher signals. In the voice analysis unit 11b, non-voice undergoes the same short-time spectrum analysis as voice and is decomposed into phonemes; like consonants and vowels, non-voice sounds have characteristic patterns.
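The short-time spectrum analysis step can be sketched as follows. The sketch computes a plain discrete Fourier transform of one frame (an FFT would give the same values faster) and sums the power into equal-width bands. The 12 kHz sampling rate, the 10 ms frame, the 16 bands, and the linear band spacing are assumptions chosen for illustration, not values fixed by this passage.

```python
import cmath
import math

def band_energies(frame, n_bands=16):
    """Power of one audio frame, grouped into n_bands equal-width bands
    between 0 Hz and the Nyquist frequency."""
    n = len(frame)
    half = n // 2
    bands = [0.0] * n_bands
    for k in range(half):
        # DFT bin k (direct evaluation; an FFT computes the same quantity)
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        bands[k * n_bands // half] += abs(acc) ** 2
    return bands

rate = 12000                                   # assumed sampling rate in Hz
frame = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(120)]  # 10 ms
bands = band_energies(frame)
print(bands.index(max(bands)))                 # band holding the 1 kHz tone
```

Running this on successive frames yields one band-energy vector per frame, which is the kind of time-frequency pattern a phoneme recognizer takes as input.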

The voice recognition processing unit 11c is built, for example, from a known time-delay neural network (TDNN); FIG. 4 outlines its structure. The TDNN recognizes phonemes, the units of sound recognition, and voice/non-voice can be recognized from the time-series data of those phonemes. In the voice recognition processing unit 11c the phonemes are divided into groups and a TDNN is provided for each group. Time-series data obtained by spectrum analysis of the voice/non-voice is supplied from the voice analysis unit 11b to the input layer I of each TDNN, the first and second hidden layers H1 and H2 discriminate among the groups to identify the phoneme, and the identified phoneme is output from the output layer O. Japanese has 24 phonemes in total (b, d, g, p, t, k, m, n, N, s, sh, h, z, ch, ts, r, w, y, a, i, u, e, o, and Q (silence)). Phonemes are classified into vowels and consonants, and consonants are further divided into voiced and unvoiced consonants. Voiced consonants include plosives, fricatives, liquids, and nasals; unvoiced consonants include plosives, affricates, and fricatives. A TDNN is provided for each group. The voice recognition processing unit 11c recognizes non-voice as well as voice: because the voice analysis unit 11b also decomposes non-voice such as a security buzzer or a scream into phonemes, the voice recognition processing unit 11c can recognize non-voice from its characteristic patterns.

The TDNN in FIG. 4 illustrates the group that recognizes the consonants /b/, /d/, and /g/. It is a multilayer perceptron consisting of an input layer I, first and second hidden layers H1 and H2, and an output layer O. A TDNN of this type is characterized in that it can be trained by back-propagation, in which the error relative to the teacher signal supplied from the feature database unit 11d is propagated backward through the network. In the input layer I of FIG. 4, the horizontal axis is time and the vertical axis is frequency. The time axis covers 15 frames of frequency patterns obtained by spectrum analysis every 10 milliseconds, and the frequency axis divides the range 0 to 6000 Hz into 16 bands (a 16-channel spectrum), so that the time-series data forming the phoneme features for 15 frames (15 frames × 16 dimensions = 240 points) is input to the input layer I. The first hidden layer H1 contains a row of feature detectors (units), each detecting local features across 3 frames (30 milliseconds) of the input layer I. The second hidden layer H2 contains a row of feature detectors, each detecting more global features across 5 frames (50 milliseconds) of the first hidden layer H1. The output layer O takes the sum over time of the outputs of the second hidden layer H2 as its output and can thereby recognize, for example, the phoneme "b". Other phoneme groups, including vowels, can be recognized by TDNNs of the same kind. The voice recognition processing unit 11c need not provide TDNNs for all phoneme groups; it is sufficient to provide TDNNs for the phoneme groups to be recognized.
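The layer structure described for FIG. 4 can be sketched as a forward pass, offered only as an illustration. The input size (15 frames × 16 bands) and the 3-frame and 5-frame windows follow the description; the number of units in H1 (8), the use of one H2 unit per phoneme (3, for b, d, and g), the sigmoid activation, and the random untrained weights are assumptions of the example, and back-propagation training is not shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def time_delay_layer(inputs, weights, biases, window):
    """Slide a window over the time axis, reusing the same weights at every position."""
    outputs = []
    for t in range(len(inputs) - window + 1):
        # Flatten the frames inside the current window into one input vector.
        patch = [v for frame in inputs[t:t + window] for v in frame]
        outputs.append([
            sigmoid(sum(w * v for w, v in zip(unit_weights, patch)) + bias)
            for unit_weights, bias in zip(weights, biases)
        ])
    return outputs


def random_weights(rng, n_units, n_inputs):
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
            for _ in range(n_units)]


def tdnn_forward(spectrogram, params):
    h1 = time_delay_layer(spectrogram, params["w1"], params["b1"], window=3)
    h2 = time_delay_layer(h1, params["w2"], params["b2"], window=5)
    # Output layer: sum each H2 unit's activation over the time axis.
    scores = [sum(step[c] for step in h2) for c in range(len(h2[0]))]
    return h1, h2, scores


rng = random.Random(0)
N_FRAMES, N_BANDS = 15, 16   # from the description of FIG. 4
N_H1, N_CLASSES = 8, 3       # assumed unit counts; 3 classes for b, d, g
params = {
    "w1": random_weights(rng, N_H1, 3 * N_BANDS),
    "b1": [0.0] * N_H1,
    "w2": random_weights(rng, N_CLASSES, 5 * N_H1),
    "b2": [0.0] * N_CLASSES,
}
spectrogram = [[rng.random() for _ in range(N_BANDS)] for _ in range(N_FRAMES)]
h1, h2, scores = tdnn_forward(spectrogram, params)
best_class = max(range(N_CLASSES), key=scores.__getitem__)
```

With a 15-frame input, the 3-frame window yields 13 time positions in H1 and the 5-frame window yields 9 positions in H2, so each output score is a sum of 9 activations.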

In this way, in the voice recognition processing unit 11c, the time-series phoneme data is applied to the input layer I of the TDNN, and the weights between layers are set by repeated self-evaluation or by repeated comparative evaluation against the phoneme data supplied from the feature database unit 11d. Each phoneme (vowel or consonant) is then recognized, the recognition results are output from the output layer O as a time series of phonemes, and voice/non-voice can be recognized from that series in the subsequent stage. Known software or devices, such as a TDNN, are used for the voice recognition processing unit 11c.

The control processing unit 12 performs arithmetic processing on a CPU and is described with reference to the functional block diagram of FIG. 2. The voice/non-voice information output from the voice recognition processing unit 11c is input to the abnormal voice/abnormal sound identification means 12a. The identification means 12a examines this input and recognizes the voices and non-voice sounds that warrant an alert in the monitoring area concerned as intelligible abnormal voice/abnormal sound information (abnormal voice: "Thief!", "Fire!", "Help!", "Stop!", and the like; abnormal sound: a security buzzer, a scream, and the like). It then sends out voice identification information, which is used for voice synthesis and conversion to text, together with an alarm signal (abnormal-condition detection signal).
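By way of illustration only, the decision made by the abnormal voice/abnormal sound identification means 12a can be pictured as a lookup against a watch list that depends on the kind of monitoring area. The area types, the romanized labels, the angle-bracket labels for non-voice sounds, and the `Alarm` record are all hypothetical names introduced for this sketch; the description does not specify how the watch list is stored.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical watch lists: recognized label -> warning text to display.
# Labels in angle brackets stand for non-voice sounds.
WATCH_LIST = {
    "intrusion": {
        "dorobo": "Thief!",
        "tasukete": "Help!",
        "yamete": "Stop!",
        "<buzzer>": "Security buzzer",
        "<scream>": "Scream",
    },
    "fire": {
        "kajida": "Fire!",
        "tasukete": "Help!",
    },
}


@dataclass
class Alarm:
    area: str          # where the alarm occurred
    label: str         # recognized voice/non-voice label
    warning_text: str  # text later synthesized as speech and shown on screen


def identify_abnormal(area: str, area_type: str, recognized: str) -> Optional[Alarm]:
    """Return an Alarm if the recognized label is on the area's watch list."""
    text = WATCH_LIST.get(area_type, {}).get(recognized)
    if text is None:
        return None  # ordinary sound: no alarm signal
    return Alarm(area=area, label=recognized, warning_text=text)


fire_alarm = identify_abnormal("E1", "fire", "kajida")
ignored = identify_abnormal("E1", "fire", "dorobo")
```

Keeping a separate list per area type reflects the point that the sounds to be treated as abnormal differ between monitoring purposes.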

Meanwhile, the time data sending means 12d receives date and time information (year, month, day, and time) from the time generation unit 12c and outputs it in response to the alarm signal that the abnormal voice/abnormal sound identification means 12a issues on the basis of the abnormal voice information. As described later, this date and time information, together with the alarm location and the voice identification information (warning information), is superimposed on or added to a video signal in a predetermined output format and output as a warning display video. The warning display video is shown on the monitor W and, via the Internet N, on the screens of the client computer PC and the mobile phone terminal T.

The voice identification information from the abnormal voice data sending means 12b is input to the voice synthesis processing unit 13a, where an existing voice synthesis means generates analog speech; the synthesized analog signal is amplified by the voice output unit 13b and output to the speaker S. The voice identification information from the abnormal voice data sending means 12b is also input to the character data output unit 14, which converts it into character data and outputs that data to the character signal synthesis unit 18, the image conversion server unit 19, and the network processing unit 20.

Turning to the video processing system, the image processing means 15 consists of a video capture unit 15a and a video processing unit 15b. The video capture unit 15a is an input interface that takes in the analog video signal (NTSC or the like) of the monitoring area shot by the camera C. The video processing unit 15b decodes the NTSC analog video signal, A/D-converts and compresses it into compressed digital data, and sends the data to the image memory (image recording means) 16.

The image memory 16 consists of a plurality of frame or field memories, into which the compressed digital data of the monitoring video is repeatedly overwritten. The video output processing unit 17 reads the compressed video data recorded in the image memory 16, decompresses it, encodes it as NTSC, and passes it to the character signal synthesis unit (video synthesis means) 18, which superimposes a warning message or the like on the video signal and outputs it to the monitor W. The character signal synthesis unit 18 also superimposes, as character information, the information needed for monitoring, such as warning information indicating the abnormal state of the monitoring area, the date and time of occurrence, and the monitoring area (location), on the video signal output to the monitor W.

As shown in FIG. 5, at time t₁, when the abnormal voice/abnormal sound identification means 12a starts abnormality recognition processing (the point at which the alarm signal is generated) or completes it, the control processing unit 12 stops the overwriting of the image memory 16. It then goes back through the consecutively recorded images by the recognition processing time t₀ to the recorded image F₀ at the moment the abnormality occurred, has the video output processing unit 17 process that image, and outputs it through the character signal synthesis unit 18 to the monitor W as the video or image of the abnormality. The control processing unit 12 can also make several images before and after the recorded image F₀ available for output on the mobile phone terminal T or on the client computer PC of the monitoring device 10.
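The handling of the image memory 16 at the time of an alarm can be illustrated with a ring buffer, given here as a minimal sketch rather than the actual implementation. The class name `FrameRing`, the buffer capacity, and the expression of the recognition processing time t₀ as a number of frames (`lookback`) are assumptions of the example.

```python
from collections import deque


class FrameRing:
    """Fixed-size frame store that is overwritten cyclically until an alarm freezes it."""

    def __init__(self, capacity):
        self.frames = deque(maxlen=capacity)  # oldest frame is dropped when full
        self.frozen = False

    def record(self, frame):
        """Store a new frame unless overwriting has been stopped."""
        if not self.frozen:
            self.frames.append(frame)

    def freeze_and_select(self, lookback, context):
        """Stop overwriting at t1, step back `lookback` frames to F0,
        and return F0 plus up to `context` frames on either side."""
        self.frozen = True
        index = len(self.frames) - 1 - lookback
        if index < 0:
            raise ValueError("lookback exceeds the frames held in memory")
        stored = list(self.frames)
        lo = max(0, index - context)
        hi = min(len(stored), index + context + 1)
        return stored[index], stored[lo:hi]


ring = FrameRing(capacity=10)
for frame_number in range(25):   # frames 0..24; only 15..24 remain stored
    ring.record(frame_number)

f0, around_f0 = ring.freeze_and_select(lookback=3, context=2)
ring.record(25)                  # ignored: overwriting was stopped at the alarm
```

In this run the newest stored frame is 24, so stepping back 3 frames selects frame 21 as F₀, with frames 19 to 23 as the surrounding images. Because recognition takes time, the frames after F₀ are already in memory when the alarm is raised, which is what allows images both before and after the event to be retrieved.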

The image conversion server unit (image size conversion means) 19 converts the video signal into an image format (JPEG or the like) and an image size (QVGA or the like) that can be displayed on the screen of the mobile phone terminal T. The information needed for monitoring, such as the abnormal-state information, the time of occurrence, and the monitoring area (location), is added as character information in a predetermined format, and the result is sent to the network processing unit (means for sending to the network) 20.
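As a small illustration of the size conversion, the following sketch computes target dimensions that fit within QVGA (320 × 240 pixels) while preserving the aspect ratio. The function name `fit_within` and the 720 × 480 source size are assumptions of the example; the actual resampling and JPEG encoding, and any correction for non-square pixels, are outside its scope.

```python
QVGA = (320, 240)  # width, height in pixels


def fit_within(width, height, box=QVGA):
    """Return (new_width, new_height) scaled down to fit inside `box`,
    keeping the aspect ratio. Images that already fit are left unchanged."""
    max_w, max_h = box
    if width <= max_w and height <= max_h:
        return width, height
    if width * max_h >= height * max_w:
        # Width is the limiting dimension.
        return max_w, max(1, height * max_w // width)
    # Height is the limiting dimension.
    return max(1, width * max_h // height), max_h


phone_size = fit_within(720, 480)  # assumed digitized source frame size
```

Integer arithmetic is used so that the result is always a whole number of pixels and never exceeds the target box.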

The network processing unit (sending means) 20 has the function of delivering (uploading) the image taken when the alarm occurred, with the information needed for monitoring (warning information on the abnormal state, the date and time of occurrence, the monitoring area (location), and so on) attached as character information, to the client computer PC and the mobile phone terminal T via the Internet N in an output format that can be displayed in a WWW browser.
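One way to picture a browser-displayable output format is a small HTML page that carries the alarm image together with the warning text, location, and date and time. This is a sketch under assumptions: the description does not specify the page layout, and the function name `build_alert_page`, the file name, and the sample values are hypothetical.

```python
from datetime import datetime
from html import escape


def build_alert_page(image_url, warning, location, occurred_at):
    """Return a minimal HTML page showing the alarm image and its character information."""
    stamp = occurred_at.strftime("%Y-%m-%d %H:%M:%S")
    return (
        "<!DOCTYPE html>\n"
        "<html><head><meta charset=\"utf-8\"><title>Alarm</title></head><body>\n"
        # Escape every inserted value so that it cannot break the markup.
        f"<p>{escape(stamp)} | {escape(location)} | {escape(warning)}</p>\n"
        f"<img src=\"{escape(image_url, quote=True)}\" alt=\"Image at the time of the alarm\">\n"
        "</body></html>\n"
    )


page = build_alert_page(
    image_url="alarm_0001.jpg",             # hypothetical file name
    warning="Help!",
    location="Area E1 <lobby>",             # hypothetical location label
    occurred_at=datetime(2006, 2, 27, 13, 5, 0),
)
```

The same page can be served to both the client computer PC and the mobile phone terminal T, with only the embedded image differing in size.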

The monitoring device 10 is connected to the Internet N via a router R, the client computer PC is connected to the Internet N via a router R, and the mobile phone network D, to which the mobile phone terminal T is connected, is likewise connected to the Internet N via a router R, so the client computer PC and the mobile phone terminal T can monitor the state of the monitoring area even from a remote location. A router analyzes the packet data flowing over the network according to protocol, decides which route each packet should take, and forwards it. It can also be configured with data transmission routes and can block unrelated traffic, which is effective in improving the security of the system.

(Embodiment 2)
Next, another embodiment of the present invention is described with reference to FIGS. 7 to 9. FIG. 7 is a block diagram showing the other embodiment, FIG. 8 is a functional block diagram of the other embodiment, and FIG. 9 is a block diagram of the monitoring system of the other embodiment.

As shown in FIG. 7, this embodiment differs from the one above in that a camera C and a microphone M are installed in each monitoring area and are provided with a sending means 22 that sends the sound and video captured in the monitoring area to the Internet N. The sending means 22 converts the video and audio of each monitoring area into a form that can be found and viewed with a WWW browser and automatically uploads it to a Web server; the camera C is a so-called network camera. In this embodiment the monitoring device 10 need not be installed in the monitoring area and is instead installed at a remote location reached via the Internet N, which is the difference from the embodiment above. The mobile phone terminal T is the same as in the embodiment above. Description of parts identical to the embodiment above is omitted wherever possible.

The voice/non-voice recognition means 11 acquires the voice/non-voice picked up by the microphone M as a packet signal received from the network sending means 20 and recognizes its content. When the voice or non-voice is recognized as abnormal, the abnormal voice/abnormal sound is input to the abnormal voice/abnormal sound identification means 12a of the control processing unit 12. As described for the embodiment above, it is then converted into speech or warning character information, announced through the speaker S, and shown on the monitor W as a warning display video in which the character information is superimposed on the monitoring video. A warning display image can likewise be shown on the mobile phone terminal T. The image acquisition means 15a′ acquires the digitized image signal and records it in the image storage means 16, and the warning character information is superimposed after the same image processing as in the embodiment above.

In this embodiment, a camera and microphone pair is installed in each monitoring area, and only when an abnormal voice or abnormal sound is detected are the sound and image from them presented, via the Internet N, on the client computer PC or the mobile phone terminal T as warning text and an image. Compared with the embodiment above, there is no need to install a monitoring device in each monitoring area, and many monitoring areas can be monitored by a small number of staff.

As described above, the present invention uses a neural network to recognize, from features extracted from the voice and non-voice sounds produced on site, an abnormality occurring in the monitoring area; if the recognition result indicates an abnormality, it serves as the trigger for warning the observer. The neural network's recognition result for the extracted features is also converted into character information, added to the monitoring image from the time of the abnormality, and distributed. The monitoring image recorded when the abnormal state occurred is sent to the monitor of a dedicated observer and can at the same time be distributed, over the Internet or a mobile phone network, to a client computer PC or mobile phone terminal located away from the monitored site.

The content of the abnormal voice in an abnormal state differs depending on whether the system monitors for fire, monitors for intruders, or is intended for nursing care, so the voice/non-voice recognition means must recognize different abnormal voices or abnormal sounds in each case. Configuring the voice/non-voice recognition means as a neural network makes it possible, through learning, to provide a versatile monitoring system.

As an example of use, the present invention can serve as a monitoring system in which, even when the observer is away from the monitoring area, the abnormal voice is displayed as warning text alongside the monitoring video or image, so that the nature of the abnormal situation is recognized automatically. It can thus be applied as a monitoring system that handles a variety of abnormal situations.

FIG. 1 is a block diagram showing one embodiment of the monitoring system of the present invention.
FIG. 2 is a functional block diagram of the monitoring device of this embodiment.
FIG. 3 is a block diagram of the monitoring system of this embodiment.
FIG. 4 is a diagram of a neural network showing an example of the phoneme recognition processing unit in this embodiment.
FIG. 5 is an explanatory diagram of the method for selecting the image at the time an abnormality occurs in this embodiment.
FIG. 6 is a diagram showing the display screen of the mobile phone in this embodiment.
FIG. 7 is a block diagram showing another embodiment of the present invention.
FIG. 8 is a functional block diagram of the monitoring device of the embodiment of FIG. 7.
FIG. 9 is a block diagram of the monitoring system of the embodiment of FIG. 7.
FIG. 10 is a block diagram showing a conventional monitoring system.

Explanation of symbols

10 Monitoring device
11 Voice/non-voice recognition means
11a Voice/non-voice capture unit
11b Voice analysis unit
11c Voice recognition processing unit
11d Feature database unit
12 Control processing unit
12a Abnormal voice/abnormal sound identification means
12b Abnormal voice data sending means
12c Time generation unit
12d Time data sending means
13 Abnormal voice output means
13a Voice synthesis processing unit
13b Voice output unit
14 Abnormal voice character data sending means
15 Image processing means
15a′ Image acquisition means
15a Video capture unit
15b Video processing unit
16 Image memory (image recording means)
17 Video output processing unit (video output means)
18 Character signal synthesis unit (video synthesis means)
19 Image conversion server unit (image size conversion means)
20, 22 Network processing unit (sending means)
21 Serial signal processing unit
C Camera
D Mobile phone network
E₁ to Eₙ Monitoring areas
M Microphone
N Internet
PC Client computer
R Router
RE VTR, DVD recorder, HDD recorder
S Speaker
T Mobile phone terminal
W Monitor

Claims

A monitoring system that records, in a monitoring device, images shot by a camera that photographs a monitoring area, wherein
the monitoring device is provided with a microphone that picks up sound in the monitoring area, and comprises image processing means that processes the video signal output from the camera into image data and a monitor that displays the image data,
and the monitoring device further comprises:
voice/non-voice recognition means using a time-delay neural network that has a learning function based on a teacher signal from a feature database, compares the sound picked up by the microphone with phoneme data to recognize the phonemes of the sound, and produces time-series data of the phonemes;
abnormal voice/abnormal sound identification means for recognizing a predetermined voice/non-voice from the time-series phoneme data supplied by the voice/non-voice recognition means and determining whether it is an abnormal voice or abnormal sound that warrants an alert;
image storage means for storing, when the abnormal voice/abnormal sound identification means detects an abnormality, the abnormality occurrence image data from the moment the abnormal voice/abnormal sound was detected in the monitoring area together with the image data preceding and following it;
abnormal voice character data output means for converting the abnormal voice/abnormal sound from the abnormal voice/abnormal sound identification means into warning character data;
abnormal voice output means for announcing the abnormal voice/abnormal sound by voice synthesis;
first sending means for adding the warning character data to the abnormality occurrence image data obtained from the image storage means and sending it to a network; and
second sending means for converting, by image size conversion means, the abnormality occurrence image data with the warning character data added into image data receivable by a mobile phone terminal and sending it to the network,
whereby, when an abnormality indicated by sound occurs in the monitoring area, the abnormality occurrence image data and the preceding and following image data are notified to the mobile phone terminal from the second sending means via the network.

The monitoring system according to claim 1, wherein the non-voice is an abnormal sound such as a scream or a noise.

The monitoring system according to claim 1 or 2, wherein the monitoring device includes time data sending means and, in addition to adding the warning character data to the abnormality occurrence image data, adds to that image data and outputs the position information of the monitoring area and the corresponding date and time of the abnormality, based on the time information from the time data sending means.