KR102374054B1 - Method for recognizing voice and apparatus used therefor

Info

Publication number: KR102374054B1
Application number: KR1020210166682A
Authority: KR
Inventors: 임국찬; 배용우; 신동엽; 신승민; 신승호; 이학순; 전진수
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2017-08-09
Filing date: 2021-11-29
Publication date: 2022-03-14
Anticipated expiration: 2037-08-09
Also published as: KR20210148057A; KR20190016851A; KR102372327B1

Abstract

A voice input method according to an embodiment is performed by a voice recognition apparatus and includes: deriving the loudness of a first sound input to a first sound input unit of the voice recognition apparatus; when information on the loudness of a second sound input to a second sound input unit of another device is received from the other device, comparing the loudness of the first sound with the loudness of the second sound; multiplying, based on the comparison, the relatively louder sound by a relatively large weight and the relatively quieter sound by a relatively small weight; and controlling voice recognition to be performed on the first sound and the second sound after each has been multiplied by its weight.

Description

Speech recognition method and apparatus used therefor

The present invention relates to a voice recognition method and an apparatus used therefor, and more particularly to a method of recognizing a voice by interworking with another device that receives voice input, and an apparatus used therefor.

A voice recognition-based interactive device may include a plurality of voice input units (e.g., microphones). When a plurality of voice input units are provided, voices originating from various directions can be collected with a high recognition rate. FIG. 1 is a diagram conceptually illustrating the configuration of an interactive device 1 that includes a plurality of voice input units 20. Referring to FIG. 1, the interactive device 1 may include a body portion 10 forming its body and a plurality of voice input units 20 mounted on the body portion 10. The plurality of voice input units 20 may be arranged directionally so that they face various directions.

FIG. 2 is a block diagram of the plurality of voice input units 20 shown in FIG. 1. Referring to FIG. 2, each of the voice input units 20 may be connected to an amplifier 21, and each amplifier 21 may be connected to a microprocessor (MCU) 22. The voice input through each voice input unit 20 is amplified by the amplifier 21 and then delivered to the microprocessor 22. After receiving the voice from each amplifier 21, the microprocessor 22 may perform voice recognition itself, or it may instead forward the voice to a separate voice recognition server so that voice recognition is performed there.

The interactive device 1 may be a stationary device that is used while fixed at a specific location. If the user is located a short distance from the interactive device 1, the voice uttered by the user is easily recognized by the interactive device 1. However, if the user is located a long distance from the interactive device 1, the voice uttered by the user is difficult for the interactive device 1 to recognize, because the voice may be distorted on its way to the interactive device 1. In addition, echoes of the user's voice, noise generated by a noise source, or echoes of the sound output by the interactive device 1 itself may distort the voice uttered by the user.

Korean Patent Laid-Open Publication No. 2010-0115783 (published October 28, 2010)

Accordingly, an object of the present invention is to provide a technique for improving the voice recognition rate when the user is located far from the voice recognition apparatus or when a noise source exists near the interactive device.

Another object is to provide a technique for improving the voice recognition rate by removing or reducing the influence of echoes caused by the sound that the voice recognition apparatus itself outputs or by the user's voice.

However, the problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned here will be clearly understood from the following description by those of ordinary skill in the art to which the present invention pertains.

A voice recognition apparatus according to an embodiment includes: a first sound input unit that receives a first sound; a voice recognition unit that derives the loudness of the first sound; a communication unit that receives, from another device including a second sound input unit, information on the loudness of a second sound input to the second sound input unit; and a control unit that compares the loudness of the first sound with the loudness of the second sound, calculates the weights so that the weight multiplied by the relatively louder sound has a relatively large value and the weight multiplied by the relatively quieter sound has a relatively small value, and controls voice recognition to be performed on the first sound and the second sound after each has been multiplied by its calculated weight.

According to an embodiment, when each of a plurality of devices receives a sound, the distance between the sound source and each device is taken into account so that each sound can be amplified by a different ratio. Therefore, even if one device is far from the sound source, the sound input to another device that is close to the sound source can be amplified more strongly before the sounds are synthesized, so the recognition rate for the sound emitted by that source can be improved.

In addition, sound distortion that may be caused by echoes can be reduced or eliminated.

FIG. 1 is a diagram conceptually illustrating the configuration of a general interactive voice recognition apparatus.
FIG. 2 is a block diagram of the voice recognition unit of the interactive voice recognition apparatus shown in FIG. 1.
FIG. 3 is a diagram conceptually illustrating the configuration of a voice recognition system to which a voice recognition apparatus according to an embodiment is applied.
FIG. 4 is a diagram conceptually illustrating the configuration of the other device shown in FIG. 3.
FIG. 5 is a diagram conceptually illustrating the configuration of the voice recognition apparatus according to the embodiment shown in FIG. 3.
FIG. 6 is a diagram conceptually illustrating the operation of a voice recognition apparatus according to an embodiment.
FIG. 7 is a diagram conceptually illustrating a first example of a situation in which a voice recognition apparatus according to an embodiment operates.
FIG. 8 is a diagram conceptually illustrating a second example of a situation in which a voice recognition apparatus according to an embodiment operates.
FIG. 9 is a diagram conceptually illustrating a third example of a situation in which a voice recognition apparatus according to an embodiment operates.
FIG. 10 is a diagram conceptually illustrating the operation of a voice recognition apparatus according to an embodiment.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of the scope of the invention, which is defined only by the scope of the claims.

In describing the embodiments of the present invention, detailed descriptions of well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the invention. The terms used below are defined in consideration of their functions in the embodiments and may vary according to the intention or custom of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

FIG. 3 is a diagram conceptually illustrating the configuration of a voice recognition system 1000 to which the voice recognition apparatus 100 according to an embodiment is applied. FIG. 3 is merely an example, and the voice recognition apparatus 100 should not be interpreted as being applicable only to the voice recognition system 1000 shown in FIG. 3.

Referring to FIG. 3, the voice recognition system 1000 may include a voice recognition server 500, a voice recognition apparatus 100, and at least one other device 200, 210. A noise source 300 that emits noise may be located in the space in which the voice recognition system 1000 is installed.

The voice recognition server 500 may be a server that extracts a voice from a sound and recognizes it. The sound processed by the voice recognition server 500 may be a sound received from the voice recognition apparatus 100 or from the other devices 200 and 210. The voice recognition server 500 may use known techniques to extract and recognize a voice from a sound, so a description of them is omitted.

The other devices 200 and 210 collectively refer to any device capable of receiving external sound. For example, the other devices 200 and 210 may be a smartphone, a smart pad, a smart watch, a remote control with a sound input function, or a speaker with a sound input function. At least one such other device 200, 210 may be provided in the voice recognition system 1000. The other devices 200 and 210 are described with reference to FIG. 4.

FIG. 4 is a diagram illustrating an example configuration of the other devices 200 and 210 shown in FIG. 3. Referring to FIG. 4, the other devices 200 and 210 may include at least one of a communication unit 201, at least one speaker 202, at least one sound input unit 203, and a control unit 204, and may also include other components not mentioned here.

The communication unit 201 may be a wireless communication module. For example, the communication unit 201 may be any one of a Bluetooth module, a Wi-Fi module, or an infrared communication module, but is not limited thereto. Through the communication unit 201, the other devices 200 and 210 can exchange voice or voice-related data with the voice recognition apparatus 100 or the voice recognition server 500.

The speaker 202 is a component that outputs sound to the outside. The speaker 202 used in the other devices 200 and 210 may be an ordinary speaker mounted on a circuit board, so a description of it is omitted.

The sound input unit 203 is a component that receives sound, such as a microphone, and may also include a component that amplifies the received sound. The sound received by the sound input unit 203 may include a human voice, sound generated by objects, and noise generated by the noise source 300, but is not limited thereto.

A plurality of sound input units 203 may be provided in the other devices 200 and 210. The plurality of sound input units 203 may be arranged and operated directionally so that they face various directions, and may be operated selectively by the control unit 204 described below.

The control unit 204 may be implemented by a memory that stores instructions programmed to perform the functions described below and a microprocessor that executes those instructions. The control unit 204 is described in detail below.

When a plurality of sound input units 203 are provided, the control unit 204 may selectively operate at least one of them.

The control unit 204 may also extract information from the sound input to the sound input unit 203. The extracted information may include, but is not limited to, the time at which the sound was input to the sound input unit 203 and the frequency or loudness of that sound.
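
The kind of information listed above (input time, loudness, frequency) can be illustrated with a minimal sketch. The source does not say how these values are computed, so everything here is an assumption: `SoundInfo` and `extract_sound_info` are hypothetical names, loudness is taken as RMS amplitude, and frequency is a rough zero-crossing estimate that is only meaningful for a clean, tone-like signal.

```python
import math
import time
from dataclasses import dataclass


@dataclass
class SoundInfo:
    input_time: float    # when the sound arrived (seconds since the epoch)
    magnitude: float     # RMS amplitude of the samples (assumed loudness measure)
    frequency_hz: float  # rough frequency estimate from zero crossings


def extract_sound_info(samples, sample_rate, input_time=None):
    """Summarize one block of mono samples (hypothetical helper)."""
    if not samples:
        raise ValueError("samples must not be empty")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Count sign changes; a pure tone crosses zero twice per cycle.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration = len(samples) / sample_rate
    frequency = crossings / (2 * duration)
    if input_time is None:
        input_time = time.time()
    return SoundInfo(input_time, rms, frequency)


# Usage: a 440 Hz tone with amplitude 0.5, sampled at 8 kHz for one second.
rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
info = extract_sound_info(tone, rate, input_time=0.0)
```

For real speech, a spectral estimate would be more robust than zero crossings; the sketch only shows what a per-sound information record might contain.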

When the sound contains noise, the control unit 204 may extract the noise from the sound, identify its characteristics, and generate a canceling sound that cancels the noise based on those characteristics. The techniques by which the control unit 204 extracts noise from a sound, identifies its characteristics, and generates a canceling sound based on them are already known, so a description of them is omitted.
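
The source treats canceling-sound generation as known technology and gives no algorithm. Purely as an illustration, and assuming the noise waveform has already been isolated and is perfectly time-aligned with the incoming sound, one textbook approach is phase inversion: the canceling sound is the sample-wise negation of the noise. The function name below is hypothetical.

```python
def canceling_sound(noise_samples):
    """Return the phase-inverted copy of an (assumed known) noise waveform."""
    return [-s for s in noise_samples]


# Usage: adding the canceling sound to the noise leaves silence.
noise = [0.25, -0.5, 0.125, 0.0]
anti_noise = canceling_sound(noise)
residual = [n + a for n, a in zip(noise, anti_noise)]
```

Real systems must also estimate the noise and compensate for delay, which this sketch deliberately ignores.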

The noise source 300 refers to a sound source that generates noise. The noise may include white noise or other kinds of noise.

The voice recognition apparatus 100 may be a device that recognizes the voice uttered by a user 400 and provides an interactive service in response to the recognized voice. The voice recognition apparatus 100 may also control the other devices 200 and 210 so that they receive the voice uttered by the user 400. The configuration of the voice recognition apparatus 100 is described below.

FIG. 5 is a diagram illustrating an example configuration of the voice recognition apparatus 100 shown in FIG. 3. Referring to FIG. 5, the voice recognition apparatus 100 may include a communication unit 110, a speaker 120, a sound input unit 130, a synthesis unit 140, a storage unit 150, a voice recognition unit 160, a processing unit 170, and a control unit 180. Unlike what is shown in FIG. 5, it may omit at least one of these or may further include components not shown in the drawing.

The communication unit 110 may be a wireless communication module. For example, the communication unit 110 may be any one of a Bluetooth module, a Wi-Fi module, or an infrared communication module, but is not limited thereto. Through the communication unit 110, the voice recognition apparatus 100 can exchange voice or voice-related data with the voice recognition server 500 or the other devices 200 and 210.

The speaker 120 is a component that outputs sound to the outside. The speaker 120 used in the voice recognition apparatus 100 may be an ordinary speaker, so a description of it is omitted.

The sound input unit 130 is a component that receives sound, such as a microphone, and the term may also cover a component that amplifies the received sound. The sound received by the sound input unit 130 may include a human voice, sound generated by objects, and noise generated by the noise source 300, but is not limited thereto.

A plurality of sound input units 130 may be provided in the voice recognition apparatus 100. The plurality of sound input units 130 may be arranged and operated directionally so that they face various directions, and their operation may be controlled by the control unit 180 described below.

The synthesis unit 140 is a component that synthesizes a plurality of sounds and may include generally known components such as filters. The synthesis unit 140 may synthesize the sound input to the sound input unit 130 of the voice recognition apparatus 100 with the sound input to the sound input unit 203 of the other devices 200 and 210. If the voice recognition apparatus 100 has a plurality of sound input units 130, or the other device 200 has a plurality of sound input units 203, the synthesis unit 140 may synthesize the sounds input to these sound input units 130 and 203.

The synthesis unit 140 may multiply each sound by a weight and then synthesize the weighted sounds. A weight may have a positive (+) or negative (-) value. A positive value may be used to strengthen the voice characteristics of the corresponding sound input unit 203 when synthesizing the voice, and a negative value may be used to weaken them. Multiplying a sound by a weight in the synthesis unit 140 may mean passing the sound through an amplifier or the like, so the synthesis unit 140 may include an amplifier or the like as a component. A weighted sound becomes louder or quieter according to the weight applied. The weights may be received from the control unit 180 described below.
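
The weighted synthesis described above can be sketched as a weighted sum of sample streams. This is a minimal illustration, assuming the sounds are already time-aligned, equal-length lists of samples; the function name `synthesize` is hypothetical, and the source describes the multiplication as happening in an amplifier rather than in software.

```python
def synthesize(sounds, weights):
    """Mix equal-length sample lists after scaling each by its weight."""
    if len(sounds) != len(weights):
        raise ValueError("need exactly one weight per sound")
    if not sounds:
        return []
    length = len(sounds[0])
    if any(len(s) != length for s in sounds):
        raise ValueError("all sounds must have the same number of samples")
    mixed = [0.0] * length
    for samples, weight in zip(sounds, weights):
        for i, sample in enumerate(samples):
            # A positive weight reinforces this input; a negative one subtracts it.
            mixed[i] += weight * sample
    return mixed


# Usage: reinforce the nearer input and subtract the farther one.
near = [0.5, -0.5, 0.25]
far = [0.25, 0.25, -0.5]
mixed = synthesize([near, far], [2.0, -1.0])
```

A weight above 1 makes that input louder in the mix, a weight between 0 and 1 makes it quieter, and a negative weight subtracts it.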

The storage unit 150 is a component that stores data and may be implemented as a memory or the like. The data stored in the storage unit 150 may include, for example, a wake-up signal, the IDs of the sound input units 130 and 203, or the sound output through the speaker 120, but is not limited thereto. The wake-up signal may have predetermined frequency characteristics. Once the wake-up signal is recognized by the voice recognition apparatus 100, the voice subsequently uttered by the user 400 may be recognized as a command.

The voice recognition unit 160 is a component that extracts a voice from a sound and recognizes its characteristics (e.g., the loudness or frequency of the sound, or the time at which the sound was input). The voice recognition unit 160 may be implemented by a memory that stores instructions programmed to recognize a voice from a sound and a microprocessor that executes those instructions.

The sound recognized by the voice recognition unit 160 may be a sound input to the sound input unit 130 or a sound received from the other devices 200 and 210.

The voice recognition unit 160 may extract the wake-up signal described above from a sound and recognize it.

The voice recognition unit 160 may also recognize commands uttered by the user 400 in addition to the wake-up signal. Alternatively, the voice recognition unit 160 may recognize only the wake-up signal and not the commands uttered by the user 400, in which case recognition of the user's commands may be performed by the voice recognition server 500.

The voice recognition unit 160 may extract noise from a sound and recognize its characteristics. The algorithms used by the voice recognition unit 160 for this purpose are known, so a description of them is omitted.

The processing unit 170 is a component that provides an interactive service to the user 400. It may be implemented by a memory that stores instructions programmed to provide the interactive service and a microprocessor that executes those instructions. The processing unit 170 provides the interactive service using already known algorithms, so a description of them is omitted.

Depending on the embodiment, the processing unit 170 may not be included in the voice recognition apparatus 100. In that case, the interactive service provided to the user 400 may be generated by the voice recognition server 500 and delivered to the voice recognition apparatus 100.

The control unit 180 may be implemented by a memory that stores instructions programmed to perform the functions described below and a microprocessor that executes those instructions. The control unit 180 is described in detail below.

The control unit 180 may search for other devices 200 and 210 located around the voice recognition apparatus 100. For example, when the communication unit 110 is implemented as a Bluetooth module, the control unit 180 may control the search using the Bluetooth connection history or the like.

When the search is complete, the control unit 180 may connect the discovered other devices 200 and 210 to the voice recognition apparatus 100.

Once communication is established, the control unit 180 may list the ID of each of the at least one sound input unit 203 included in each connected other device 200, 210 and the ID of each of the at least one sound input unit 130 included in the voice recognition apparatus 100. The listed information may be stored in the storage unit 150.
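
The listing step above can be sketched as building one flat table of sound input unit IDs across the apparatus and the connected devices. The source does not specify the ID format or the data structure, so the names and ID strings below are hypothetical placeholders.

```python
def list_up_input_ids(local_ids, remote_devices):
    """Return (device, input_id) pairs for every known sound input unit.

    local_ids: IDs of the apparatus's own sound input units.
    remote_devices: mapping of connected device name -> list of its input IDs.
    """
    listing = [("voice_recognition_apparatus_100", i) for i in local_ids]
    for device_name in sorted(remote_devices):
        for input_id in remote_devices[device_name]:
            listing.append((device_name, input_id))
    return listing


# Usage with illustrative IDs (not taken from the source).
listing = list_up_input_ids(
    ["mic-a", "mic-b"],
    {"other_device_210": ["mic-a"], "other_device_200": ["mic-a", "mic-b"]},
)
```

Keeping the device name next to each ID lets the same ID string be reused on different devices without ambiguity.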

The control unit 180 may request information about a sound from the connected other devices 200 and 210. In response to this request, information about the sound may be received from each of the other devices 200 and 210 through the communication unit 110. The information may include, but is not limited to, the time at which the sound was input to the sound input unit 203, the frequency or loudness of the sound input to the sound input unit 203, and the loudness or frequency of the noise contained in the sound.

The control unit 180 may calculate the weights by which the sounds are multiplied in the synthesis unit 140. The weights calculated by the control unit 180 are delivered to the synthesis unit 140, which multiplies the sounds by these weights and then synthesizes them. FIG. 6 illustrates this process: the sounds input to the sound input units 103a and 103b and to the sound input units 203a and 203b are delivered to the synthesis unit 140, the weights calculated by the control unit 180 are delivered to the synthesis unit 140, and the synthesis unit 140 multiplies the sounds by the weights and synthesizes them. The control unit 180 may control the synthesized sound so that voice recognition is performed on it. For example, if the synthesized sound is a wake-up signal, it may be recognized by the voice recognition unit 160; if the synthesized sound was input after the wake-up signal was recognized, it may be forwarded to the voice recognition server 500 through the communication unit 110.
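
The routing of the synthesized sound described above (wake-up signal recognized locally, later sound forwarded to the server) can be sketched as a two-state machine. This is an illustration under assumptions: `RecognitionRouter` and the returned labels are hypothetical, the wake-up detector is passed in as a plain function, and returning to the waiting state after a single forwarded sound is a simplification the source does not state.

```python
class RecognitionRouter:
    """Decide where a synthesized sound goes, based on wake-up state."""

    def __init__(self, is_wake_up):
        self._is_wake_up = is_wake_up  # callable: sound -> bool (assumed detector)
        self._awake = False

    def route(self, synthesized_sound):
        if not self._awake:
            if self._is_wake_up(synthesized_sound):
                self._awake = True
                return "wake_up_recognized_locally"
            return "ignored"
        # Sound arriving after the wake-up signal is treated as a command.
        self._awake = False
        return "forwarded_to_server"


# Usage: strings stand in for synthesized audio in this sketch.
router = RecognitionRouter(lambda sound: sound == "wake")
results = [router.route(s) for s in ["noise", "wake", "turn on the light", "hello"]]
```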

Hereinafter, how the controller 180 calculates the weights is described by way of example.

The controller 180 may calculate the weights according to the loudness of the sounds. For example, the controller 180 may calculate the weights so that a relatively loud sound is multiplied by a relatively large weight and a relatively quiet sound is multiplied by a relatively small weight. This is described in more detail with reference to FIG. 7.

FIG. 7 illustrates a situation in which the user 400 is at different distances from the voice input device 100 and the other device 200. Referring to FIG. 7, the user 400 is located closer to the other device 200 than to the voice input device 100. The voice uttered by the user 400 will therefore be input more loudly at the other device 200 than at the voice input device 100. Of the sounds input to the other device 200 and the voice input device 100, the controller 180 will calculate the weights so that the weight multiplied by the sound input to the other device 200 has the relatively larger value. Depending on the embodiment, the controller 180 may assign the largest weight to the sound input unit 203b, which faces the user 400, among the sound input units 203a and 203b of the other device 200, because the sound input to the sound input unit 203b will be louder than the sound input to the sound input unit 203a.

That is, according to an embodiment, when the weights by which the sounds are multiplied are calculated, different weights may be calculated according to the loudness of the sound input to each of the different devices (the voice recognition apparatus 100 and the other devices 200 and 210). Depending on the embodiment, different weights may also be calculated for a plurality of sound input units provided in a single device, according to the loudness of the sound input to each of them.
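The loudness-based weighting described above can be sketched as follows. This is only an illustration under stated assumptions: the description does not give a formula, so the use of RMS level as the loudness measure, the normalization of the weights to sum to 1, and the names `rms` and `loudness_weights` are all hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of one channel (assumed loudness measure)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def loudness_weights(channels):
    """Return one weight per channel, proportional to its RMS level."""
    levels = {name: rms(samples) for name, samples in channels.items()}
    total = sum(levels.values())
    if total == 0:
        # All channels are silent: fall back to equal weights.
        return {name: 1.0 / len(levels) for name in levels}
    return {name: level / total for name, level in levels.items()}


# Hypothetical captures: the user is nearer the other device 200,
# and sound input unit 203b faces the user.
channels = {
    "130a": [0.1, -0.1, 0.1, -0.1],
    "203a": [0.2, -0.2, 0.2, -0.2],
    "203b": [0.4, -0.4, 0.4, -0.4],
}
weights = loudness_weights(channels)
```

With these sample values the channel for sound input unit 203b receives the largest weight and the channel for 130a the smallest, matching the ordering described for FIG. 7.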

Accordingly, when each of a plurality of devices receives a sound, each sound may be amplified in consideration of the distance between the sound source and the respective device. Even if one device is far from the sound source, as long as another device is close to it, the sound input to that other device can be amplified more strongly before synthesis, so the recognition rate for the sound emitted by the source can be improved.

Meanwhile, after calculating the weights according to the loudness of the sounds, the controller 180 may change these weights by the following methods.

For example, as a first method of changing the weights, the controller 180 may change the sign of the weight to negative (minus) for the sound input unit that received the sound last among the plurality of sound input units 130 and 203. Referring again to FIG. 7, suppose that when the user 400 utters a sound, the sound input unit the sound reaches last, among the sound input units 203a, 203b, 130a, 130b, 130c and 130d, is the sound input unit 130a. The sound may have been input to the sound input unit 130a last because it is a reverberation. That is, the sound of the user 400 reaches the other sound input units 130b, 130c, 130d, 203a and 203b directly, whereas the sound input unit 130a receives the sound of the user 400 after it has been reflected from the surroundings (for example a wall, a ceiling or an object), and therefore receives it last. The controller 180 may change the sign of the weight from positive (plus) to negative (minus) for the sound input unit 130a that received the sound last. When a sound multiplied by a negative weight is synthesized by the synthesizer 140, the characteristics of that sound are weakened, so the distortion caused by the reverberation can be reduced or eliminated.

To this end, the controller 180 may obtain information such as the time at which a sound was input to the sound input units 130a, 130b, 130c and 130d and the frequency of that sound, and may likewise obtain information such as the time at which a sound was input to each of the sound input units 203a and 203b of the other devices 200 and 210 and the frequency of that sound. Based on the information obtained in this way, the controller 180 may determine which sound input unit received a given sound last.

That is, according to an embodiment, the weights are changed so that, for the same sound, the sound input unit the sound reached last is multiplied by a weight having a negative (minus) sign, so that the sound distortion that reverberation may cause is reduced or eliminated.
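A minimal sketch of the first method is shown below. It assumes that arrival times for the same sound are already available per sound input unit, as the description states the controller 180 may obtain them; the function name `negate_latest` and the sample values are hypothetical.

```python
def negate_latest(weights, arrival_times):
    """Flip to negative the weight of the channel that received the sound last.

    weights:       channel name -> weight calculated from loudness
    arrival_times: channel name -> arrival time of the same sound, in seconds
    """
    latest = max(arrival_times, key=arrival_times.get)
    adjusted = dict(weights)  # leave the caller's weights untouched
    adjusted[latest] = -abs(adjusted[latest])
    return adjusted


# Hypothetical values: sound input unit 130a receives a reflection and is last.
weights = {"130a": 0.2, "130b": 0.3, "203b": 0.5}
arrival_times = {"130a": 0.031, "130b": 0.012, "203b": 0.009}
adjusted = negate_latest(weights, arrival_times)
```

Only the weight for 130a changes sign here; the weights of the channels that received the direct sound are left as calculated.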

As a second method of changing the weights, the controller 180 may change the weights in consideration of the noise generated by a nearby noise source 300. This method can be performed while a wake-up signal is being recognized and is described with reference to FIG. 8.

FIG. 8 illustrates a situation in which the user 400 is located between the voice recognition apparatus 100 and the other device 200, and the noise source 300 is located closer to the other device 200 than to the voice recognition apparatus 100. Assume that in FIG. 8 the loudness of the sound input to the voice recognition apparatus 100 is the same as the loudness of the sound input to the other device 200. In this case, according to the foregoing, the controller 180 would have to calculate weights of the same value for the sound input unit 130 of the voice recognition apparatus 100 and the sound input unit 203 of the other device 200.

However, the sound input to the other device 200 contains relatively more noise than the sound input to the voice recognition apparatus 100, because the other device 200 is located closer to the noise source 300 than the voice recognition apparatus 100 is. In terms of recognition rate, it is therefore advantageous to perform voice recognition on the sound input to the voice recognition apparatus 100 rather than on the sound input to the other device 200.

In view of this, while the wake-up signal is being recognized, the controller 180 calculates the similarity between the sound input to the sound input unit 130 of the voice recognition apparatus 100 and the wake-up signal, and the similarity between the sound input to the sound input unit 203 of the other device 200 and the wake-up signal. The controller 180 may then calculate the weights so that the sound with the relatively high similarity is multiplied by a larger weight and the sound with the relatively low similarity is multiplied by a smaller weight.

Accordingly, in FIG. 8 the loudness of the sound input to the sound input unit 130 of the voice recognition apparatus 100 and the loudness of the sound input to the sound input unit 203 of the other device 200 are the same, but the sound input to the sound input unit 130 contains less noise, whereas the sound input to the sound input unit 203 contains more noise. The controller 180 may therefore calculate the weights so that the sound of the sound input unit 130 is multiplied by a larger weight and the sound of the sound input unit 203 is multiplied by a smaller weight.

That is, according to an embodiment, even when the sounds are equally loud, different weights may be calculated and applied according to the amount of noise they contain.
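The second method can be sketched as below. The description does not specify how the similarity to the wake-up signal is measured, so this example assumes cosine similarity against a stored reference waveform purely for illustration; `cosine_similarity`, `similarity_weights`, the reference waveform and the sample captures are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sample sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_weights(channels, reference):
    """Give a larger weight to the channel that resembles the reference more."""
    sims = {name: max(0.0, cosine_similarity(samples, reference))
            for name, samples in channels.items()}
    total = sum(sims.values())
    if total == 0:
        # No channel resembles the reference: fall back to equal weights.
        return {name: 1.0 / len(sims) for name in sims}
    return {name: sim / total for name, sim in sims.items()}


reference = [0.0, 1.0, 0.0, -1.0]      # assumed wake-up reference waveform
channels = {
    "130": [0.0, 1.0, 0.1, -1.0],      # capture far from the noise source 300
    "203": [0.5, 1.0, -0.6, -1.0],     # capture close to the noise source 300
}
weights = similarity_weights(channels, reference)
```

In this sketch the cleaner capture at sound input unit 130 scores a higher similarity and therefore receives the larger weight. A practical system would more likely compare acoustic features or keyword-spotting scores than raw waveforms.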

Meanwhile, in the second method described above, the controller 180 may control whichever of the voice recognition apparatus 100 and the other device 200 is closer to the noise source 300 to output a cancelling sound that cancels the noise; in the situation of FIG. 8 this is the other device 200. To this end, the voice recognition unit 160 may extract the noise from the sound input to the sound input units 130 and 203 and recognize its characteristics. Based on these characteristics, the controller 180 may generate a cancelling sound capable of cancelling the noise and may control this cancelling sound to be output through the speaker 120.

Accordingly, whichever of the voice recognition apparatus 100 and the other device 200 is located relatively far from the noise source 300 can be controlled to recognize the wake-up signal, while the one located relatively close to the noise source 300 can generate a cancelling sound that cancels the noise. The device located relatively far from the noise source 300 can thus recognize the wake-up signal with a high recognition rate.
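The idea of a cancelling sound can be illustrated with the simplest possible case, a phase-inverted copy of the estimated noise. This is an assumption made for illustration only: the description says the cancelling sound is generated from the recognized characteristics of the noise but does not say how, and the function name `cancelling_sound` and the sample values are hypothetical.

```python
def cancelling_sound(noise_estimate):
    """Return a phase-inverted copy of the estimated noise samples."""
    return [-sample for sample in noise_estimate]


# Hypothetical noise estimate extracted from the captured sound.
noise = [0.3, -0.1, 0.2, -0.4]
anti_noise = cancelling_sound(noise)

# In this idealized model the noise and the cancelling sound add up to silence.
residual = [n + a for n, a in zip(noise, anti_noise)]
```

The sketch ignores the acoustic path between the speaker and the microphones and any processing delay, both of which a real cancellation system would have to compensate for.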

As a third method of changing the weights, the controller 180 may change the weights in consideration of a situation in which the speaker 120 of the voice input device 100 is outputting sound. This method can be performed while a wake-up signal is being recognized and is described with reference to FIG. 9.

FIG. 9 illustrates a situation in which the voice input device 100 outputs sound through the speaker 120. Referring to FIG. 9, the user 400 is located between the voice recognition apparatus 100 and the other device 200. Assume that in FIG. 9 the loudness of the sound input to the voice recognition apparatus 100 is the same as the loudness of the sound input to the other device 200.

In this case, according to the foregoing, the controller 180 would have to calculate weights of the same value for the sound input unit 130 of the voice recognition apparatus 100 and the sound input unit 203 of the other device 200.

However, the sound input to the sound input unit 130 of the voice recognition apparatus 100 may include the sound output by the speaker 120 of the voice recognition apparatus 100. In terms of recognition rate, it is therefore advantageous to perform voice recognition on the sound input to the other device 200 rather than on the sound input to the voice recognition apparatus 100.

In view of this, the controller 180 may recognize that the speaker 120 of the voice recognition apparatus 100 is outputting sound and, on that basis, calculate the weights so that the sound input to the sound input unit 130 of the voice recognition apparatus 100 is multiplied by a smaller weight and the sound input to the sound input unit 203 of the other device 210 is multiplied by a larger weight.
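The third method can be sketched as follows. The description only states that the apparatus's own channel receives a smaller weight while its speaker is playing; the attenuation factor `penalty`, the renormalization step and the function name `apply_playback_penalty` are hypothetical choices made for this illustration.

```python
def apply_playback_penalty(weights, own_channels, speaker_active, penalty=0.5):
    """Reduce the weights of the apparatus's own channels during playback.

    weights:        channel name -> positive weight calculated from loudness
    own_channels:   names of the channels on the device whose speaker is playing
    speaker_active: True while the speaker 120 is outputting sound
    penalty:        assumed attenuation factor between 0 and 1
    """
    if not speaker_active:
        return dict(weights)
    adjusted = {name: (w * penalty if name in own_channels else w)
                for name, w in weights.items()}
    total = sum(adjusted.values())
    if total > 0:
        # Renormalize so the other device's share grows accordingly.
        adjusted = {name: w / total for name, w in adjusted.items()}
    return adjusted


# Equal loudness at both devices, as assumed for FIG. 9.
weights = {"130": 0.5, "203": 0.5}
during_playback = apply_playback_penalty(weights, {"130"}, speaker_active=True)
while_silent = apply_playback_penalty(weights, {"130"}, speaker_active=False)
```

With equal starting weights, the channel of sound input unit 130 drops to one third and that of sound input unit 203 rises to two thirds during playback; when the speaker is silent the weights are returned unchanged.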

Meanwhile, after calculating the weights according to the third method, the controller 180 may, depending on the situation, lower the volume of the sound output by the speaker 120. Situations in which the volume is lowered include, but are not limited to, a case in which, after the weights have been calculated according to the third method, the voice uttered by the user 400 is picked up more loudly by the voice recognition apparatus 100 than by the other device 200.

That is, the speaker volume can be lowered depending on the situation, in which case the recognition rate for subsequent commands from the user or for the voices of other users can be improved.

Meanwhile, any one of the first to third methods described above may be used alone, or two or more of them may be applied sequentially. This may be set by the user 400 or changed periodically by a predetermined algorithm.

FIG. 10 illustrates the procedure of a voice recognition method according to an embodiment. The method can be performed by the voice recognition apparatus 100 described above. However, at least one of the steps shown in FIG. 11 may be omitted or performed in a different order from that shown, and other steps not shown may also be performed.

Referring to FIG. 11, when a first sound is input through the sound input unit 130 of the voice recognition apparatus 100, the voice recognition unit 160 may derive a characteristic of the first sound, for example its loudness (S100). Although not shown in the drawing, the following steps may precede step S100: the controller 180 searches for other devices 200 and 210 in the vicinity of the voice recognition apparatus 100; once the search is complete, the other devices 200 and 210 are connected to the voice recognition apparatus 100; and once connected, the controller 180 requests the characteristic of the first sound from each of the other devices 200 and 210 through the communication unit 110.

Meanwhile, when a characteristic of a second sound, for example its loudness, is received from each of the other devices 200 and 210 through the communication unit 110, the controller 180 may compare the loudness of the first sound with the loudness of the second sound (S200).

Based on the comparison, the controller 180 may calculate the weights so that the louder sound is multiplied by a larger weight and the quieter sound is multiplied by a smaller weight (S300).

Thereafter, the controller 180 may adjust the weights calculated in step S300 (S400). For example, the controller 180 may adjust the weights in consideration of the time at which the sound was input to the sound input units 130 and 203 (first approach), the similarity between the sound and the wake-up signal (second approach), or whether sound is being output through the speaker 120 (third approach). Any one of the first to third approaches may be considered selectively, or two or more may be considered at the same time. This may be set by the user 400 or changed by a predefined algorithm.

The synthesizer 140 may multiply each sound by the corresponding weight adjusted in step S400 and then synthesize the results (S500).

The controller 180 may control voice recognition to be performed on the sound synthesized in step S500 (S600). For example, the controller 180 may transmit the sound synthesized in step S500 to the voice recognition server 500, where it may be used for voice recognition.
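Steps S100 to S500 can be sketched end to end for two channels as follows. This is a simplified illustration under stated assumptions: RMS level stands in for loudness, the weights are normalized to sum to 1, the adjustment of step S400 is omitted, synthesis is modelled as a sample-by-sample weighted sum, and the names `rms`, `synthesize` and `weigh_and_mix` are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level, used here as the loudness of a sound."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def synthesize(channels, weights):
    """S500: multiply each channel by its weight and sum sample by sample."""
    length = min(len(samples) for samples in channels.values())
    return [sum(weights[name] * channels[name][i] for name in channels)
            for i in range(length)]


def weigh_and_mix(first_sound, second_sound):
    """Run S100 to S500 for a first sound and a second sound."""
    level_first = rms(first_sound)       # S100: loudness of the first sound
    level_second = rms(second_sound)     # loudness reported by the other device
    total = level_first + level_second   # S200: compare the two levels
    if total == 0:
        weights = {"first": 0.5, "second": 0.5}
    else:
        # S300: the louder sound receives the larger weight.
        weights = {"first": level_first / total,
                   "second": level_second / total}
    # S400 (time, similarity or playback adjustments) is omitted in this sketch.
    channels = {"first": first_sound, "second": second_sound}
    return weights, synthesize(channels, weights)


first_sound = [0.1, -0.1, 0.1, -0.1]
second_sound = [0.3, -0.3, 0.3, -0.3]
weights, mixed = weigh_and_mix(first_sound, second_sound)
```

The mixed signal would then be handed to a recognizer in step S600, for example by sending it to the voice recognition server 500; that step is outside the scope of this sketch.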

As described above, according to an embodiment, when each of a plurality of devices receives a sound, each sound may be amplified in consideration of the distance between the sound source and the respective device. Even if one device is far from the sound source, as long as another device is close to it, the sound input to that other device can be amplified more strongly before synthesis, so the recognition rate for the sound emitted by the source can be improved.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from its essential characteristics. The embodiments disclosed herein are therefore intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present invention.

According to an embodiment, even when a user utters a voice far from the voice recognition apparatus, the voice distortion that the distance between the user and the apparatus could cause may be avoided. In addition, even if a noise source is present in the voice input system, its effect on voice recognition can be minimized.

100: voice recognition apparatus
200, 210: other devices
300: noise source
400: user

Claims

A voice recognition method performed by a voice recognition apparatus, the method comprising:
deriving the loudness of a first sound input to a first sound input unit of the voice recognition apparatus;
comparing the loudness of the first sound with the loudness of a second sound when information on the loudness of the second sound, input to a second sound input unit of another device, is received from the other device;
multiplying, as a result of the comparison, the relatively louder sound by a weight having a relatively large value and the relatively quieter sound by a weight having a relatively small value;
synthesizing the first sound and the second sound, each multiplied by its weight; and
controlling voice recognition to be performed on the synthesized first sound and second sound,
wherein the multiplying by the weights comprises,
when a third sound is output through a speaker included in the voice recognition apparatus, multiplying the first sound by a weight having a relatively small value and the second sound by a weight having a relatively large value while the third sound is being output.

The method of claim 1, further comprising:
detecting a first time at which the first sound is input to the first sound input unit and a second time at which the second sound is input to the second sound input unit; and
determining whether the first sound and the second sound are sounds generated by the same sound source at the same time,
wherein the multiplying by the weights comprises,
when the first sound and the second sound are determined to be sounds generated by the same sound source at the same time, comparing the first time with the second time and multiplying whichever of the first sound and the second sound was input later by a weight having a negative sign.

The method of claim 1, further comprising,
before a wake-up signal is recognized by the voice recognition apparatus and when noise from a noise source is included in the first sound and the second sound, deriving the similarity of each of the first sound and the second sound to the wake-up signal,
wherein the multiplying by the weights comprises,
based on the derived similarities, multiplying whichever of the first sound and the second sound has the relatively low similarity by a weight having a relatively small value and the sound having the relatively high similarity by a weight having a relatively large value.

The method of claim 3, further comprising:
recognizing characteristics of the noise; and
causing whichever of the voice recognition apparatus and the other device received the sound having the relatively low similarity to generate and output, based on the recognized characteristics of the noise, a cancelling sound that cancels the noise.

A voice recognition apparatus comprising:
a first sound input unit configured to receive a first sound;
a voice recognition unit configured to derive the loudness of the first sound;
a communication unit configured to receive, from another device including a second sound input unit, information on the loudness of a second sound input to the second sound input unit;
a speaker configured to output sound to the outside; and
a controller configured to compare the loudness of the first sound with the loudness of the second sound, to calculate, as a result of the comparison, a relatively large weight for the relatively louder sound and a relatively small weight for the relatively quieter sound, to synthesize the first sound and the second sound each multiplied by its calculated weight, and to control voice recognition to be performed on the synthesized first sound and second sound,
wherein the controller is configured to,
when a third sound is output from the speaker, multiply the first sound by a weight having a relatively small value and the second sound by a weight having a relatively large value while the third sound is being output.

A computer program stored in a computer-readable recording medium,
the computer program comprising instructions that, when executed by a processor,
cause the processor to perform the method according to any one of claims 1 to 4.