KR102111762B1

KR102111762B1 - Apparatus and method for collecting voice

Info

Publication number: KR102111762B1
Application number: KR1020180064456A
Authority: KR
Inventors: 이혜정; 이종민
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2018-06-04
Filing date: 2018-06-04
Publication date: 2020-05-15
Also published as: KR20190138188A

Abstract

A voice collection apparatus according to an embodiment includes: an analysis unit that analyzes the face of a person included in a video; a selection unit that, based on the result of the analysis by the analysis unit, selects a section of voice to be extracted from among the voice included in the video; an extraction unit that extracts, from the voice included in the video, the voice corresponding to the section selected by the selection unit; and a storage unit that stores the voice extracted by the extraction unit in association with information about the person derived from the result of the analysis.

Description

Apparatus and method for collecting voice

The present invention relates to an apparatus and method for collecting voice.

Recently, a variety of media services for media content have been introduced. Through these services, viewers can receive various kinds of information related to media content. For example, a viewer may be provided with information about objects appearing in the media content, such as people, places, background music, or products.

Of these, consider information about a person appearing in media content. Such information may include, for example, the person's profile information or information about the media content in which the person appears.

Information about the point in time at which the person appears in the media content may also be included in the information about the person. With this timing information, it is possible to offer a service that moves the playback position of the media content to the point at which the person appears.

For such information about a person to be provided to viewers, information such as who appears in the media content, in which part of the content the person appears, and what the person's profile is must be acquired and inserted into the media content. There are various ways to acquire this information. For example, the producer or editor of the media content may analyze the content directly, or a video analysis technique such as deep learning may be applied to the media content to identify the persons it contains, after which more specific information about each identified person is obtained from the web or a similar source.

Korean Registered Patent Publication No. 10-1855241 (registered April 30, 2018)

Methods of extracting information about a person appearing in media content include analyzing the video that makes up the media content, analyzing the voice that makes up the media content, and analyzing both the video and the voice.

When the voice is analyzed, or when both the voice and the video are analyzed, a voice database prepared in advance for each person can be used.

Accordingly, an object of the present invention is to provide a technique for collecting, from a variety of media content, the voices of the persons appearing in each item of media content.

A further object is to provide information about a person appearing in given media content by using the voices collected in this way.

However, the objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood from the following description by a person having ordinary skill in the art to which the present invention pertains.

A voice collection method according to an embodiment is performed by a voice collection apparatus and includes: analyzing the face of a person included in a video; selecting, based on the result of the analyzing, a section of voice to be extracted from among the voice included in the video; extracting, from the voice included in the video, the voice corresponding to the selected section; and storing the extracted voice in association with information about the person derived from the result of the analyzing.

A computer program stored in a computer-readable recording medium according to an embodiment is programmed to perform: analyzing the face of a person included in a video; selecting, based on the result of the analyzing, a section of voice to be extracted from among the voice included in the video; extracting, from the voice included in the video, the voice corresponding to the selected section; and storing the extracted voice in association with information about the person derived from the result of the analyzing.

According to an embodiment, the voice of each person can be obtained in the form of a database together with information about that person. Previously it was not easy to collect each person's voice separately. According to an embodiment, however, once a video is given, the voice of each person can be collected from that video automatically or mechanically by the voice collection apparatus and compiled into a database. In addition, not only each person's voice itself but also unique information characterizing that voice can be obtained in the form of a database together with information about the person.

The voices compiled into a database in this way can later be used in a process that identifies a person from voice alone, or in a process that identifies a person by combining voice and video, in which case the accuracy or speed of person identification can be improved. For example, even if a person in a video has had cosmetic surgery or is wearing makeup, or even if the person cannot be identified by face recognition alone (for instance, when the person has turned around or is running quickly, or when the person's face in the video is too small to be identified), using the voice database according to an embodiment makes it possible to obtain, accurately and quickly, information about which person appears in the video or in which part of the video each person appears.

FIG. 1 shows a screen on which a video is being played on a video playback device.
FIG. 2 shows a system to which a voice collection apparatus according to an embodiment is applied.
FIG. 3 shows the configuration of the voice collection apparatus shown in FIG. 2.
FIG. 4 illustrates, by way of example, a video, a plurality of scenes constituting the video, and the sections of voice collected from among the voice included in the video.
FIG. 5 illustrates, by way of example, the procedure of a voice collection method according to an embodiment.
FIG. 6 shows the procedure of the voice collection method of FIG. 5 in more detail.

The advantages and features of the present invention, and the methods of achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to a person having ordinary skill in the art to which the present invention pertains; the present invention is defined only by the scope of the claims.

In describing the embodiments of the present invention, a detailed description of known functions or configurations is omitted where it is judged that it could unnecessarily obscure the gist of the present invention. The terms used below are defined in consideration of their functions in the embodiments of the present invention and may vary according to the intention or practice of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

FIG. 1 shows a screen on which a video is being played on the video playback device 10. In the following, a 'video' refers to one kind of 'media content' and may take the form of sound (voice) combined with still images or moving images.

Referring to FIG. 1, the video played on the video playback device 10 includes at least one object 20, 30. Metadata is assigned to each of the objects 20 and 30. For example, for an object 20 that is a person, the metadata may include the person's name, gender, and date of birth, when the person appears in the video, or information about other videos in which the person appears. For an object 30 that is a thing, the metadata may include the name of the thing, its price, or where it is sold. However, the types of metadata and the types of objects to which metadata is assigned are not limited to those described above.

FIG. 2 shows a system to which the voice collection apparatus 100 according to an embodiment is applied. FIG. 2 is merely an example, and the voice collection apparatus 100 should not be interpreted as being applicable only to the system shown in FIG. 2.

Referring to FIG. 2, the video providing server 50 stores videos. Examples of such videos include dramas, news, show programs, and movies, but are not limited thereto. The video providing server 50 may be broadcasting equipment provided or installed at a broadcasting station or the like.

The video providing server 50 provides videos to the voice collection apparatus 100. The voice collection apparatus 100 uses these videos to build a voice database for persons.

Even after the voice database for persons has been built in the voice collection apparatus 100, the video providing server 50 provides videos to the voice collection apparatus 100. In this case, the purpose of providing a video may be to obtain from the voice collection apparatus 100 information about the persons appearing in that video, but is not limited thereto.

The video playback device 10 is a device that plays a video to which metadata about a person has been assigned. The video playback device 10 may be, for example, a TV, a computer, or a smart device, but is not limited thereto.

The video played by the video playback device 10 may have been provided to it by the video providing server 50 or directly by the voice collection apparatus 100.

The voice collection apparatus 100 is now described. The voice collection apparatus 100 receives various videos from the video providing server 50, identifies who the persons appearing in each video are, extracts the voice of each identified person, and stores the extracted voice in the form of a database together with information about that person. In other words, the voice collection apparatus 100 collects persons' voices and processes them into a database.

In addition, when a plurality of voices have been gathered for the same person, the voice collection apparatus 100 may apply various techniques, such as voice fingerprinting, to these voices to extract unique information that identifies the voice, and then store this unique information in the database described above.

Furthermore, when given a video, the voice collection apparatus 100 may extract metadata such as who the persons appearing in the video are, and then either provide this metadata to the video providing server 50 or assign the metadata to the video and return the video itself to the video providing server 50. The voice collection apparatus 100 is described in more detail below.

FIG. 3 shows the configuration of the voice collection apparatus 100 according to an embodiment; what is shown in FIG. 3 is merely an example. The voice collection apparatus 100 may be implemented on a PC, a server, or the like.

Referring to FIG. 3 together with FIG. 2, the voice collection apparatus 100 includes a storage unit 120, an analysis unit 130, a selection unit 140, and an extraction unit 150 and, depending on the embodiment, may include at least one of a communication unit 110, a unique information extraction unit 160, and a person identification unit 170. Although not shown in FIG. 3, various other components needed to implement the voice collection apparatus 100 may also be included.

First, the communication unit 110 can be implemented as a wired or wireless communication module that transmits and receives data. Through the communication unit 110, the voice collection apparatus 100 can exchange data with the video providing server 50 or the video playback device 10.

The storage unit 120 can be implemented as a memory or the like that stores data. The storage unit 120 stores, in the form of a database, the voice of each of a plurality of persons together with information about the corresponding person, and may additionally store unique information extracted from each person's voice. The stored information is not limited to these items.

Each of the analysis unit 130, the selection unit 140, the extraction unit 150, the unique information extraction unit 160, and the person identification unit 170 shown in FIG. 3 can be implemented by a memory that stores instructions programmed to perform the functions described below and a microprocessor that executes those instructions.

The analysis unit 130 is described first. The analysis unit 130 analyzes a video. The video to be analyzed may be one that the voice collection apparatus 100 has received from the video providing server 50 through the communication unit 110, but is not limited thereto.

The analysis unit 130 may divide the video to be analyzed into a plurality of still images, that is, a plurality of scenes.

The analysis unit 130 may also extract a feature vector from each of the scenes divided in this way. Using the extracted feature vectors, the analysis unit 130 can analyze whether each scene includes a face and, if so, how many faces it includes. In addition, the analysis unit 130 can obtain information about who the person corresponding to each face is by using an image identification algorithm such as a convolutional neural network, and, once a person has been identified, can track whether that face continues to appear in the following scenes.

Furthermore, using the feature vectors described above, the analysis unit 130 can analyze whether the lips of a face included in each image are moving and, if so, whether the degree of movement exceeds a predetermined threshold. Lip movement can be determined by considering the feature vectors of a predetermined number of scenes located before and after the scene in question. The analysis unit 130 can also use the feature vectors to analyze the direction in which a face included in each image is directed on the screen where the image is displayed (for example, whether it faces the front).
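The lip-movement check over neighboring scenes can be sketched as follows. This is a minimal illustration, not the method of the patent: it assumes each scene has already been reduced to a single hypothetical number, the lip opening, whereas the text only says that feature vectors of the surrounding scenes are considered. The names `lip_movement_score` and `is_speaking`, the window size, and the threshold value are all assumptions.

```python
from typing import List


def lip_movement_score(openings: List[float], index: int, window: int = 2) -> float:
    """Mean absolute scene-to-scene change of the lip opening around `index`.

    `openings` holds one hypothetical lip-opening value per scene.
    `window` is the number of scenes considered before and after `index`.
    """
    lo = max(0, index - window)
    hi = min(len(openings) - 1, index + window)
    diffs = [abs(openings[i + 1] - openings[i]) for i in range(lo, hi)]
    return sum(diffs) / len(diffs) if diffs else 0.0


def is_speaking(openings: List[float], index: int,
                window: int = 2, threshold: float = 0.05) -> bool:
    """True when the movement around `index` reaches the (assumed) threshold."""
    return lip_movement_score(openings, index, window) >= threshold
```

A scene whose neighbors all show the same lip opening scores 0.0 and is treated as silent, while alternating open and closed values produce a high score.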

The process of dividing a video into a plurality of scenes, the process of extracting a feature vector from each scene, and the process of using feature vectors to identify or analyze the person corresponding to a face are themselves known techniques, so their description is omitted.

Moreover, dividing the video into a plurality of scenes and extracting a feature vector from each scene is only one exemplary way for the analysis unit 130 to perform video analysis. Depending on the embodiment, the analysis described above may be performed without dividing the video into scenes, or the video may be divided into scenes but the analysis performed without extracting feature vectors from them.

Based on the result of the analysis by the analysis unit 130, the selection unit 140 selects at least one section of voice to be extracted from among the voice included in the video. Here, a 'section of voice' is the interval between a start point and an end point of extraction. The length of a section may vary. For example, in a video with a running time of 2 hours, the selection unit 140 may select a plurality of voice sections, such as a first voice section corresponding to the scenes from 1 minute 10 seconds to 1 minute 15 seconds, a second voice section corresponding to the scenes from 1 minute 30 seconds to 1 minute 40 seconds, and a third voice section corresponding to the scenes from 3 minutes 30 seconds to 3 minutes 45 seconds. However, the number of sections (three), the start and end points of each section, and the section lengths (5, 10, and 15 seconds) are merely examples. Depending on the embodiment, the length of a voice section may be on the order of 0.1 or 0.01 seconds, or of 1 or 10 minutes.
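A section of voice defined by a start point and an end point can be represented as a small value type. The sketch below is illustrative only; the class name `VoiceSection` and the choice of seconds as the unit are assumptions. The three example sections are the ones given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceSection:
    """A section of voice, in seconds from the start of the video."""
    start: float
    end: float

    def __post_init__(self) -> None:
        if self.start < 0 or self.end <= self.start:
            raise ValueError("a section needs 0 <= start < end")

    @property
    def duration(self) -> float:
        return self.end - self.start


# The three sections from the example in the text.
EXAMPLE_SECTIONS = [
    VoiceSection(70.0, 75.0),    # 1 min 10 s to 1 min 15 s
    VoiceSection(90.0, 100.0),   # 1 min 30 s to 1 min 40 s
    VoiceSection(210.0, 225.0),  # 3 min 30 s to 3 min 45 s
]
```

Storing times as floating-point seconds accommodates both the sub-second and the multi-minute section lengths mentioned above.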

When selecting the sections of voice to be extracted, the selection unit 140 may use the criteria below, which are merely examples.

First, the selection unit 140 may consider how many faces each scene includes, as determined by the analysis unit 130.

Consider first a scene that the analysis shows to contain one face. If the scene corresponds to any one of the following examples, the selection unit 140 may select the section of voice corresponding to the scene as an object of extraction; if it corresponds to none of them, the scene may be excluded from selection.

Example 1: A scene in which the lips of the face shown in the scene move by at least a predetermined threshold.

Example 2: A scene in which the lips of the face shown in the scene move by at least a predetermined threshold and the face is directed toward the front of the screen on which the scene is displayed.

Example 3: A scene in which the face shown in a scene corresponding to Example 2 continues to appear in subsequent scenes with lip movement of at least the predetermined threshold (in Example 3, a scene may be selected even if the face is not directed toward the front of the screen).

Next, consider a scene in which several faces appear. In this case, the selection unit 140 checks whether only one of the lips belonging to the plurality of faces shown in the scene moves by at least the predetermined threshold. If two or more lips move by at least the threshold, the selection unit 140 excludes the scene from selection. If only one does, the selection unit 140 may select the scene to be extracted on the basis of the three examples described above for a scene with one face.
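The selection criteria above can be sketched as a single pass over the scenes. This is one possible reading, not the patent's implementation: the `Face` record, its `track_id` field (a hypothetical identifier produced by face tracking), and the `require_frontal` switch are assumptions. With `require_frontal=False` the function applies Example 1; with `require_frontal=True` it applies Example 2 and lets a selected face carry over into following scenes as in Example 3.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Face:
    track_id: int      # hypothetical id assigned by face tracking
    lips_moving: bool  # lip movement at or above the threshold
    frontal: bool      # face directed toward the front of the screen


def speaking_face(faces: List[Face]) -> Optional[Face]:
    """Return the only face with moving lips, or None if there are zero or several."""
    moving = [f for f in faces if f.lips_moving]
    return moving[0] if len(moving) == 1 else None


def select_scenes(scenes: List[List[Face]],
                  require_frontal: bool = True) -> List[Optional[int]]:
    """For each scene, return the speaker's track id if selected, else None."""
    selected: List[Optional[int]] = []
    previous: Optional[int] = None  # speaker selected in the preceding scene
    for faces in scenes:
        speaker = speaking_face(faces)
        chosen: Optional[int] = None
        if speaker is not None:
            if not require_frontal or speaker.frontal:
                chosen = speaker.track_id  # Example 1 or Example 2
            elif previous == speaker.track_id:
                chosen = speaker.track_id  # Example 3: same face continues
        selected.append(chosen)
        previous = chosen
    return selected
```

Scenes in which two or more faces have moving lips yield `None`, which matches the exclusion rule for multi-face scenes.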

The extraction unit 150 extracts, from the voice included in the video, the voice corresponding to each section that the selection unit 140 has selected for extraction. The technique by which the extraction unit 150 extracts the voice corresponding to a given section from the voice included in a video is itself a known technique, so its description is omitted.
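As a rough illustration of extracting the voice for one section, the sketch below slices a sequence of audio samples by time. It assumes the audio track has already been decoded into a flat list of samples at a known sample rate; decoding itself is outside the sketch, and the function name `extract_section` is an assumption.

```python
from typing import List, Sequence


def extract_section(samples: Sequence[float], sample_rate: int,
                    start: float, end: float) -> List[float]:
    """Return the samples between `start` and `end` (seconds) of a decoded track."""
    if start < 0 or end <= start:
        raise ValueError("a section needs 0 <= start < end")
    first = int(round(start * sample_rate))
    last = int(round(end * sample_rate))
    # Slicing clamps to the available samples if the section runs past the end.
    return list(samples[first:last])
```

For a 5-second section at a sample rate of 16,000 Hz, the returned list holds 80,000 samples, provided the track is long enough.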

The storage unit 120 stores the voice extracted by the extraction unit 150, in the form of a database, in association with information about the person who uttered it. The information about the person who uttered the extracted voice may be provided by the analysis unit 130.
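Storing each extracted voice together with information about the person could look like the following sketch, which uses Python's standard `sqlite3` module. The table layout, the column names, and the helper functions are assumptions made for illustration; the text does not specify a schema or a database engine.

```python
import sqlite3
from typing import List, Tuple


def open_voice_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a database with one row per extracted voice section."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS voice (
               id INTEGER PRIMARY KEY,
               person_name TEXT NOT NULL,
               source_video TEXT NOT NULL,
               start_sec REAL NOT NULL,
               end_sec REAL NOT NULL,
               audio BLOB NOT NULL
           )"""
    )
    return conn


def store_voice(conn: sqlite3.Connection, person_name: str, source_video: str,
                start_sec: float, end_sec: float, audio: bytes) -> int:
    """Insert one voice section and return its row id."""
    cur = conn.execute(
        "INSERT INTO voice (person_name, source_video, start_sec, end_sec, audio) "
        "VALUES (?, ?, ?, ?, ?)",
        (person_name, source_video, start_sec, end_sec, audio),
    )
    conn.commit()
    return cur.lastrowid


def voices_of(conn: sqlite3.Connection, person_name: str) -> List[Tuple[float, float, bytes]]:
    """Return (start_sec, end_sec, audio) for every stored voice of one person."""
    rows = conn.execute(
        "SELECT start_sec, end_sec, audio FROM voice WHERE person_name = ? ORDER BY id",
        (person_name,),
    )
    return rows.fetchall()
```

Keeping the source video and time range next to the audio makes it possible to trace each stored voice back to the scenes it came from.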

That is, according to an embodiment, the voice of each person can be obtained in the form of a database together with information about that person. Previously it was not easy to collect each person's voice separately. According to an embodiment, however, once a video is given, the voice of each person can be collected from that video automatically or mechanically by the voice collection apparatus and compiled into a database.

FIG. 4 conceptually illustrates the result of applying the technique according to an embodiment: the video to be analyzed, a plurality of scenes divided from the video, and the sections of voice selected for extraction from among the voice corresponding to each of the scenes.

As described above, the analysis unit 130 divides the video 131 into a plurality of scenes, and FIG. 4 shows some of these scenes (a1 to a8) that are adjacent in time. Based on the analysis result for each of the scenes a1 to a8, the selection unit 140 selects the sections of voice to be extracted from among the voice included in the video 131. The selected sections are section b1 corresponding to scene a1, section b2 corresponding to scene a3, section b3 corresponding to scenes a4 to a6, and section b4 corresponding to scene a8.

Here, sections b2 and b3 are separate from each other even though they correspond to adjacent scenes (a3, and a4 to a6). This may mean, for example, that the person appearing in scene a3 (section b2) differs from the person appearing in scenes a4 to a6 (section b3), so that the voice of section b2 and the voice of section b3 must be kept distinct.
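Grouping consecutive scenes into sections, while keeping adjacent scenes with different speakers apart, can be sketched with `itertools.groupby`. The input is a hypothetical per-scene speaker label (`None` for scenes not selected), and the labels "A", "B", and "C" are invented for illustration; the text only states that the person in a3 differs from the person in a4 to a6.

```python
from itertools import groupby
from typing import Hashable, List, Optional, Tuple


def group_sections(speakers: List[Optional[Hashable]]) -> List[Tuple[int, int, Hashable]]:
    """Merge runs of consecutive scenes with the same speaker into sections.

    Returns (first_scene_index, last_scene_index, speaker) for each run,
    skipping runs whose speaker is None.
    """
    sections = []
    index = 0
    for speaker, run in groupby(speakers):
        length = len(list(run))
        if speaker is not None:
            sections.append((index, index + length - 1, speaker))
        index += length
    return sections


# Scenes a1..a8 of FIG. 4 as indices 0..7, with assumed speaker labels.
FIG4_SPEAKERS = ["A", None, "B", "C", "C", "C", None, "A"]
```

With these labels, the function yields four sections that correspond to b1 (a1), b2 (a3), b3 (a4 to a6), and b4 (a8), and the boundary between b2 and b3 arises because the speaker label changes.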

Referring back to FIG. 3, the unique information extraction unit 160 extracts unique information about each person's voice using the plurality of voices of each person stored in the storage unit 120. A known technique such as voice fingerprinting may be used for this, but the embodiment is not limited thereto. The extracted unique information is matched to each person and stored in the storage unit 120.

That is, according to an embodiment, not only each person's voice itself but also the unique information characterizing that voice can be obtained, together with information about the person, in the form of a database.
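As a rough illustration of deriving one piece of unique information per person from several stored voices, the sketch below averages per-clip feature vectors. It is not a voice fingerprinting algorithm: the feature vectors are assumed to have been computed elsewhere, and `build_unique_info` and `mean_vector` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def mean_vector(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_unique_info(clips: List[Tuple[str, List[float]]]) -> Dict[str, List[float]]:
    """Map each person to the mean feature vector of that person's stored clips.

    `clips` is a list of (person, feature_vector) pairs; how the feature
    vectors are obtained is outside this sketch.
    """
    by_person: Dict[str, List[List[float]]] = defaultdict(list)
    for person, features in clips:
        by_person[person].append(features)
    return {person: mean_vector(vs) for person, vs in by_person.items()}
```

Averaging is chosen here only because it is easy to follow; any representation that can be matched to a person would fit the description equally well.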

Meanwhile, the voice collection device 100 according to an embodiment may collect a person's voice, or unique information extracted from that voice, and provide it in the form of a database; in addition, it may provide the following function. For example, when an arbitrary image is given by the image providing server 50, the device may extract metadata such as information about the persons appearing in the image, and then either provide the metadata itself to the image providing server 50 or attach the metadata to the image and return the image to the image providing server 50. This is described with reference to the person discrimination unit 170 shown in FIG. 3.

The person discrimination unit 170 may be trained by machine learning or deep learning. The input data for training is the voice stored in the storage unit 120, and the correct answer data may be the information about the person stored in the storage unit 120 in correspondence with that voice. After training is complete, when a given image is input to the person discrimination unit 170, the unit may output who the person appearing in that image is.

Here, the process by which the person discrimination unit 170 is trained and the process by which it operates after training are themselves already known in the field of machine learning and deep learning, so their description is omitted.
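To make the input and output of such a unit concrete, the sketch below uses a simple nearest-match lookup by cosine similarity as a stand-in for the trained model. This is a substitution for illustration only, not the machine learning or deep learning model the text refers to. The names `cosine` and `identify`, and the assumption that a voice feature vector has already been obtained from the input, are hypothetical.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def identify(query: List[float], known: Dict[str, List[float]]) -> str:
    """Return the person whose stored vector is most similar to the query."""
    return max(known, key=lambda person: cosine(query, known[person]))
```

A trained classifier would replace `identify`; the labeled pairs stored in the storage unit 120 (voice as input, person information as the correct answer) would serve as its training set.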

FIG. 5 illustrates the procedure of a voice collection method according to an embodiment. The voice collection method illustrated in FIG. 5 may be performed by the voice collection device 100 described above.

The procedure illustrated in FIG. 5 is merely exemplary, so the spirit of the present invention is not to be construed as limited to what is shown in FIG. 5. Depending on the embodiment, the steps may be performed in an order different from that shown in FIG. 5, at least one of the steps shown in FIG. 5 may be omitted, and steps not shown in FIG. 5 may additionally be performed.

First, a step (S100) of analyzing the face of a person included in an image is performed.

Thereafter, a step (S200) of selecting, based on the result analyzed in step S100, a section of voice to be extracted from the voice included in the image is performed.

Thereafter, a step (S300) of extracting the voice corresponding to the section selected in step S200 from the voice included in the image is performed.

Thereafter, a step (S400) of storing the voice extracted in step S300 in association with information about the person derived from the result analyzed in step S100 is performed.
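The four steps S100 to S400 can be sketched as a small pipeline. The data layout is an assumption made for the example: each scene is a dict with an `"interval"` of (start, end) seconds, the audio is a flat list of samples, and `analyze` is a caller-supplied function that returns the speaking person's label or `None`. None of these names come from the patent.

```python
from typing import Callable, Dict, List, Optional


def collect_voice(
    scenes: List[dict],
    audio: List[float],
    sample_rate: int,
    analyze: Callable[[dict], Optional[str]],
) -> Dict[str, List[List[float]]]:
    """Run S100 to S400 over pre-split scenes and return clips grouped by person."""
    # S100: analyze the face(s) in every scene (delegated to `analyze`).
    results = [analyze(scene) for scene in scenes]

    # S200: keep only the intervals of scenes for which a speaker was determined.
    selected = [
        (scene["interval"], person)
        for scene, person in zip(scenes, results)
        if person is not None
    ]

    store: Dict[str, List[List[float]]] = {}
    for (start, end), person in selected:
        # S300: cut the matching samples out of the audio track.
        clip = audio[int(start * sample_rate):int(end * sample_rate)]
        # S400: store the clip in association with the person.
        store.setdefault(person, []).append(clip)
    return store
```

Passing `analyze` in as a parameter keeps the face analysis, which the text leaves open, separate from the selection, extraction, and storage steps.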

Hereinafter, FIG. 6, which shows one embodiment that further subdivides the voice collection method illustrated in FIG. 5, is described.

Referring to FIG. 6 together with FIGS. 1 to 5, the voice collection device 100 receives an image to be analyzed from the image providing server 50 through the communication unit 110 (S1000).

The analysis unit 130 divides the image received in step S1000 into a plurality of still images, that is, a plurality of scenes (S1050).
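Because each scene must later be matched to a section of voice, one simple way to picture step S1050 is to give every still image the time interval it covers. The sketch below assumes stills are taken at a fixed rate; the function name `scene_intervals` and the fixed-rate assumption are illustrative and not stated in the text.

```python
from typing import List, Tuple


def scene_intervals(num_scenes: int, scenes_per_second: float) -> List[Tuple[float, float]]:
    """Return the (start, end) time in seconds covered by each still image."""
    return [
        (i / scenes_per_second, (i + 1) / scenes_per_second)
        for i in range(num_scenes)
    ]
```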

In addition, although not shown in FIG. 6, the analysis unit 130 extracts a feature vector from each of the scenes divided in step S1050.

Based on the feature vector extracted for each scene, the analysis unit 130 may analyze whether each scene contains a face and, if so, how many faces it contains (S1100). The following information may additionally be analyzed in step S1100: which person each face in a scene belongs to, whether the lips of each face are moving, whether the degree of any lip movement exceeds a predetermined criterion, and which direction each face is facing (for example, toward the front) on the screen on which the image is displayed.
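The per-scene output of step S1100 could be held in a structure like the one below. The field names, the scalar `lip_movement` score, and the value of `LIP_THRESHOLD` are all assumptions for illustration; the text specifies only what kinds of information are analyzed, not how they are represented or measured.

```python
from dataclasses import dataclass, field
from typing import List, Optional

LIP_THRESHOLD = 0.5  # hypothetical value for the "predetermined criterion"


@dataclass
class FaceAnalysis:
    person: Optional[str]   # who the face belongs to, if recognized
    lip_movement: float     # assumed scalar score of lip movement
    is_frontal: bool        # whether the face is facing the front of the screen

    def is_speaking(self, threshold: float = LIP_THRESHOLD) -> bool:
        """True if lip movement is at or above the criterion."""
        return self.lip_movement >= threshold


@dataclass
class SceneAnalysis:
    faces: List[FaceAnalysis] = field(default_factory=list)

    @property
    def face_count(self) -> int:
        """Number of faces found in the scene (0 if none)."""
        return len(self.faces)
```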

In step S1150, the selection unit 140 selects one scene that has not previously been selected from among the plurality of scenes divided in step S1050.

The following describes the case where the scene selected in step S1150 contains one face (S1200). If the scene selected in step S1150 corresponds to any one of the following examples, the selection unit 140 may select the section of voice corresponding to that scene as an extraction target. If the scene does not correspond to any of the examples, the selection unit 140 may exclude it from selection, and the process then moves to step S1400.

Second example (step S1210): a scene in which the lips of the face shown in the scene move by at least a predetermined criterion and, at the same time, the face is facing the front of the screen on which the scene is displayed.

Here, step S1210 in FIG. 6 is shown as corresponding only to the second example and not to the first and third examples, but the spirit of the present invention is not to be construed as limited to what is shown in FIG. 6. For example, depending on the embodiment, a scene matching the first example or the third example may be selected in step S1210, unlike what is shown in FIG. 6.
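The second example amounts to a two-part test on a single-face scene. The sketch below covers only that example, since the first and third examples are not spelled out in this passage. The function name, the scalar movement score, and the default threshold are hypothetical.

```python
def matches_second_example(
    lip_movement: float,
    is_frontal: bool,
    threshold: float = 0.5,  # hypothetical "predetermined criterion"
) -> bool:
    """True if the lips move at or above the criterion AND the face is frontal."""
    return lip_movement >= threshold and is_frontal
```

A scene for which this returns `False` would be excluded from selection under the second example alone, for instance a face with strong lip movement that is turned away from the screen.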

Next, a scene in which several faces appear is described (S1300). The selection unit 140 checks whether only one of the lips belonging to the plurality of faces shown in the scene moves by at least the predetermined criterion. If two or more lips move by at least the predetermined criterion, the selection unit 140 excludes the scene from selection. If only one lip moves by at least the predetermined criterion, the selection unit 140 selects the scene as an extraction target based on the three examples described above for a scene with one face (S1310), and the extraction unit 150 extracts the section of voice corresponding to the selected scene from the voice included in the image (S1320).
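The multi-face rule can be sketched as counting how many faces have lip movement at or above the criterion. In this illustration each face is a hypothetical `(person, lip_movement, is_frontal)` tuple, only the second single-face example (frontal face) is applied to the one moving face, and a scene in which no lips move is treated as not selected, which the text does not state explicitly.

```python
from typing import List, Optional, Tuple

Face = Tuple[str, float, bool]  # (person, lip_movement, is_frontal), all assumed


def select_in_multi_face_scene(faces: List[Face], threshold: float = 0.5) -> Optional[str]:
    """Return the single speaker of a multi-face scene, or None to exclude the scene."""
    moving = [face for face in faces if face[1] >= threshold]
    if len(moving) != 1:
        # Two or more moving lips: excluded per the text.
        # Zero moving lips: excluded by assumption in this sketch.
        return None
    person, _, is_frontal = moving[0]
    # Apply the single-face check (second example only) to the moving face.
    return person if is_frontal else None
```

Excluding scenes with two or more moving lips avoids storing overlapping speech under a single person's name.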

The voice extracted by the extraction unit 150 in step S1220 or step S1320 is stored in the storage unit 120 together with the information of the person matched to it (S1230 or S1330).

After step S1230 or S1330, the selection unit 140 checks whether any of the scenes divided in step S1050 has not yet been selected in step S1150. If there is such a scene, the process returns to step S1150. If there is none, the unique information extraction unit 160 extracts, based on each person's voice stored in the storage unit 120, unique information that identifies each person (S1500).
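The control flow of FIG. 6 around steps S1150 to S1500 can be sketched as a loop over the scenes that have not yet been picked. This is an illustration under assumptions: `run_collection` is a hypothetical name, `select` stands in for the single-face and multi-face branches, `clips[i]` is assumed to be the voice section already aligned with scene `i`, and the mapping of the loop check to step S1400 is inferred from the text rather than stated.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Sequence


def run_collection(
    scenes: Sequence[object],
    clips: Sequence[str],
    select: Callable[[object], Optional[str]],
) -> Dict[str, List[str]]:
    """Visit every scene once and store the clips of the selected scenes by person."""
    store: Dict[str, List[str]] = {}
    pending = deque(range(len(scenes)))   # scenes not yet chosen in S1150
    while pending:                        # remaining-scene check (presumably S1400)
        i = pending.popleft()             # S1150: pick an unvisited scene
        person = select(scenes[i])        # S1200 / S1300 branches
        if person is None:
            continue                      # scene excluded from selection
        # Extraction (S1220 / S1320) and storage (S1230 / S1330).
        store.setdefault(person, []).append(clips[i])
    # S1500 would follow here: derive unique information from `store`.
    return store
```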

Meanwhile, the voice collection method illustrated in FIGS. 5 and 6 may be implemented on substantially the same technical idea as the voice collection device 100 illustrated in FIG. 3. For anything omitted from the description of FIGS. 5 and 6, the description of the voice collection device 100 given with reference to FIGS. 2 to 4 applies.

Meanwhile, each step of the voice collection method according to the embodiment described above may be implemented in a computer-readable recording medium that records a computer program programmed to perform the step.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within a scope equivalent thereto should be interpreted as falling within the scope of rights of the present invention.

According to an embodiment, each person's voice can be obtained, together with information about that person, in the form of a database. Previously it was not easy to collect each person's voice separately. According to an embodiment, however, once an image is given, each person's voice can be collected from it automatically or mechanically by the voice collection device and compiled into a database.

100: voice collection device

Claims

An analysis unit analyzing faces of at least one person included in the image;
A selection unit for selecting a section of the voice to be extracted among the voices of the at least one person included in the image based on the results analyzed by the analysis unit;
An extraction unit for extracting a voice corresponding to a section selected as an object of extraction by the selection unit from the audio included in the image;
And a storage unit for storing the voice extracted by the extraction unit in association with information on the person derived from the analyzed result,
The analysis unit,
Analyzing whether the lips included in the face of the at least one person included in the image moves above a predetermined criterion,
The selection unit,
When the lips included in the faces of a plurality of persons in the first scene of the image move above a predetermined criterion, the first scene is not selected as a section of the voice to be extracted.
Voice collection device.

According to claim 1,
The analysis unit,
Analyzing the movement of the lips included in the face
Voice collection device.

According to claim 2,
The analysis unit,
On the screen where the image is displayed, the direction of the face is additionally analyzed.
Voice collection device.

According to claim 3,
The selection unit,
According to the analysis of the analysis unit, a section in which the direction the face faces is analyzed as being front to the screen is selected as a section of speech to be extracted.
Voice collection device.

According to claim 1,
The selection unit,
The selection process is performed on a scene in which the number of faces in the scene included in the image is one.
Voice collection device.

According to claim 1,
The selection unit,
When one of the lips included in each of the faces of the plurality of persons appearing in the second scene of the image moves above the predetermined reference, performing the selection process for the second scene
Voice collection device.

According to claim 1,
Further comprising a person discrimination unit that is trained using the voice stored in the storage unit as input data and the information about the person stored in the storage unit in correspondence with the input data as correct answer data, and that, when a predetermined voice is input, outputs information about the person corresponding to the input voice
Voice collection device.

According to claim 1,
Further comprising a unique information extraction unit for extracting unique information from the voice stored in the storage unit, and matching the extracted unique information with information about the person corresponding to the voice from which the unique information was extracted
Voice collection device.

A speech collection method performed by a speech collection device,
Analyzing a face of at least one person included in the image;
Selecting a section of the voice to be extracted among the voices of the at least one person included in the image based on the analyzed result in the analyzing step;
Extracting a voice corresponding to the selected section from the voice included in the video;
And storing the extracted voice in association with information on a person derived from the analyzed result,
The step of analyzing the face of the person,
And analyzing whether the lips included in the face of the at least one person included in the image move above a predetermined criterion,
The step of selecting the section of the voice,
When the lips included in the faces of a plurality of persons in the first scene of the image move above a predetermined criterion, the first scene is not selected as a section of the voice to be extracted.
Voice collection method.

A computer program stored on a computer-readable recording medium,
The computer program,
Analyzing a face of at least one person included in the image;
Selecting a section of the voice to be extracted among the voices of the at least one person included in the image based on the analyzed result in the analyzing step;
Extracting a voice corresponding to the selected section from the voice included in the video;
And storing the extracted voice in association with information on a person derived from the analyzed result,
The step of analyzing the face of the person,
And analyzing whether the lips included in the face of the at least one person included in the image move above a predetermined criterion,
The step of selecting the section of the voice,
When the lips included in the faces of a plurality of persons in the first scene of the image move above a predetermined criterion, the first scene is not selected as a section of the voice to be extracted, the computer program containing instructions for causing a processor to perform a method including the above steps,
Computer program.