KR20120120858A

KR20120120858A - Service and method for video call, server and terminal thereof

Info

Publication number: KR20120120858A
Application number: KR1020110038666A
Authority: KR
Inventors: 강준규
Original assignee: 강준규
Priority date: 2011-04-25
Filing date: 2011-04-25
Publication date: 2012-11-02

Abstract

PURPOSE: A video call service, a method of providing it, and a video call service providing server are provided to make a video call more entertaining for the user by presenting virtual objects during the call. CONSTITUTION: A control unit (110) recognizes the emotional state of a user from image information obtained by an imaging unit. The control unit extracts emotion information related to the recognized emotional state and transmits the extracted emotion information to a video call service providing server. The control unit receives, from the video call service providing server, an object corresponding to the emotion information of the other party to the video call. The control unit outputs the received object at the position associated with that object in the other party's video. [Reference numerals] (110) Control unit; (120) Imaging unit; (130) Display unit; (140) Communication unit; (150) Key input unit; (160) Storage unit; (170) Voice processing unit

Description

Video call service and its providing method, video call service providing server and terminal for this service {Service and method for video call, server and terminal}

The present invention relates to augmented reality, which provides a user with a blend of a real environment and a virtual environment. In particular, it relates to a video call service, a method of providing it, and a video call service providing server and terminal therefor, which can realize various forms of mixed reality on a video call screen by registering virtual objects (including text), derived from voice, gesture, and facial expression analysis, onto the faces and bodies of both parties to a video call.

An augmented reality environment or, more generally, a mixed reality environment is one in which computer-generated virtual sensations are mixed with real sensations. As proposed by Milgram et al., mixed reality can be placed on a Reality-Virtuality continuum (Milgram, P., Colquhoun Jr., H.: A taxonomy of real and virtual world display integration. In: Tamura, Y. (ed.) Mixed Reality, Merging Real and Virtual Worlds, pp. 1.16. Springer, Berlin (1999)). Whether an environment on this continuum lies closer to the real or to the virtual is determined by how much information is stored in the computer in order to manage that environment. For example, the head-up display in an airplane cockpit is a kind of augmented reality, while overlaying a real photograph of a celebrity's face on a virtual body in a game can be regarded as a kind of augmented virtuality. Haptic reality and haptic virtuality, which apply such visual augmented reality to the sense of touch, are also being studied.

Augmented reality is a branch of virtual reality: a computer graphics technique that composites virtual objects into a real environment so that they appear to be objects existing in that environment. Unlike conventional virtual reality, which deals only with virtual spaces and virtual objects, augmented reality composites virtual objects onto the real world and can thereby supply additional information that is difficult to obtain from the real world alone. Augmented reality technology is currently used in various forms, and is being actively developed, in fields such as broadcasting, advertising, exhibitions, games, theme parks, the military, education, and promotion.

In other words, virtual reality technology excludes interaction with the real world and handles interaction only within a virtual space built in advance. Augmented reality, by contrast, relies on real-time processing: previously acquired information about the real world is displayed superimposed on the image of the real world input through the terminal, which enables interaction with the real world. In this respect it is distinguished from virtual reality, which provides only computer-generated images.

Such augmented reality technology is attracting particular attention in the field of mobile augmented reality for communication terminals, and much research and investment currently goes into marker-based and sensor-based mobile augmented reality. In marker-based mobile augmented reality, when a specific building is photographed, a specific symbol corresponding to that building is photographed along with it, and the building is then identified by recognizing the symbol. Sensor-based mobile augmented reality uses the GPS and digital compass mounted on the terminal to infer the terminal's current position and viewing direction, and overlays POI (Point of Interest) information corresponding to the image in the inferred direction.
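The sensor-based approach described above can be illustrated with a short sketch. It is not taken from this publication: the function names, the 60-degree field of view, and the screen width are assumptions chosen for illustration. It computes the compass bearing from the terminal's GPS position to a POI and, if the POI lies within the camera's horizontal field of view, the horizontal pixel position at which a label could be overlaid.

```python
import math
from typing import Optional


def bearing_deg(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Initial great-circle bearing from point 1 to point 2, in degrees [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360.0


def poi_screen_x(heading: float, bearing: float,
                 fov: float = 60.0, width: int = 640) -> Optional[int]:
    """Horizontal pixel position for a POI label, or None if the POI is out of view.

    heading: compass direction the terminal is facing (degrees).
    fov and width are illustrative assumptions, not values from the source.
    """
    # Signed angular offset of the POI from the view centre, in (-180, 180].
    offset = (bearing - heading + 180.0) % 360.0 - 180.0
    if abs(offset) > fov / 2.0:
        return None
    return round((offset / fov + 0.5) * width)
```

A POI due east of the terminal is drawn at the centre of the screen when the compass heading is also east, and is skipped when the terminal faces west.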

These conventional techniques generally provide information only about buildings or places designated in advance by the service provider, so they cannot give the user appropriate information about objects the service provider has not designated. They also merely infer the current position and the direction in which the terminal is facing, and do not accurately recognize the image input through the terminal. Most current research is therefore confined to improving the accuracy and increasing the quantity of the information provided. Examples include research that aims to provide intuitive and convenient image-recognition-based augmented reality by accurately recognizing real objects in the acquired image and mapping local information onto them, and research that displays, in augmented-reality form, an icon at the position of an object in the input image so that the user can conveniently perceive the location of an object of interest and access its detailed information.

Accordingly, rather than concentrating solely on the development of augmented reality technology itself, there is a need, alongside the advancement of the technology, to develop a variety of applications that use augmented reality to bring enjoyment to users as they use communication terminals in everyday life.

Accordingly, an object of the present invention is to provide a video call service, a method of providing it, and a video call service providing server and terminal therefor, which make a video call more entertaining by providing, together with the call video, virtual objects that express the emotional states of both callers.

Another object of the present invention is to provide a video call service, a method of providing it, and a video call service providing server and terminal therefor, which superimpose a caller's emotional state on the caller's video by means of a virtual object, so that callers can experience augmented reality that lends greater realism to their video call.

Another object of the present invention is to provide a video call service, a method of providing it, and a video call service providing server and terminal therefor, which extract a caller's emotional state from the caller's video call screen and superimpose a virtual object representing the extracted emotional state on the caller's video, so that callers can experience augmented reality that lends greater realism to their video call.

Another object of the present invention is to provide a video call service, a method of providing it, and a video call service providing server and terminal therefor, which extract a caller's emotional state from the caller's voice during the call and superimpose a virtual object representing the extracted emotional state on the caller's video, so that callers can experience augmented reality that lends greater realism to their video call.

To achieve the above objects, the video call service, the method of providing it, and the video call service providing server and terminal according to the present invention build on the following technologies: speech recognition; the face region detection, face region normalization, and in-face feature extraction techniques of face recognition (a form of object recognition); the facial component (expression analysis) relation technique of emotion recognition (a form of object recognition); hand gesture recognition; and motion and behavior recognition. On this basis, they use real-time registration of live-action and virtual images to register virtual objects (including text), derived from gesture and facial expression analysis, onto the faces and bodies of both parties to a video call, thereby realizing through the video call various forms of mixed reality that cannot be seen in the real world.

In addition, the video call service according to the present invention registers in advance expression-analysis relational functions for voice and for specific expressions and gestures of the face and body. When a similar voice, expression, or gesture is transmitted through the video, a virtual object that responds to that voice, expression, or gesture is registered in real time onto the face and body in the output video screen, so that the users can enjoy the video call.
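The publication does not specify how "similar" inputs are matched against the registered functions. As a minimal sketch under that caveat, the example below assumes each registered expression or gesture is stored as a feature vector and that similarity is measured by cosine similarity against a threshold. The registry contents, the object names, and the threshold of 0.8 are all hypothetical.

```python
import math
from typing import Dict, List, Optional

# Hypothetical registry: virtual object name -> pre-registered feature vector
# (for example, values produced by expression or gesture analysis).
REGISTERED: Dict[str, List[float]] = {
    "heart": [0.9, 0.1, 0.0],
    "storm_cloud": [0.0, 0.2, 0.9],
}


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_registered(features: List[float], threshold: float = 0.8) -> Optional[str]:
    """Return the best-matching registered object, or None if nothing is similar enough."""
    best_name, best_score = None, threshold
    for name, template in REGISTERED.items():
        score = cosine_similarity(features, template)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

With this registry, an input close to the first template selects `"heart"`, while an input that resembles neither template returns `None`, so no object is overlaid.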

In addition, to achieve the above objects, the video call service according to the present invention is characterized in that, in a video call service providing terminal having at least an imaging means and a display means, the user's emotional state is extracted from at least one of the user's gestures and facial expressions captured through the imaging means, a virtual object corresponding to the extracted emotional state is generated and superimposed on at least one of the user's body and face, and the result is displayed on the display means of the video call device of the other party to the video call with the user.

Here, the virtual object of the video call service according to the present invention is characterized in that it further includes text.

In addition, the virtual object of the video call service according to the present invention is characterized in that it can be changed by the user.

Here, the virtual object of the video call service according to the present invention is characterized in that it changes in real time in accordance with the emotional state.

Here, the virtual object of the video call service according to the present invention is characterized in that the position at which it is superimposed on the user's body and face changes.

In addition, to achieve the above objects, the method of providing a video call service according to the present invention is characterized by comprising: receiving, at a video call service providing server, image information transmitted from a video call service providing terminal; extracting, by the video call service providing server, the user's emotional state from at least one of the user's gestures and facial expressions contained in the transmitted image information; retrieving, by the video call service providing server, an object corresponding to the extracted emotional state from pre-stored object-related information; superimposing the retrieved object on at least one of the user's body and face; and transmitting the video on which the retrieved object is superimposed to the video call service providing terminal of the other party to the video call with the user.
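The server-side steps listed above can be sketched as a small pipeline. This is an illustrative outline only, not the implementation in the publication: the frame is modeled as a grid of characters, and `Detection`, `OBJECT_STORE`, and the `extract_emotion` callable are hypothetical stand-ins for the emotion recognizer and the pre-stored object-related information.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Frame = List[List[str]]  # toy frame: a grid of single-character "pixels"


@dataclass
class Detection:
    emotion: str             # recognized emotional state, e.g. "joy"
    anchor: Tuple[int, int]  # (row, col) on the user's face or body


# Hypothetical pre-stored object-related information: emotion -> object glyph.
OBJECT_STORE: Dict[str, str] = {"joy": "*", "anger": "#"}


def process_frame(frame: Frame,
                  extract_emotion: Callable[[Frame], Optional[Detection]]) -> Frame:
    """Return the frame to forward to the other party's terminal.

    The received frame is passed in; the caller is responsible for
    transmitting the returned frame.
    """
    detection = extract_emotion(frame)         # extract the emotional state
    if detection is None:
        return frame
    obj = OBJECT_STORE.get(detection.emotion)  # retrieve the matching object
    if obj is None:
        return frame
    out = [row[:] for row in frame]            # copy so the input is not modified
    r, c = detection.anchor
    if 0 <= r < len(out) and 0 <= c < len(out[r]):
        out[r][c] = obj                        # superimpose the object
    return out
```

Passing the recognizer in as a callable keeps the sketch independent of any particular face or gesture analysis method, which the text above leaves open.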

Here, the virtual object of the method of providing a video call service according to the present invention is characterized in that it further includes text.

In addition, the virtual object of the method of providing a video call service according to the present invention is characterized in that it can be changed by the user.

Here, the virtual object of the method of providing a video call service according to the present invention is characterized in that it changes in real time in accordance with the emotional state.

Here, the virtual object of the method of providing a video call service according to the present invention is characterized in that the position at which it is superimposed on the user's body and face changes.

In addition, to achieve the above objects, the video call service providing server according to the present invention is characterized by comprising: a server communication unit that interworks with a video call service providing terminal; and a server control unit that recognizes the user's emotional state from at least one of the user's gestures and facial expressions in the image information received from the video call service providing terminal, compares the recognized emotional state with pre-stored object-related information, extracts an object matching the recognized emotional state, superimposes the extracted object on at least one of the user's body and face, and transmits the result to the counterpart video call service providing terminal that is communicating with the video call service providing terminal.

Here, the video call service providing server according to the present invention is characterized by further comprising a server storage unit that stores object-related data corresponding to the emotional state.

In addition, to achieve the above objects, the video call service providing terminal according to the present invention is characterized by comprising: a display unit that displays the other party's video during a video call and an object superimposed on that video; a communication unit that interworks with a video call service providing server; an imaging unit that acquires the user's image information during the video call; and a control unit that recognizes the user's emotional state from the image information acquired by the imaging unit, extracts emotion information related to the recognized emotional state and transmits it to the video call service providing server, receives from the video call service providing server an object corresponding to the emotion information of the other party to the video call, and outputs the received object to the display unit superimposed on the other party's video at the position associated with the received object.

Here, the display unit of the video call service providing terminal according to the present invention is characterized in that it further displays the user's own video during the video call.

In addition, the video call service providing terminal according to the present invention is characterized by further comprising a key input unit for deciding whether to apply the object.

In addition, the video call service providing terminal according to the present invention is characterized by further comprising a storage unit that stores the object.

In addition, to achieve the above objects, the video call service according to the present invention is characterized in that, in a video call service providing terminal having at least an imaging means, a display means, a voice input means, and a voice output means, the user's emotional state is extracted from the user's voice input through the voice input means and a virtual first object corresponding to that emotional state is generated; the user's emotional state is extracted from at least one of the user's gestures and facial expressions captured through the imaging means and a virtual second object corresponding to that emotional state is generated; and the generated virtual first and second objects are each superimposed and displayed on at least one of the user's body and face as shown on the display means of the video call service providing terminal of the other party to the video call with the user.

Here, the video call service according to the present invention is characterized in that, if the virtual first object and the virtual second object are identical, only one of them is displayed.
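The rule that identical first and second objects are shown only once can be sketched as follows. The function name and the use of strings as object identifiers are assumptions made for illustration; the publication does not define how objects are compared.

```python
from typing import List, Optional


def objects_to_display(first: Optional[str], second: Optional[str]) -> List[str]:
    """Return the objects to overlay, dropping the duplicate when both are the same.

    first:  object derived from the user's voice (None if no emotion was extracted).
    second: object derived from the user's gesture or expression (None likewise).
    """
    shown: List[str] = []
    for obj in (first, second):
        if obj is not None and obj not in shown:
            shown.append(obj)
    return shown
```

For example, `objects_to_display("heart", "heart")` yields a single entry, while two different objects are both kept in voice-then-image order.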

In addition, at least one of the virtual first object and the virtual second object of the video call service according to the present invention is characterized in that it further includes text.

Here, at least one of the virtual first object and the virtual second object of the video call service according to the present invention is characterized in that it can be changed by the user.

Here, at least one of the virtual first object and the virtual second object of the video call service according to the present invention is characterized in that it changes in real time in accordance with the emotional state.

Here, at least one of the virtual first object and the virtual second object of the video call service according to the present invention is characterized in that the position at which it is superimposed on the user's body and face changes.

In addition, to achieve the above objects, the method of providing a video call service according to the present invention is characterized by comprising: receiving, at a video call service providing server, voice information and image information transmitted from a video call service providing terminal; extracting, by the video call service providing server, the user's emotional state from at least one of the transmitted voice information and the user's gestures and facial expressions contained in the image information; retrieving, by the video call service providing server, an object corresponding to the extracted emotional state from pre-stored object-related information; superimposing the retrieved object on at least one of the user's body and face; and transmitting the video on which the retrieved object is superimposed to the video call service providing terminal of the other party to the video call with the user.

Here, the method of providing a video call service according to the present invention is characterized in that, if the virtual first object and the virtual second object are identical, only one of them is displayed.

In addition, at least one of the virtual first object and the virtual second object of the method of providing a video call service according to the present invention is characterized in that it further includes text.

Here, at least one of the virtual first object and the virtual second object of the method of providing a video call service according to the present invention is characterized in that it can be changed by the user.

Here, at least one of the virtual first object and the virtual second object of the method of providing a video call service according to the present invention is characterized in that it changes in real time in accordance with the emotional state.

Here, at least one of the virtual first object and the virtual second object of the method of providing a video call service according to the present invention is characterized in that the position at which it is superimposed on the user's body and face changes.

In addition, to achieve the above objects, the video call service providing server according to the present invention is characterized by comprising: a server communication unit that interworks with a video call service providing terminal; and a server control unit that recognizes the user's emotional state from at least one of the user's voice, gestures, and facial expressions in the voice information and image information received from the video call service providing terminal, compares the recognized emotional state with pre-stored object-related information, extracts an object matching the recognized emotional state, superimposes the extracted object on at least one of the user's body and face, and transmits the result to the counterpart video call service providing terminal that is communicating with the video call service providing terminal.

In addition, to achieve the above objects, the video call service providing terminal according to the present invention is characterized by comprising: a voice input unit that receives the user's voice during a voice call and acquires voice information; a display unit that displays the other party's video during a video call and an object superimposed on that video; a communication unit that interworks with a video call service providing server; an imaging unit that acquires the user's image information during the video call; and a control unit that recognizes the user's emotional state from the voice information acquired by the voice input unit and the image information acquired by the imaging unit, extracts emotion information related to the recognized emotional state and transmits it to the video call service providing server, receives from the video call service providing server an object corresponding to the emotion information of the other party to the video call, and outputs the received object to the display unit superimposed on the other party's video at the position associated with the received object.

Here, the display unit of the video call service providing terminal according to the present invention is characterized in that it further displays the user's own video during the video call.

In addition, the video call service providing terminal according to the present invention is characterized by further comprising a key input unit for deciding whether to apply the object.

In addition, the video call service providing terminal according to the present invention is characterized by further comprising a storage unit that stores the object.

As described above, the present invention realizes through a video call various forms of mixed reality that cannot be seen in the real world, and can thereby make video calls visually rich.

In particular, the present invention makes a video call more entertaining by providing, together with the call video, virtual objects that express the emotional states of both callers.

In addition, by superimposing a caller's emotional state on the caller's video by means of a virtual object, the present invention lets callers experience augmented reality that lends greater realism to their video call, and conveys the callers' emotional states in a fresh way.

Furthermore, by rendering voice and specific expressions and gestures of the face and body as virtual objects on the video call screen, the present invention lets users experience both the virtual and the real.

FIG. 1 is a schematic configuration diagram of a video call service system that provides augmented reality according to an embodiment of the present invention;
FIG. 2 is an internal block diagram of a video call service providing terminal according to an embodiment of the present invention;
FIG. 3 is an internal block diagram of a video call service providing server according to a preferred embodiment of the present invention;
FIG. 4 is a signal processing diagram showing a method of providing a video call service according to a preferred embodiment of the present invention;
FIG. 5 is a control flowchart showing a method of providing a video call service performed in the terminal of FIG. 2 according to an embodiment of the present invention;
FIG. 6 is a conceptual diagram of the method of providing a video call service according to the present invention;
FIGS. 7A and 7B show video call screens to which the video call service according to an embodiment of the present invention is applied;
FIGS. 8A and 8B show the processing procedure of a multi-camera network spatio-temporal auto-configuration technique applied to the video call service according to an embodiment of the present invention;
FIG. 9 shows the processing flow of a face region detection technique for face recognition applied to the video call service according to an embodiment of the present invention;
FIG. 10 shows a screen on which the AdaBoost algorithm, a face detection technique for object extraction, is applied to the video call service according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Note that identical components are denoted by the same reference numerals throughout the drawings wherever possible. Specific details appear in the following description; they are provided to aid a more general understanding of the present invention. In describing the present invention, detailed descriptions of related well-known functions or configurations are omitted where they could unnecessarily obscure the gist of the invention.

FIG. 1 is a schematic configuration diagram of a video call service system that provides augmented reality according to an embodiment of the present invention. As shown in FIG. 1, the video call service system according to an embodiment of the present invention may comprise a video call service providing terminal (100), a video call service providing server (200), and a communication network (300) that connects them.

First, the communication network (300) may be configured regardless of communication mode, wired or wireless, and may take various forms such as a mobile communication network, a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), or a satellite communication network. More specifically, the communication network (300) referred to in the present invention should be understood as a concept that encompasses the well-known World Wide Web (WWW), CDMA (Code Division Multiple Access), WCDMA (Wideband Code Division Multiple Access), and GSM (Global System for Mobile communications) networks, among others.

Next, the video call service providing terminal (100) receives, from the video call service providing server (200) described later, object data corresponding to the user's emotional state as recognized and extracted from an input image captured through a photographing means such as a camera (a concept that covers both a built-in camera and one provided as a peripheral device). The terminal displays the object superimposed on the video call screen by means of augmented reality technology and, at the user's request, may turn the display of the object on or off, change the position or movement of the object, or change its shape.

The video call service providing terminal (100) referred to in the present invention means a digital device that includes a function for connecting to the communication network (300) and communicating over it. Any digital device that has memory means and a microprocessor providing computing capability, such as a personal computer (for example, a desktop computer, notebook computer, or tablet computer), workstation, PDA, web pad, smartphone, or mobile phone, may be adopted as the video call service providing terminal (100) according to the present invention. The detailed internal configuration of the video call service providing terminal (100) is described later.

The video call service providing server (200) communicates with the video call service providing terminal (100) and with other information providing servers (not shown) through the communication network (300), and can thereby provide various types of information at the request of the terminal (100). More specifically, the server (200) may include a web content search engine (not shown) that searches for detailed information corresponding to a request from the terminal (100) and provides the search results so that the user of the terminal (100) can browse them. For example, the server (200) may be the operating server of an Internet search portal site, and the information provided to the terminal (100) through the server (200) may be information matched to a query image or various information about web sites, web documents, knowledge, blogs, cafes, images, videos, news, music, shopping, maps, books, movies, and the like. Of course, if necessary, the information search engine of the server (200) may be included in a computing device or recording medium other than the server (200). The detailed internal configuration of the video call service providing server (200) is also described later.

The following describes the internal configuration of the video call service providing terminal (100), which performs functions important to the implementation of the present invention, and the function of each of its components.

FIG. 2 is an internal block diagram of a video call service providing terminal according to an embodiment of the present invention. Referring to FIG. 2, the video call service providing terminal (100) according to an embodiment of the present invention may include a control unit (110), an imaging unit (120), a display unit (130), a communication unit (140), a key input unit (150), a storage unit (160), and a voice processing unit (170).

Here, at least some of the control unit (110), imaging unit (120), display unit (130), communication unit (140), key input unit (150), storage unit (160), and voice processing unit (170) may be program modules that communicate with the video call service providing terminal (100). Such program modules may be included in the terminal (100) in the form of an operating system, application program modules, and other program modules, and may be physically stored on various well-known storage devices. They may also be stored in a remote storage device that can communicate with the terminal (100). Such program modules include, but are not limited to, routines, subroutines, programs, objects, components, and data structures that perform the particular tasks described later or implement particular abstract data types according to the present invention.

First, according to an embodiment of the present invention, the display unit (130) displays the video of the other party to the video call and the object superimposed on that video, and the communication unit (140) exchanges data and control signals with the video call service providing server (200). The display unit (130) may additionally display the user's own video during the call. The key input unit (150) may be a button or sensor that directly or indirectly generates a preset signal at the user's request; in particular, it may generate a control signal that determines, according to the user's selection, whether the superimposed object is applied to the video call image.

The imaging unit (120) may include a photographing device such as a camera and acquires video information of the user during the video call. The voice processing unit (170) may include a microphone and a speaker to input and output voice. The control unit (110) recognizes the user's emotional state from the voice information input through the voice processing unit (170) and the video information acquired by the imaging unit (120), extracts emotion information related to the recognized emotional state, and transmits it to the video call service providing server (200). The control unit (110) then receives from the server (200) an object corresponding to the emotion information of the other party to the call, superimposes the received object on the other party's video at a position associated with that object, and outputs the result to the display unit (130). The storage unit (160) stores the object.

FIG. 3 is an internal block diagram of a video call service providing server according to a preferred embodiment of the present invention. The video call service providing server (200) may include a server control unit (210), a server communication unit (220), and a server storage unit (230).

The server communication unit (220) exchanges data and control signals with the video call service providing terminal (100). The server control unit (210) recognizes the user's emotional state from at least one of the user's voice, gestures, and facial expressions in the voice information and video information received from the terminal (100), compares the recognized emotional state with pre-stored object-related information, and extracts the object that matches the recognized emotional state. It superimposes the extracted object on at least one of the user's body and face and transmits the result to the counterpart video call service providing terminal that is communicating with the terminal (100). The server storage unit (230) stores the object-related data corresponding to each emotional state.

FIG. 4 is a signal processing diagram showing a video call service providing method according to a preferred embodiment of the present invention. Referring to FIG. 4, the two terminals performing the video call service according to the present invention, a first terminal (100-1) and a second terminal (100-2), each capture video through the imaging unit (120) and receive voice through the voice processing unit (170) (401-1, 401-2), and transmit them to the video call service providing server (200). The server (200) then analyzes the voice, facial expressions, and gestures of the person in each received voice input and captured video and, by referring to previously prepared standard shape models, recognizes the emotional state conveyed by the voice and video (that is, of the person in the video) (403). Next, the server (200) detects, among previously prepared objects, the object corresponding to the emotional state recognized in step 403 (405), detects the position on the video call screen at which the object is to be superimposed (407), and matches the object detected in step 405 onto the video call screen (409).

The video call service providing server (200) then transmits the video matched in step 409 to each counterpart terminal, and the first terminal (100-1) and second terminal (100-2) that receive it each display it on their display unit (130).
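The flow of steps 403 to 409 can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the `Frame` type, the emotion rules, the object table, and the overlay-position rule (centered just above the face box) are all hypothetical stand-ins for the analysis and databases that the description leaves unspecified.

```python
from dataclasses import dataclass

# Hypothetical emotion-to-object table standing in for the pre-stored object data.
OBJECT_TABLE = {"joy": "hearts", "anger": "flames", "sadness": "rain_cloud"}


@dataclass
class Frame:
    sender: str        # terminal that captured the frame, e.g. "100-1"
    expression: str    # stand-in for the output of voice/expression/gesture analysis
    face_box: tuple    # (x, y, width, height) of the detected face


def recognize_emotion(frame):
    """Step 403 (stub): map an analyzed expression to an emotional state."""
    rules = {"smile": "joy", "frown": "anger", "crying": "sadness"}
    return rules.get(frame.expression)


def detect_object(emotion):
    """Step 405: look up the object that matches the recognized emotion."""
    return OBJECT_TABLE.get(emotion)


def detect_position(frame):
    """Step 407: pick an overlay anchor; here, centered just above the face box."""
    x, y, w, h = frame.face_box
    return (x + w // 2, max(0, y - h // 4))


def match(frame):
    """Step 409: build the overlay description sent to the counterpart terminal."""
    obj = detect_object(recognize_emotion(frame))
    if obj is None:
        return {"sender": frame.sender, "overlay": None}
    return {"sender": frame.sender, "overlay": obj, "at": detect_position(frame)}


result = match(Frame("100-1", "smile", (40, 60, 80, 80)))
print(result)
```

Because each step is a separate function, any subset of them could run on the terminal instead of the server without changing the others.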

As described above with reference to FIG. 4, in the video call service providing method that provides augmented reality according to the present invention, the video call service providing server (200) may perform the entire series of steps 403 to 409, namely recognizing the emotional state from the video, extracting the object, detecting the matching position, and matching. Alternatively, the video call service providing terminal (100) may perform some of steps 403 to 409.

FIG. 5 is a control flowchart showing a video call service providing method performed in the terminal of FIG. 2 according to an embodiment of the present invention. Referring to FIG. 5, the video call service providing terminal (100) may also perform all of steps 403 to 409 itself.

As shown in FIG. 5, the control unit (110) of the video call service providing terminal (100) controls the imaging unit (120) to capture video and controls the voice processing unit (170) to receive voice (501), and recognizes the emotional state of the person from the input voice and the captured video (the user's own video when the user is being filmed, or the video captured and transmitted by the other party's terminal during a video call) (503). It then extracts the object corresponding to the recognized emotional state from the storage unit (160) (505), matches the extracted object with the video (507), and outputs the video through the display unit (130). The storage unit (160) may hold object data for various emotional states in advance, or the data may be downloaded over the communication network (300), in real time or beforehand, from the providing server (200) or other servers that provide the relevant object data, and then stored for use.

FIG. 6 is a conceptual diagram of a video call service providing method according to the present invention. The video call service according to the present invention first performs a video input step, which takes in either the video captured by the imaging unit (120) of the video call service providing terminal (100-1) or the video transmitted from the counterpart terminal (100-2) and received through the communication unit (140). Next, a recognition module performs a video recognition step using techniques such as voice recognition, people tracking and camera topology recognition (both object tracking), and face recognition, emotion recognition, hand gesture recognition, and motion and behavior recognition (all object recognition). The emotional state found in the video recognition step through these tracking, recognition, extraction, and detection techniques is analyzed using a standard shape model DB for facial expressions and gestures together with analysis codes that classify emotional states into several levels of intensity. The techniques used in the video recognition step are each described briefly later.

Next, an object detection step is performed, which detects in a synthesis model DB the virtual object corresponding to the emotional state recognized in the video recognition step. The object detected in the object detection step goes through a real-time matching process with the video (the video matching step), and the video with augmented reality applied is output on the display unit (130) of each video call service providing terminal (100-1, 100-2). The present invention thereby enlivens video calls by realizing, within the call, a variety of mixed realities that cannot be seen in the real world.

FIGS. 7A and 7B show video call screens to which a video call service according to an embodiment of the present invention is applied, and illustrate that the service can provide both an automatic-recognition matching mode and a selective-recognition matching mode. In the automatic-recognition matching mode of FIG. 7A, facial expressions and gestures are recognized automatically from the captured or received video and matched on the display unit (130) in real time. In the selective-recognition matching mode of FIG. 7B, the user manually operates or selects expressions and gestures through the key input unit (150) or similar means, and real-time matching is performed accordingly.
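The two matching modes, together with the on/off control exercised through the key input unit (150), can be sketched as a small selection function. The names `MatchMode` and `choose_overlay` and the object labels are hypothetical and serve only to illustrate the distinction.

```python
from enum import Enum


class MatchMode(Enum):
    AUTOMATIC = "automatic"   # FIG. 7A: overlay follows the recognized emotion
    SELECTIVE = "selective"   # FIG. 7B: overlay follows the user's manual choice


def choose_overlay(mode, recognized=None, user_choice=None, enabled=True):
    """Return the object to overlay, or None when nothing should be drawn."""
    if not enabled:                  # user switched the overlay off
        return None
    if mode is MatchMode.AUTOMATIC:
        return recognized            # result of expression/gesture recognition
    return user_choice               # result of key input


print(choose_overlay(MatchMode.AUTOMATIC, recognized="hearts", user_choice="flames"))
print(choose_overlay(MatchMode.SELECTIVE, recognized="hearts", user_choice="flames"))
```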

As shown in FIGS. 6, 7A, and 7B, the video call service according to the present invention makes video calls more enjoyable by presenting, together with the call video, virtual objects that express the emotional states of both callers. By superimposing a caller's emotional state on that caller's video through a virtual object, it lets callers experience augmented reality that adds realism to the call, so that their emotional states are conveyed in a fresh way.

The various techniques applicable to the video call service and its providing method according to the present invention are described below with reference to FIGS. 8A to 10.

First, the voice recognition process, one of the recognition processes performed in the video recognition step, recognizes natural-language speech and converts it into executable commands. By developing standard models of the relevant speech and building a standard database of emotional states expressed through voice, it enables more fine-grained emotion synthesis and expression. In the present invention, the linguistic meaning of the speech is identified by automatic means and an emotion code is extracted from it.

Specifically, voice recognition is the process of taking a speech waveform as input, identifying words or word sequences, and extracting meaning. It is broadly divided into five stages: speech analysis, phoneme recognition, word recognition, sentence interpretation, and meaning extraction. In the narrow sense, the term often covers only the stages from speech analysis through word recognition. As one way of improving the human-machine interface, research and development has long been under way on voice recognition, which inputs information by voice, and speech synthesis, which outputs information by voice.

The ultimate goal of voice recognition is complete speech-to-text conversion: recognizing naturally spoken speech and either accepting it as an executable command or entering it into a document as data. This means developing a speech understanding system that does not merely recognize words but accurately extracts the meaning of continuous speech or sentences using syntactic information, semantic information, and task-related information and knowledge. Research and development of such systems is active worldwide. The present invention can use voice recognition of this kind as a technique for extracting codes for emotion analysis and matching them against the related databases (from which the standard shape models and synthesis models are extracted).

A people tracking technique may also be used in the recognition process performed in the video recognition step. Because people move freely and have no clearly distinguishing features, and because people overlap one another when tracked in public places, people tracking performs occluded object recognition. Overlap between objects is handled with merge-split and straight-through methods, and overlapping objects are recognized using appearance, exit, continuation, merge, and split events. Once the overlap problem has been resolved, people's positions must be expressed as coordinates, so the technique also performs position tracking. The size of an object is computed using calibration information such as the camera's focal length, and its position is expressed in real-world coordinates.
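The description states only that object size is computed from calibration information such as the focal length. One common way to do this, assumed here for illustration, is the pinhole camera model, in which real size equals image size multiplied by distance and divided by focal length. The function name and the sample values are hypothetical, and the distance to the person is taken as known.

```python
def real_size_m(pixel_extent, distance_m, focal_length_px):
    """Pinhole-model estimate of an object's real extent in meters.

    pixel_extent    -- extent of the object in the image, in pixels
    distance_m      -- distance from the camera to the object, in meters (assumed known)
    focal_length_px -- focal length from camera calibration, in pixels
    """
    if focal_length_px <= 0:
        raise ValueError("focal length must be positive")
    return pixel_extent * distance_m / focal_length_px


# A region 200 px tall, seen 3 m away with an 800 px focal length.
height = real_size_m(200, 3.0, 800)
print(height)
```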

People tracking may also be performed with multiple cameras. For overlapping cameras, the tracking algorithm requires camera calibration and computation of the handoff of tracked objects between cameras, which in turn requires the cameras to share a large common field of view (FOV). A Gaussian model for object detection is used to separate foreground pixels from background pixels; the separated foreground pixels are used to determine which detections are the same object, and from this the links between cameras are found. In a non-overlapping multi-camera configuration, object correspondence is verified through an incremental color-similarity learning process.
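The foreground/background separation mentioned above can be illustrated with a single-Gaussian-per-pixel background model. This is a simplified sketch under assumed parameters (learning rate, threshold factor `k`, initial variance, variance floor), not the specific model used in the disclosure; frames are flat lists of grayscale values to keep the example self-contained.

```python
class GaussianBackground:
    """Per-pixel running Gaussian background model (simplified sketch)."""

    def __init__(self, first_frame, learning_rate=0.05, k=2.5, initial_var=25.0):
        self.mean = [float(v) for v in first_frame]
        self.var = [initial_var] * len(first_frame)
        self.lr = learning_rate
        self.k = k

    def apply(self, frame):
        """Return a foreground mask and update the model on background pixels."""
        mask = []
        for i, v in enumerate(frame):
            diff = v - self.mean[i]
            # Foreground if the pixel lies more than k standard deviations from the mean.
            is_fg = diff * diff > (self.k ** 2) * self.var[i]
            mask.append(is_fg)
            if not is_fg:
                self.mean[i] += self.lr * diff
                updated = (1 - self.lr) * self.var[i] + self.lr * diff * diff
                self.var[i] = max(updated, 1.0)  # floor keeps the threshold usable
        return mask


bg = GaussianBackground([100, 100, 100, 100])
mask = bg.apply([101, 99, 180, 100])
print(mask)
```

Only the third pixel deviates strongly from the learned background, so it alone is marked as foreground.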

After color similarity has been judged using CCCM, the links between cameras are determined. A link is determined by computing, as the conditional transition probability of Equation 1, the appearance and reappearance of an object within a given time.

Another of the recognition processes performed in the video recognition step is camera topology recognition. With a travel-transition (transit time) model, the camera positions can be represented visually as the nodes of a graph. A multi-camera network spatio-temporal auto-configuration technique learns autonomously from data analyzed over a certain period, matching objects across the images received from overlapping or non-overlapping cameras to determine the relationships between the cameras. For effective object extraction and recognition, skin regions are extracted with a grid-based method that depends on the distance between the camera and the moving person. FIGS. 8A and 8B show, for the multi-camera network spatio-temporal auto-configuration technique applied to the video call service according to the present invention, the processing sequence when an object appears and the processing sequence when an object exits, respectively. Since this technique is obvious to those skilled in the art, a detailed description is omitted.

Another of the recognition processes performed in the video recognition step is face recognition, which comprises face region detection (Face Detection), face region normalization, and feature extraction within the face region.

FIG. 9 shows the processing steps of a face region detection technique for face recognition applied to a video call service according to an embodiment of the present invention. Referring to FIG. 9, the technique detects the face region in (a), detects the skin color (the red region) in (b), performs a dilation operation in (c), and performs labeling in (d).

In more detail, step (a) uses a Face Detection algorithm, which detects the face region in the whole image and finds the eye, nose, and mouth regions, to extract only the face region from the whole image (the initial input image). Step (b) detects the skin color region and converts it into a binary image for labeling (using an empirical method). Noise is then removed through binary morphological dilation, the center point of each region is found through labeling, and the eye regions are derived.
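Steps (c) and (d) can be sketched in plain Python as follows, starting from a binary mask such as the one step (b) would produce. The 3x3 structuring element and the 4-connected labeling are illustrative assumptions; the description does not specify either, and the function names are hypothetical.

```python
def dilate(mask):
    """Binary dilation with a 3x3 square structuring element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            out[ny][nx] = 1
    return out


def label_centroids(mask):
    """Label 4-connected regions and return each region's (row, col) centroid."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    centroids = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                stack, pixels = [(y, x)], []
                seen[y][x] = True
                while stack:  # flood fill of one region
                    cy, cx = stack.pop()
                    pixels.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                n = len(pixels)
                centroids.append((sum(p[0] for p in pixels) / n,
                                  sum(p[1] for p in pixels) / n))
    return centroids


# Two isolated pixels in a 5x7 binary image.
binary = [[0] * 7 for _ in range(5)]
binary[1][1] = 1
binary[3][5] = 1
centers = label_centroids(dilate(binary))
print(centers)
```

Each isolated pixel grows into a 3x3 block, and the two blocks remain separate regions whose centroids coincide with the original pixels.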

Next, face region normalization, the second part of the technique, addresses the following problem: variations such as head tilt, the frontal angle of the face, and facial expression make faces hard to detect, so detection would otherwise be applicable only when no such variation is present or in special cases. This is resolved with knowledge-based methods, by extending or normalizing the rules to cover these variations.

Finally, feature extraction within the face region, the third part of the technique, uses the Snake (active contour model) algorithm, which automatically finds the contour of an arbitrary object once an initial value is set. The snake model is widely used to segment a specific region of an image; in the present invention it can be applied to find the face and its feature points in a color image. The snake model actively extracts a contour by moving it in the direction that optimizes a globally defined cost.

In addition, the recognition process performed in the image recognition stage uses a facial expression analysis relational function, which recognizes an expression from the shapes of facial components such as the eyes and mouth and from the relationships among them. This makes it possible to recognize expressions such as laughing, smiling, crying, frowning, and surprise, to infer emotions such as happiness and sadness from them, and finally to obtain an emotion recognition result.

In addition, hand-motion recognition, that is, gesture recognition, may be applied in the recognition process performed in the image recognition stage. Gesture recognition recognizes a gesture from the recognized pose images and uses a hidden Markov model (HMM) to do so. HMMs are well suited to modeling the structure of temporally constrained information. The state transition parameters model the sequential occurrence of events, and the observation symbol probability distribution maps the features of each event to a finite set of symbols. An HMM, which combines these two stochastic processes, builds an appropriate gesture model from training data.

In the recognition process, the gesture to be recognized is compared with the HMM gesture models produced by training, the most similar gesture model is selected, and the result is expressed as a probability. Training is performed separately for each gesture, and the trained HMM for that gesture is applied. HMM training builds a hidden Markov model from the hand motions for each digit, using the Baum-Welch algorithm, which is an instance of the EM algorithm. The forward algorithm is then applied to each digit model, and the digit model with the highest probability is output as the final recognition result.
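The recognition step, in which every trained model is scored with the forward algorithm and the most probable one is selected, can be sketched as below. The sketch assumes the models have already been trained (Baum-Welch training is not shown) and that each pose image has been quantized to an integer symbol. The two toy models and their parameters are made up for illustration.

```python
def forward_probability(obs, start, trans, emit):
    """Return P(obs | model) using the forward algorithm.

    obs   : non-empty list of observation symbol indices
    start : start[i]    = probability of starting in state i
    trans : trans[i][j] = probability of moving from state i to state j
    emit  : emit[i][k]  = probability of emitting symbol k in state i
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    for symbol in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
                 for j in range(n)]
    return sum(alpha)

def classify(obs, models):
    """Score obs against every model and return (best_name, all_scores)."""
    scores = {name: forward_probability(obs, *params)
              for name, params in models.items()}
    return max(scores, key=scores.get), scores

# Two hypothetical left-to-right gesture models over two pose symbols (0 and 1).
models = {
    "one": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]),
    "two": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]]),
}
best, scores = classify([0, 0, 1, 1], models)
print(best, scores)
```

For long observation sequences the raw probabilities underflow, so a practical version would scale `alpha` at each step or work with log probabilities.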

Another technique applicable in the recognition process performed in the image recognition stage is motion and behavior recognition, which uses motion history and support vector machines (SVMs). A motion is a process of moving locally or of continuously changing position, whereas a behavior is a situation that arises from, or has already resulted from, various factors. Motion therefore refers to a change of position, and behavior refers to the situational aspect. Accordingly, the form of a motion can be determined by encoding a short-term motion history, and the classifier for the filtered images is built using SVMs.
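One common way to encode a short-term motion history is a motion history image, in which recently moving pixels hold high values that decay over time. The sketch below assumes grayscale frames given as nested lists and uses simple frame differencing; the parameters `tau` and `threshold` are illustrative, and the SVM classifier that would consume the result is not shown.

```python
def update_mhi(mhi, prev_frame, frame, tau=3, threshold=20):
    """Update a motion history image from two consecutive grayscale frames.

    A pixel whose intensity changed by more than `threshold` is set to `tau`;
    every other pixel decays by 1 toward 0, so older motion fades out.
    """
    return [[tau if abs(cur - prev) > threshold else max(0, old - 1)
             for old, prev, cur in zip(mhi_row, prev_row, cur_row)]
            for mhi_row, prev_row, cur_row in zip(mhi, prev_frame, frame)]

# A bright pixel moving to the right across a 1x4 image.
frames = [[[255, 0, 0, 0]], [[0, 255, 0, 0]], [[0, 0, 255, 0]]]
mhi = [[0, 0, 0, 0]]
for prev, cur in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, prev, cur)
print(mhi)  # [[2, 3, 3, 0]]: the oldest motion has the lowest nonzero value
```

The gradient of values in the result encodes the direction of movement, which is what makes it usable as a feature for a classifier.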

Next, the detection technique applicable to the detection process performed in the image recognition stage detects objects at the pixel, blob, and object levels. Once a background model has been built by learning background information, the technique first detects foreground and background pixels at the pixel level in order to extract foreground objects from the background. It then connects adjacent foreground pixels to detect meaningful blobs. Finally, it performs object-level detection, in which the extracted blobs that can be identified as objects of interest are defined as such. For pixel-level detection, background subtraction with a uniform-distribution background model is used: in preprocessing, a uniform background model of the color values of the background region is built through training over a certain period, in order to obtain a background model image in the HSI color space. Blob-level detection uses a grid image technique; although it labels at low resolution, it performs well even in the presence of unexpected artifacts such as noise pixels.
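The pixel-level and blob-level stages can be sketched as follows. This is one possible reading of the description, under several assumptions: a single intensity channel stands in for the HSI color values, the uniform background model is taken to be a per-pixel [min, max] range learned from training frames, and the grid image marks a cell as foreground only when it contains at least `min_count` foreground pixels.

```python
def train_background(frames):
    """Learn a per-pixel [low, high] range from background-only frames."""
    h, w = len(frames[0]), len(frames[0][0])
    low = [[min(f[y][x] for f in frames) for x in range(w)] for y in range(h)]
    high = [[max(f[y][x] for f in frames) for x in range(w)] for y in range(h)]
    return low, high

def foreground_mask(frame, low, high, tol=5):
    """Pixel level: 1 where the value falls outside the learned range."""
    return [[0 if low[y][x] - tol <= v <= high[y][x] + tol else 1
             for x, v in enumerate(row)]
            for y, row in enumerate(frame)]

def grid_image(mask, cell=2, min_count=2):
    """Blob level: downsample the mask into cells; isolated noise pixels
    do not reach `min_count` and are therefore dropped."""
    h, w = len(mask), len(mask[0])
    return [[1 if sum(mask[y][x]
                      for y in range(gy, min(gy + cell, h))
                      for x in range(gx, min(gx + cell, w))) >= min_count else 0
             for gx in range(0, w, cell)]
            for gy in range(0, h, cell)]

# Background varies between 100 and 104; the test frame has a 2x2 object
# in the top-left corner and one isolated noise pixel in the bottom-right.
background = [[[100] * 4 for _ in range(4)], [[104] * 4 for _ in range(4)]]
low, high = train_background(background)
frame = [[200, 200, 100, 100],
         [200, 200, 100, 100],
         [100, 100, 100, 100],
         [100, 100, 100, 200]]
print(grid_image(foreground_mask(frame, low, high)))  # [[1, 0], [0, 0]]
```

Labeling connected cells of the grid image then yields the blobs that are passed on to object-level detection.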

For object-level detection, the AdaBoost algorithm may be applied to detect the face and hand regions, which are the objects of interest, from the set of blobs extracted by blob-level detection. The AdaBoost algorithm is given in Equation 2 below, and FIG. 10 shows a screen in which the AdaBoost algorithm, a face detection technique for object extraction, is applied.
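Equation 2 is not reproduced in this text. For orientation only, the following sketch shows the standard discrete AdaBoost procedure with threshold stumps as weak classifiers; it is not necessarily the form the equation takes. The single scalar feature per blob and the sample values are hypothetical.

```python
import math

def stump(x, threshold, polarity):
    """Weak classifier: returns polarity if x >= threshold, else -polarity."""
    return polarity if x >= threshold else -polarity

def train_adaboost(xs, ys, rounds=3):
    """Train discrete AdaBoost on scalar features xs with labels ys in {+1, -1}.
    Returns a list of (alpha, threshold, polarity) weak classifiers."""
    n = len(xs)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error.
        best = None
        for threshold in sorted(set(xs)):
            for polarity in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump(x, threshold, polarity) != y)
                if best is None or err < best[0]:
                    best = (err, threshold, polarity)
        err, threshold, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, threshold, polarity))
        # Raise the weight of misclassified samples, lower the rest.
        weights = [w * math.exp(-alpha * y * stump(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, x):
    """Strong classifier: sign of the alpha-weighted vote of the stumps."""
    score = sum(alpha * stump(x, t, p) for alpha, t, p in model)
    return 1 if score >= 0 else -1

# Hypothetical blob feature: +1 = object of interest, -1 = other blob.
model = train_adaboost([1.0, 2.0, 3.0, 4.0], [-1, -1, 1, 1])
print(predict(model, 0.5), predict(model, 3.5))
```

In a real detector each blob would be described by many features, and each round would select the best stump across all of them.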

Face detection for object extraction uses knowledge-based, feature-based, and appearance-based methods. The knowledge-based method detects a face on the premise that a human face consists of two eyes, one nose, and a mouth arranged in fixed geometric relationships. It uses histograms of the image: the horizontal-axis histogram is used to find the positions of the eyes, nose, and mouth.
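The histogram step can be illustrated with a projection profile. The sketch assumes the face region has already been reduced to a binary mask in which dark facial features are 1, and it interprets the horizontal-axis histogram as the per-row count of feature pixels; both are assumptions made for this example.

```python
def row_projection(mask):
    """Count the feature pixels in each row of a binary mask."""
    return [sum(row) for row in mask]

def find_bands(profile, min_count=1):
    """Return (first_row, last_row) for each run of rows whose count
    is at least min_count."""
    bands, start = [], None
    for i, value in enumerate(profile):
        if value >= min_count and start is None:
            start = i
        elif value < min_count and start is not None:
            bands.append((start, i - 1))
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands

# Toy 7x7 face mask: eyes on row 1, nose on row 3, mouth on row 5.
face = [[0, 0, 0, 0, 0, 0, 0],
        [0, 1, 1, 0, 1, 1, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 1, 1, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]]
profile = row_projection(face)
print(profile)              # [0, 4, 0, 1, 0, 3, 0]
print(find_bands(profile))  # [(1, 1), (3, 3), (5, 5)]
```

The three bands, ordered from top to bottom, are candidates for the eye, nose, and mouth rows; the knowledge-based rules then check whether their spacing is plausible for a face.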

The feature-based method detects the face region by inferring the size and position of the face from features inherent to faces, such as facial components, color, shape, and size, and decides whether a candidate is a face from the distances between and positions of the facial components. Methods that use the eyes, nose, and mouth, texture, skin color, thresholds, or composite features may be used.

The appearance-based method detects faces using a model learned from a set of training images: sets of face and non-face training images are assembled, the model is trained on them, and detection is then performed.

Next, the real-time registration process registers the live image and the virtual image in real time. When a command key that calls an object is input, the content executor calls the associated object, and the rendering processor outputs it as video.

Because characteristics such as focus, distance, and distortion vary with the lens and the type of camera, the camera matrix is obtained through a process that finds these characteristic values. The three-dimensional position state of the body is then computed from the input image and the camera matrix data. The resulting three-dimensional position state is not only used to move and rotate the virtual object; together with the camera matrix data, it is also used to draw the object accurately on the two-dimensional flat display. This is called registration.
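How a camera matrix and a three-dimensional position state place a virtual object on the two-dimensional display can be illustrated with the pinhole camera model. The sketch below is an assumption about the kind of computation involved, not the method of the invention: `K` is a 3x3 intrinsic matrix, `R` and `t` are the rotation and translation of the body relative to the camera, lens distortion is ignored, and all numeric values are made up.

```python
def project_point(point, K, R, t):
    """Project a 3D point (body coordinates) to 2D pixel coordinates.

    The point is first moved into camera coordinates with R and t,
    then mapped to the image plane with the intrinsic matrix K.
    """
    cam = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")
    uvw = [sum(K[i][j] * cam[j] for j in range(3)) for i in range(3)]
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

# Hypothetical camera: focal length 800 px, principal point (320, 240).
K = [[800.0, 0.0, 320.0],
     [0.0, 800.0, 240.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # no rotation
t = [0.0, 0.0, 0.0]                                      # no translation

# A virtual object anchored 2 m in front of the camera, slightly up and right.
u, v = project_point([0.1, -0.05, 2.0], K, R, t)
print(round(u, 3), round(v, 3))  # 360.0 220.0
```

Projecting every vertex of the virtual object this way, with `R` and `t` updated for each frame, keeps the drawn object aligned with the tracked face or body.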

Two- and three-dimensional objects that express the emotion of an expression or gesture are created and stored in advance through the content executor, and 3D rendering is performed by the rendering processor. Rendering is the process of building a three-dimensional model with 3D computer graphics tools and converting it into a two-dimensional picture for use in, for example, film footage; in other words, it converts a scene into an image. Real-time registration requires rendering with a high processing speed.

The live and virtual images registered by the real-time registration process described above are combined in real time and output to the video display of the device. The video to be displayed, or already displayed, may be stored during this process. H.264/AVC is one example of a format for storing the video.

The H.264 codec is a video compression technology developed jointly by the international standards bodies ITU-T and ISO/IEC. Besides the ITU-T name H.264, it is also known by the ISO/IEC name MPEG-4 Part 10 (AVC). Its transform can be computed with integer arithmetic alone rather than floating-point arithmetic, which reduces computational error. Unlike earlier MPEG-4 video coding, which operates on 8×8 blocks, H.264 operates on 4×4 blocks, allowing more precise comparison, and it can also apply other block sizes such as 16×16, 16×8, 8×16, 8×4, and 4×8 as needed.
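The integer-only 4×4 computation can be illustrated with the forward core transform used in H.264, Y = C X Cᵀ, where C contains only small integers. The sketch below covers only this core transform; the scaling that the standard folds into quantization is omitted, and the sample blocks are made up.

```python
# Forward core transform matrix of the H.264 4x4 integer transform.
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

def matmul(a, b):
    """Multiply two 4x4 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    """Transpose a square matrix given as nested lists."""
    return [list(col) for col in zip(*m)]

def forward_core_transform(block):
    """Compute Y = CF * X * CF^T for a 4x4 block of residual samples.
    Every intermediate value stays an integer."""
    return matmul(matmul(CF, block), transpose(CF))

flat = [[1] * 4 for _ in range(4)]   # a constant residual block
print(forward_core_transform(flat))  # only the top-left (DC) coefficient is nonzero
```

Because the matrix entries are 1 and 2, the products can be implemented with additions and shifts, and the encoder and decoder obtain bit-identical results.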

Unlike earlier codecs, H.264 improves processor efficiency by removing redundancy before the DCT step. It applies an in-loop filter that removes the grid-pattern (blocking) artifacts that used to appear at low bit rates. Because both the encoder and the decoder perform this filtering, there is no difference between the picture quality at encoding time and the picture quality after reconstruction. In addition, unlike prior techniques, which used only the difference from the immediately preceding frame, H.264 also compares against earlier frames and therefore performs well on repetitive video.

In addition, the H.264 codec is more resilient to errors than other codecs, and its compression performance allows it to support optimal service even in environments that demand high data compression ratios, such as mobile services. H.264 has accordingly been adopted in a variety of devices that use Blu-ray and DVD.

As described above, the video call service and the method of providing it, and the video call service providing server and terminal therefor, according to the present invention build on the following techniques: voice recognition; the face detection, face region normalization, and in-face feature extraction techniques of face recognition; the facial component relationship (expression analysis) technique of emotion recognition; hand-motion recognition; and motion and behavior recognition. On this basis, they use real-time registration of live and virtual images to register mixed virtual objects (including text), derived from gesture and expression analysis, onto the faces and bodies of both parties to a video call, thereby realizing through the video call a variety of mixed realities that cannot be seen in the real world.

In addition, the video call service and the method of providing it, and the video call service providing server and terminal therefor, according to the present invention register in advance the expression analysis relational functions for voices and for specific expressions and gestures of the face and body. When a similar voice, expression, or gesture is transmitted in the video, a virtual object that responds to it is registered onto the face and body in real time on the output screen, giving the user great enjoyment in the video call.

Although specific embodiments have been described in the detailed description of the present invention, various modifications are of course possible without departing from the scope of the invention. The scope of the present invention should therefore not be limited to the described embodiments, but should be determined by the claims below and their equivalents.

Claims

A video call service in a video call service providing terminal having at least an imaging means and a display means,
wherein the emotional state of a user is extracted from at least one of a gesture and a facial expression of the user captured by the imaging means, a virtual object corresponding to the extracted emotional state is generated, and the virtual object is superimposed on at least one of the body and the face of the user displayed on the display means of the video call service providing terminal of the other party.

The video call service of claim 1, wherein the virtual object further comprises text.

The video call service of claim 1, wherein the virtual object is changeable by the user.

The video call service of claim 1, wherein the virtual object changes in real time in response to the emotional state.

The video call service of claim 1, wherein the position at which the virtual object is superimposed on the user's body and face is changed.

A video call service providing method comprising:
receiving, at a video call service providing server, video information transmitted from a video call service providing terminal;
extracting, by the video call service providing server, the emotional state of the user from at least one of a gesture and a facial expression of the user included in the transmitted video information;
retrieving, by the video call service providing server, an object corresponding to the extracted emotional state from previously stored object-related information;
superimposing the retrieved object on at least one of the body and the face of the user; and
transmitting the superimposed video to the video call service providing terminal of the other party making a video call with the user.

The video call service providing method of claim 6, wherein the virtual object further comprises text.

The video call service providing method of claim 6, wherein the virtual object is changeable by the user.

The video call service providing method of claim 6, wherein the virtual object changes in real time in response to the emotional state.

The video call service providing method of claim 6, wherein the position at which the virtual object is superimposed on the user's body and face is changed.

A video call service providing server comprising:
a server communication unit that interworks with a video call service providing terminal; and
a server control unit that recognizes the emotional state of the user from at least one of a gesture and a facial expression of the user in the video information received from the video call service providing terminal, compares the recognized emotional state with previously stored object-related information, extracts an object matching the recognized emotional state, superimposes the extracted object on at least one of the body and the face of the user, and transmits the result to the counterpart video call service providing terminal communicating with the video call service providing terminal.

The video call service providing server of claim 11, further comprising a server storage unit that stores object-related data corresponding to the emotional state.

A video call service providing terminal comprising:
a display unit that displays the video of the other party in a video call and an object superimposed on that video;
a communication unit that interworks with a video call service providing server;
an imaging unit that acquires video information of the user during the video call; and
a control unit that recognizes the emotional state of the user from the video information acquired by the imaging unit, extracts emotion information related to the recognized emotional state, transmits the emotion information to the video call service providing server, receives from the video call service providing server an object corresponding to the emotion information of the other party in the video call, superimposes the received object at the position associated with it in the video of the other party, and outputs the result to the display unit.

The video call service providing terminal of claim 13, wherein the display unit further displays the video of the user during the video call.

The video call service providing terminal of claim 13, further comprising a key input unit for determining whether to apply the object.

The video call service providing terminal of claim 13, further comprising a storage unit that stores the object.

A video call service in a video call service providing terminal having at least an imaging means, a display means, a voice input means, and a voice output means, wherein:
the emotional state of the user is extracted from the voice of the user input through the voice input means, and a virtual first object corresponding to the extracted emotional state is generated;
the emotional state of the user is extracted from at least one of a gesture and a facial expression of the user captured by the imaging means, and a virtual second object corresponding to the extracted emotional state is generated; and
the generated first and second objects are superimposed on at least one of the body and the face of the user displayed on the display means of the video call service providing terminal of the other party making the video call.

The video call service of claim 17, wherein only one of the virtual first object and the second object is displayed.

The video call service of claim 17, wherein at least one of the virtual first object and the second object further comprises text.

The video call service of claim 17, wherein at least one of the virtual first object and the second object is changeable by the user.

The video call service of claim 17, wherein at least one of the virtual first object and the second object changes in real time in response to the emotional state.

The video call service of claim 17, wherein the position at which at least one of the virtual first object and the second object is superimposed on the user's body and face is changed.

A video call service providing method comprising:
receiving, at a video call service providing server, voice information and video information transmitted from a video call service providing terminal;
extracting, by the video call service providing server, the emotional state of the user from at least one of the transmitted voice information and a gesture and a facial expression of the user included in the video information;
retrieving, by the video call service providing server, an object corresponding to the extracted emotional state from previously stored object-related information;
superimposing the retrieved object on at least one of the body and the face of the user; and
transmitting the superimposed video to the video call service providing terminal of the other party making a video call with the user.

The video call service providing method of claim 23, further comprising displaying only one of the virtual first object and the second object if they are the same.

The video call service providing method of claim 23, wherein at least one of the virtual first object and the second object further comprises text.

The video call service providing method of claim 23, wherein at least one of the virtual first object and the second object is changeable by the user.

The video call service providing method of claim 23, wherein at least one of the virtual first object and the second object changes in real time in response to the emotional state.

The video call service providing method of claim 23, wherein the position at which at least one of the virtual first object and the second object is superimposed on the user's body and face is changed.

A video call service providing server comprising:
a server communication unit that interworks with a video call service providing terminal; and
a server control unit that recognizes the emotional state of the user from at least one of the voice, a gesture, and a facial expression of the user in the voice information and video information received from the video call service providing terminal, compares the recognized emotional state with previously stored object-related information, extracts an object matching the recognized emotional state, superimposes the extracted object on at least one of the body and the face of the user, and transmits the result to the counterpart video call service providing terminal communicating with the video call service providing terminal.

The video call service providing server of claim 29, further comprising a server storage unit that stores object-related data corresponding to the emotional state.

A video call service providing terminal comprising:
a voice input unit that receives the voice of the user during a voice call and obtains voice information;
a display unit that displays the video of the other party in a video call and an object superimposed on that video;
a communication unit that interworks with a video call service providing server;
an imaging unit that acquires video information of the user during the video call; and
a control unit that recognizes the emotional state of the user from the voice information obtained by the voice input unit and the video information acquired by the imaging unit, extracts emotion information related to the recognized emotional state, transmits the emotion information to the video call service providing server, receives from the video call service providing server an object corresponding to the emotion information of the other party in the video call, superimposes the received object at the position associated with it in the video of the other party, and outputs the result to the display unit.

The video call service providing terminal of claim 31, wherein the display unit further displays the video of the user during the video call.

The video call service providing terminal of claim 31, further comprising a key input unit for determining whether to apply the object.

The video call service providing terminal of claim 31, further comprising a storage unit that stores the object.