KR20240009862A

KR20240009862A - A method for detecting a deepfake image and an electronic device for the method

Info

Publication number: KR20240009862A
Application number: KR1020230069274A
Authority: KR
Inventors: 김병규; 최영주
Original assignee: 숙명여자대학교산학협력단
Priority date: 2022-07-14
Filing date: 2023-05-30
Publication date: 2024-01-23

Abstract

The present invention provides a deepfake image detecting method which comprises the following steps of: coupling a patch embedding function based on a segmented face image and a CNN function based on a convolutional neural network; inputting a class token learned by a true label, a distillation token based on learning through comparison of a teacher network inference result with the true label, and a global pooling function to which the patch embedding function and the CNN function are coupled into a transformer; and performing machine learning for detecting a deepfake image based on a loss function outputted by the transformer.

Description

Deepfake image detection method and electronic device therefor {A METHOD FOR DETECTING A DEEPFAKE IMAGE AND AN ELECTRONIC DEVICE FOR THE METHOD}

The present invention provides a deepfake image detection method that can distinguish fake images from real images through machine learning, and an electronic device therefor.

Harm caused by forged or altered images, such as deepfake images, is increasing rapidly with the advent of the Internet of Things era. The victims range from celebrities to the general public, and forged images are used in various crimes such as digital sex crimes, illegal copying, and copyright infringement. Furthermore, harm caused by forged videos (e.g., deepfake videos), in which images are forged frame by frame, is also increasing as computing devices advance.

Forged images and forged videos produced with advanced artificial intelligence technology are difficult to identify as forged or altered with the naked eye. In addition, these forgery and alteration techniques are evolving rapidly along with the development of artificial intelligence technology.

Therefore, to prevent harm caused by forged images and forged videos, a technology for identifying forged images and forged videos is required.

The present invention provides a method for improving the accuracy of deepfake image detection in order to prevent harm caused by forged images and forged videos.

The present invention provides a deepfake image detection method comprising: combining a patch embedding function based on a segmented face image with a CNN function based on a convolutional neural network; inputting into a transformer a class token trained with a true label, a distillation token trained by comparing a teacher network inference result with the true label, and a global pooling function in which the patch embedding function and the CNN function are combined; and performing machine learning for detecting deepfake images based on a loss function output by the transformer.

According to one embodiment, the deepfake image detection method may include extracting a face image from a video using a multitask cascaded convolutional network (MTCNN) model, and setting a specific region of the face image for segmenting the image.

According to one embodiment, the patch embedding function may be determined based on a patch, which is a segmented image, and the exponential number of the segmented patches, and the CNN function may be determined based on the number of CNN features.

According to one embodiment, the global pooling function may be determined by concatenating the patch embedding function and the CNN function channel by channel.

According to one embodiment, the loss function is determined based on a loss function for a fake image and a loss function for a real image, and the step of performing machine learning may perform machine learning by setting a normalization weight with the loss function.

According to one embodiment, the loss function for the fake image may be determined based on the odds ratio of the teacher network, the odds ratio of the class token, and the odds ratio of the distillation token for the fake image, and the loss function for the real image may be determined based on the odds ratio of the teacher network, the odds ratio of the class token, and the odds ratio of the distillation token for the real image.

The present invention provides an electronic device comprising: an operation unit that combines a patch embedding function based on a segmented face image with a CNN function based on a convolutional neural network; a transformer that receives as input a class token trained with a true label, a distillation token trained by comparing a teacher network inference result with the true label, and a global pooling function in which the patch embedding function and the CNN function are combined; and a machine learning performing unit that performs machine learning for detecting deepfake images based on a loss function output by the transformer.

According to one embodiment, the electronic device may further include a preprocessor that extracts a face image from a video using a multitask cascaded convolutional network (MTCNN) model and sets a specific region of the face image for segmenting the image.

According to one embodiment, the loss function is determined based on a loss function for a fake image and a loss function for a real image, and the performing unit may perform machine learning by setting a normalization weight with the loss function.

The present invention provides a method that can improve the accuracy of deepfake image detection by applying the distillation token, together with both the local features and the global features of the extracted image, to a learning model for deepfake detection.

Figure 1 is a diagram illustrating the deepfake image detection process disclosed in the present invention.
Figure 2 is a flowchart illustrating the deepfake image detection method disclosed in the present invention.
Figure 3 is a diagram for explaining the process of deriving the global pooling function disclosed in the present invention.
Figure 4 is a diagram for explaining the loss function derivation process disclosed in the present invention.
Figure 5 is a graph comparing the loss function when using a distillation token according to an embodiment disclosed in the present invention and when according to the prior art.
Figure 6 is a graph comparing the loss function for a fake image and the loss function for a real image when using the deepfake detection algorithm according to an embodiment disclosed in the present invention and when using the prior art.
Figure 7 is a block diagram of an electronic device according to an embodiment disclosed in the present invention.

Since the present invention may be modified in various ways and may have various embodiments, specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to specific embodiments, and it should be understood to include all changes, equivalents, and substitutes falling within the spirit and technical scope of the present invention. In describing each drawing, similar reference numerals are used for similar components.

Terms such as first, second, A, and B may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be named a second component, and similarly, a second component may be named a first component. The term "and/or" includes a combination of a plurality of related listed items or any one of a plurality of related listed items.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may exist in between. On the other hand, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no other components exist in between.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the technical field to which the present invention pertains. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related technology, and, unless explicitly defined in this application, should not be interpreted in an idealized or excessively formal sense. Hereinafter, preferred embodiments of the present invention are described in detail with reference to the attached drawings.

Figure 1 is a diagram illustrating the deepfake image detection process disclosed in the present invention.

According to one embodiment, the deepfake image detection process may include a process of setting a specific region of an image on which deepfake analysis is to be performed, a process of preprocessing the image of the set region to augment the data, a process of training by applying the augmented data to a machine learning model, and a deepfake detection process of determining, based on the training result, whether a detected image is a real image or a fake image.

According to one embodiment, the machine learning model for deepfake image detection may use a vision transformer (ViT) model and a convolutional neural network (CNN) model. According to various embodiments, the ViT model assigns an order to the patches into which the image is divided, and the position information of the patches within the image may be connected through a multi-head self-attention layer (MSL). Therefore, using the ViT model may make it possible to learn relationships between different parts of an image.
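How self-attention relates different patches to one another can be illustrated with a minimal single-head sketch. This is generic scaled dot-product attention, not code from the present disclosure; the function name, the weight matrices `w_q`, `w_k`, `w_v`, and all sizes are assumptions chosen for illustration.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (tokens, dim) matrix."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens, dim = 6, 8                      # 6 patch tokens, 8 features each (illustrative)
x = rng.standard_normal((tokens, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Row i of `weights` indicates how strongly patch i attends to every other patch, which is the mechanism that lets the model relate distant parts of the face.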

According to one embodiment, the MSL of the ViT model can embed global information about the entire image. According to various embodiments, the ViT model may contain more global information than a ResNet in its lower layers.

According to one embodiment, the CNN model contains information about the values of neighboring pixels in the image, so it can detect cases where neighboring pixels become unnatural due to image synthesis. According to various embodiments, both local information and global spatial information can be obtained by combining the CNN model with the patch embedding of the ViT model.

Figure 2 is a flowchart illustrating the deepfake image detection method disclosed in the present invention. According to one embodiment, the flowchart shown in Figure 2 may be performed by the electronic device shown in Figure 7.

According to one embodiment, in step S210, a face image may be extracted from a video using a multitask cascaded convolutional network (MTCNN) model. According to various embodiments, in step S220, a specific region for segmenting the image may be set in the face image extracted in step S210.
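The region-setting part of steps S210 and S220 can be sketched as a simple crop. The MTCNN detector itself is not shown; the sketch assumes a hypothetical bounding box `(x1, y1, x2, y2)` already produced by some face detector, and the `margin` parameter is likewise an illustrative assumption.

```python
import numpy as np

def crop_face(frame, box, margin=0.2):
    """Crop a face region from an H x W x C frame.

    box is a hypothetical (x1, y1, x2, y2) detector output in pixels;
    margin enlarges the box by a fraction of its width and height.
    """
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    dx = int((x2 - x1) * margin)
    dy = int((y2 - y1) * margin)
    # Clamp the enlarged box to the frame boundaries.
    left, top = max(0, x1 - dx), max(0, y1 - dy)
    right, bottom = min(w, x2 + dx), min(h, y2 + dy)
    return frame[top:bottom, left:right]

frame = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for one video frame
face = crop_face(frame, box=(300, 100, 400, 220))
```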

According to one embodiment, landmarks may be generated from the extracted face image, and differences in the structural identity of the face between a real image and a fake image may be extracted. According to various embodiments, landmarks are generated from the extracted face image so that part of the face image can be cut out, which makes the model generalize better. Steps S210 and S220 may be included in the preprocessing process shown in Figure 1, and this preprocessing can prevent the learning model from overfitting.
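One way to cut out part of a face image around landmarks is sketched below. The landmark coordinates, the `pad` parameter, and the choice to fill the region with zeros are all assumptions for illustration; the disclosure does not specify how the cut-out is performed.

```python
import numpy as np

def cutout_region(face, landmarks, pad=4):
    """Zero out the bounding box around a set of (x, y) landmark points."""
    h, w = face.shape[:2]
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    left, right = max(0, min(xs) - pad), min(w, max(xs) + pad + 1)
    top, bottom = max(0, min(ys) - pad), min(h, max(ys) + pad + 1)
    out = face.copy()                     # leave the input image untouched
    out[top:bottom, left:right] = 0
    return out

face = np.full((64, 64, 3), 255, dtype=np.uint8)
eye_points = [(20, 24), (28, 22), (36, 24)]   # hypothetical landmark coordinates
augmented = cutout_region(face, eye_points)
```

Training on images with such regions removed discourages the model from relying on any single facial area.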

According to one embodiment, in step S230, a patch embedding function based on the segmented face image may be combined with the CNN function. According to various embodiments, the face image may be divided into patches of a desired size, and the patches may be embedded by a single CNN layer using a kernel of the patch size.
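Splitting an image into patches and embedding them can be sketched as follows. A convolution whose kernel size and stride both equal the patch size is equivalent to flattening each patch and multiplying it by one shared matrix, which is what the sketch does. The image size, patch size, embedding width, and the random matrix `E` standing in for a learned embedding are illustrative assumptions.

```python
import numpy as np

def patchify(image, patch):
    """Split an H x W x C image into non-overlapping patch x patch tiles,
    returned as an (N, patch*patch*C) matrix of flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    rows, cols = h // patch, w // patch
    tiles = image.reshape(rows, patch, cols, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)       # (rows, cols, patch, patch, c)
    return tiles.reshape(rows * cols, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
x_p = patchify(image, patch=16)                    # N = 196 flattened patches
E = rng.standard_normal((16 * 16 * 3, 64)) * 0.02  # random stand-in for a learned embedding
z_p = x_p @ E                                      # one embedded row per patch
```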

According to one embodiment, the patch embedding function may determine the features of face patches, and the CNN function may determine the overall features of the face image. According to the present invention, both the patch embedding function and the CNN function are considered in machine learning, so deepfake image detection performance can be improved compared to a machine learning model that considers only the patch embedding function. Step S230 is described in more detail below with reference to Figure 3.

According to one embodiment, in step S240, a class token trained with a true label, a distillation token trained by comparing a teacher network inference result with the true label, and a global pooling function in which the patch embedding function and the CNN function are combined may be input to a transformer. According to various embodiments, the features of the image input to the transformer may interact patch by patch, and the CNN function may interact with the neighboring features of the image.

According to one embodiment, the teacher network may refer to a structure in which a prediction is made by an existing well-trained network and compared with the true label in order to train a student network. According to various embodiments, the class token may be trained with the true label, and the distillation token may be trained based on the prediction of the teacher network. Step S240 is described in more detail below with reference to Figure 4.

According to one embodiment, in step S250, machine learning for detecting deepfake images may be performed based on the loss function output by the transformer. According to various embodiments, machine learning may be performed so as to minimize the loss function.

According to one embodiment, a deepfake image may be detected in step S260 using the machine learning model trained in step S250.

Figure 3 is a diagram illustrating the process of deriving the global pooling function disclosed in the present invention. According to one embodiment, the derivation process shown in Figure 3 may be performed by the electronic device shown in Figure 7.

According to one embodiment, the patch embedding function may be defined as in Formula 1 below.

[Formula 1]

In Formula 1, Z_p is the patch embedding function, x_p is a patch, E is a learnable embedding, and N is the exponential number of the segmented patches.

According to various embodiments, the CNN function may be defined as in Formula 2 below.

[Formula 2]

In Formula 2, Z_f is the CNN function, x_f is the segmented image after it has passed through EfficientNet, and M is the number of CNN features.

According to one embodiment, the CNN function and the patch embedding function may be concatenated channel by channel, and the result of this concatenation may be applied to global pooling.
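Channel-wise concatenation followed by global pooling can be sketched as below. The sketch assumes that both feature sets are laid out over the same grid of positions and that the pooling is an average over positions; neither the shapes nor the pooling type are taken from the disclosure, and the random arrays are stand-ins for real features.

```python
import numpy as np

rng = np.random.default_rng(0)
n_positions = 49                                  # assumed shared 7 x 7 grid
z_p = rng.standard_normal((n_positions, 64))      # stand-in patch-embedding features
z_f = rng.standard_normal((n_positions, 32))      # stand-in CNN features

# Channel-wise concatenation: positions stay aligned, channels are stacked.
z_cat = np.concatenate([z_p, z_f], axis=1)        # shape (49, 96)

# Global average pooling: one value per channel, averaged over all positions.
z_pooled = z_cat.mean(axis=0)                     # shape (96,)
```

After concatenation, every position carries both the patch-level and the convolutional description of the same image area, which is how the local and global information end up in a single representation.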

Figure 4 is a diagram illustrating the process of deriving the loss function disclosed in the present invention. According to one embodiment, the derivation process shown in Figure 4 may be performed by the electronic device shown in Figure 7.

According to one embodiment, the class token, the distillation token, and the result obtained through global pooling may be input to the transformer. According to various embodiments, this result is obtained by applying global pooling to the concatenated result, and may be an input vector of (N+M, N) that is fed to the transformer.

According to one embodiment, the final input value fed to the transformer may be defined as in Formula 3 below.

[Formula 3]

In Formula 3, Z₀ is the final input value, x_class is the class token, x_distillation is the distillation token, and E_pos is a learnable position embedding; the remaining term is the result obtained through global pooling.
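The assembly of the transformer input from these terms can be sketched as follows. The token order (class token first, distillation token last), the name `z_g` for the pooled features, and all sizes are assumptions made for illustration; they follow the order in which the symbols are defined above rather than a formula reproduced from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 96, 10
z_g = rng.standard_normal((n_tokens, d_model))    # stand-in for the global pooling result
x_class = np.zeros((1, d_model))                  # class token (learned in practice)
x_distillation = np.zeros((1, d_model))           # distillation token (learned in practice)
E_pos = rng.standard_normal((n_tokens + 2, d_model)) * 0.02  # position embedding

# Z0 = [x_class; z_g; x_distillation] + E_pos  (token order is an assumption)
Z0 = np.concatenate([x_class, z_g, x_distillation], axis=0) + E_pos
```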

According to various embodiments, the final loss function may be determined based on the loss function for the real image and the loss function for the fake image, and more specifically may be determined through Formula 4 below.

[Formula 4]

Formula 4 relates three quantities: the final loss function, the loss function for the fake image, and the loss function for the real image.

According to one embodiment, the loss function for the fake image may be defined as in Formula 5 below.

[Formula 5]

The terms of Formula 5 are the loss function for the fake image; λ, a variable that can be set by the user; the binary cross entropy loss; the logit of the teacher model for fake prediction; the logit of the distillation token for fake prediction; the logit of the class token for fake prediction; and y and σ, which are sigma functions.

According to various embodiments, the loss function for the real image may be defined as in Formula 6 below.

[Formula 6]

The terms of Formula 6 are the loss function for the real image; λ, a variable that can be set by the user; the binary cross entropy loss; the logit of the teacher model for real prediction; the logit of the distillation token for real prediction; the logit of the class token for real prediction; and y and σ, which are sigma functions.
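A loss built from the terms listed above can be sketched as follows. The exact form of Formulas 4 to 6 is not reproduced in this text, so the sketch assumes a form familiar from token-based distillation: the class-token logit is scored against the true label, the distillation-token logit against the teacher's sigmoid output, λ (`lam`) balances the two terms, and the final loss is the plain sum of the fake and real branches. The function names and numeric values are illustrative, and the sketch reads y as the true label and σ as the sigmoid function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, target, eps=1e-7):
    """Binary cross entropy between a probability p and a target in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def branch_loss(z_class, z_distill, z_teacher, y, lam=0.5):
    """Loss for one branch (fake or real), under the assumed form described above."""
    supervised = bce(sigmoid(z_class), y)                    # class token vs. true label
    distilled = bce(sigmoid(z_distill), sigmoid(z_teacher))  # distillation token vs. teacher
    return (1.0 - lam) * supervised + lam * distilled

# One fake sample: the fake-branch label is 1 and the real-branch label is 0.
loss_fake = branch_loss(z_class=2.0, z_distill=1.5, z_teacher=3.0, y=1.0)
loss_real = branch_loss(z_class=-2.0, z_distill=-1.5, z_teacher=-3.0, y=0.0)
total_loss = loss_fake + loss_real   # assumed combination of the two branches
```

Setting `lam` to 0 reduces each branch to ordinary supervised training on the true label, while larger values give the teacher's prediction more influence.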

Figure 5 is a graph comparing the loss function when a distillation token is used according to an embodiment disclosed in the present invention with the loss function according to the prior art.

According to the graph shown in Figure 5, the deviation of the loss grows as the weight on the fake loss increases, and the loss is relatively lower when the distillation token is applied than when it is not. This indicates that applying the distillation token improves the performance of the deepfake detection learning model.

Figure 6 is a graph comparing the loss function for fake images and the loss function for real images when the deepfake detection algorithm according to an embodiment disclosed in the present invention is used and when the prior art is used.

According to the graph shown in Figure 6, the loss function for fake images and the loss function for real images obtained with the learning model proposed in the present invention are smaller than those obtained with a conventional SOTA model. This indicates that the proposed learning model improves the accuracy of deepfake detection compared to the prior art.

Figure 7 is a block diagram of an electronic device according to an embodiment disclosed in the present invention.

According to one embodiment, the electronic device (700) may include: a preprocessor (710) that extracts a face image from a video using the MTCNN model and sets a specific region of the face image for segmenting the image; an operation unit (720) that combines a patch embedding function based on the segmented face image with a CNN function based on a convolutional neural network; a transformer (730) that receives as input a class token trained with a true label, a distillation token trained by comparing a teacher network inference result with the true label, and a global pooling function in which the patch embedding function and the CNN function are combined; and a machine learning performing unit (740) that performs machine learning for detecting deepfake images based on the loss function output by the transformer (730).

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the technical field to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments described in the present invention are intended to explain, not to limit, the technical idea of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within an equivalent scope should be construed as being included in the scope of rights of the present invention.

Claims

A method for detecting a deepfake image, comprising:
combining a patch embedding function based on a segmented face image with a CNN function based on a convolutional neural network;
inputting into a transformer a class token trained with a true label, a distillation token trained by comparing a teacher network inference result with the true label, and a global pooling function in which the patch embedding function and the CNN function are combined; and
performing machine learning for detecting a deepfake image based on a loss function output by the transformer.

The method of claim 1, further comprising:
extracting a face image from a video using a multitask cascaded convolutional network (MTCNN) model; and
setting a specific region of the face image for segmenting the image.

The method of claim 1,
wherein the patch embedding function is determined based on a patch, which is a segmented image, and the exponential number of the segmented patches, and the CNN function is determined based on the number of CNN features.

The method of claim 1,
wherein the global pooling function is determined by concatenating the patch embedding function and the CNN function channel by channel.

The method of claim 1,
wherein the loss function is determined based on a loss function for a fake image and a loss function for a real image, and
wherein the step of performing machine learning performs machine learning by setting a normalization weight with the loss function.

The method of claim 5,
wherein the loss function for the fake image is determined based on the odds ratio of the teacher network, the odds ratio of the class token, and the odds ratio of the distillation token for the fake image, and the loss function for the real image is determined based on the odds ratio of the teacher network, the odds ratio of the class token, and the odds ratio of the distillation token for the real image.

An electronic device, comprising:
an operation unit that combines a patch embedding function based on a segmented face image with a CNN function based on a convolutional neural network;
a transformer that receives as input a class token trained with a true label, a distillation token trained by comparing a teacher network inference result with the true label, and a global pooling function in which the patch embedding function and the CNN function are combined; and
a machine learning performing unit that performs machine learning for detecting a deepfake image based on a loss function output by the transformer.

The electronic device of claim 7,
further comprising a preprocessor that extracts a face image from a video using a multitask cascaded convolutional network (MTCNN) model and sets a specific region of the face image for segmenting the image.

The electronic device of claim 7,
wherein the patch embedding function is determined based on a patch, which is a segmented image, and the exponential number of the segmented patches, and the CNN function is determined based on the number of CNN features.

The electronic device of claim 7,
wherein the global pooling function is determined by concatenating the patch embedding function and the CNN function channel by channel.

The electronic device of claim 7,
wherein the loss function is determined based on a loss function for a fake image and a loss function for a real image, and the performing unit performs machine learning by setting a normalization weight with the loss function.

The electronic device of claim 11,
wherein the loss function for the fake image is determined based on the odds ratio of the teacher network, the odds ratio of the class token, and the odds ratio of the distillation token for the fake image, and the loss function for the real image is determined based on the odds ratio of the teacher network, the odds ratio of the class token, and the odds ratio of the distillation token for the real image.