KR20230100200A

KR20230100200A - Method and device for training machine learning model for person search using adaptive gradient propagation

Info

Publication number: KR20230100200A
Application number: KR1020210189891A
Authority: KR
Inventors: 심재영; 한병주; 고규현
Original assignee: 울산과학기술원
Priority date: 2021-12-28
Filing date: 2021-12-28
Publication date: 2023-07-05

Abstract

A device for training a machine learning model according to one embodiment may include a processor that obtains a region of interest (ROI) and feature data for the region of interest by applying a backbone network of the machine learning model to an image, obtains a target feature vector from the region of interest, and updates parameters of the machine learning model.

Description

Method and apparatus for training machine learning model for person search using adaptive gradient propagation

Hereinafter, a technique relating to a method and apparatus for training a machine learning model for person search is disclosed.

Technologies related to the present invention can be grouped into two categories. The first consists of two-step techniques, which train person detection and person re-identification separately. The second consists of end-to-end techniques, which train person detection and person re-identification jointly.

In a two-step technique, a person detection technique may first be used to locate people in an image. A person re-identification technique may then be used to identify who each located person is. The PRW benchmark data set used in the field of person search may be provided, and various experiments may be conducted in which the person detection technique and the person re-identification technique are trained separately and the person search results are analyzed. To train with the person separated from the background, a technique may be used in which the region corresponding to the person is segmented out so that the artificial intelligence model is trained with more focus on the person. Features extracted at multiple scales may be used for person search, and a mixed training data set may be built to resolve the task mismatch between the person detector and the person re-identifier.

End-to-end techniques perform person search by integrating person detection and person re-identification, and a person search model according to an end-to-end technique may also be trained with person detection and person re-identification integrated. A deep-learning-based network integrating person detection and person re-identification may be proposed, and the CUHK-SYSU benchmark data set used in the field of person search may be provided. Because the background can adversely affect the training of a person search model, a technique may be proposed that lets the artificial intelligence model extract features of the person rather than the background during training. Useful features for person search may be extracted by additionally introducing instance segmentation and keypoint detection, which predicts human joints. The task mismatch between person detection and person re-identification may be alleviated by decomposing the extracted features of people into magnitude and angle. A technique may be proposed for handling occluded persons, of whom only a part is visible, and another benchmark data set in which many occluded persons appear may be provided.

An end-to-end machine learning model may have the advantage that person detection and person re-identification are integrated and trained together. However, a conflict may arise between person detection, which looks for features common to all people, and person re-identification, which looks for features that distinguish one person from another. Conversely, a two-step machine learning model is designed and trained separately for person detection and person re-identification, so it may have the advantage that no such conflict occurs. Its drawback is that the two components must be trained separately.

In the present disclosure, a new end-to-end learning network for person search may be proposed. The machine learning model according to the present invention combines multiple tasks, namely person detection and person re-identification together with a part classifier for the person, thereby extracting representative features that are important for each person and preventing conflict between the person detection and person re-identification tasks.

Experimentally, the proposed technique may achieve higher performance than existing end-to-end techniques.

With improvements in GPU performance, deep-learning-based artificial intelligence has recently been applied in a variety of fields. In particular, in computer vision, which studies the visual intelligence of machines, person re-identification may be regarded as an important problem because it can be used in fields such as personal security and entertainment services. However, conventional person re-identification techniques may require data in which only a single person is captured in a controlled environment, or a restricted capture environment. To address this, person search, which finds a desired person in general images where many people appear in a single image and there are no constraints on the data or the capture environment, may attract attention. Unlike the images handled in person re-identification, person search deals with images such as CCTV footage in which many people appear in various poses and from various angles, so it may be regarded as a more challenging problem than conventional person re-identification. As surveillance cameras proliferate and large numbers of pedestrian images are produced every day, person search is highly useful for surveillance and tracking, and it may also be a core technology for visual entertainment services that can be combined with augmented reality on smartphones.

A method for training a machine learning model according to one embodiment may include: obtaining a region of interest (ROI) and feature data for the region of interest by applying a backbone network of the machine learning model to an image; obtaining, from the obtained feature data, a feature matrix representing part features of a plurality of partial regions of the region of interest; obtaining a possibility score indicating the likelihood that the region of interest is a region corresponding to a person by applying a person detection network of the machine learning model to the obtained feature matrix; obtaining a probability matrix containing the probability that each partial region belongs to each of the partition classes indicating the plurality of partial regions by applying a part classification network of the machine learning model to the obtained feature matrix; obtaining a target feature vector from the region of interest by applying a person re-identification network of the machine learning model to the obtained feature matrix; and updating parameters of the machine learning model using an objective function value calculated based on the possibility score, the probability matrix, and the target feature vector.

Updating the parameters of the machine learning model may include: calculating, based on the possibility score, a detection loss value of the person detection network for the region of interest; calculating, based on the probability matrix, a part classification loss value for the part classification network; and calculating, based on the target feature vector, a re-identification loss value for the person re-identification network.

Calculating the part classification loss value may include calculating the part classification loss value further based on the possibility score in addition to the probability matrix.

Calculating the part classification loss value further based on the possibility score may include calculating the part classification loss value using the product of a probability in the probability matrix and the possibility score as a corrected probability indicating the likelihood that a partial region belongs to a partition class.
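
As a minimal sketch of the corrected probability described above, the following Python computes a part classification loss from a probability matrix and a possibility score. The negative log-likelihood form, the function name `part_classification_loss`, the `part_labels` argument, and the `eps` stabilizer are assumptions made for illustration; the text states only that the product of the probability and the possibility score is used as the corrected probability.

```python
import math

def part_classification_loss(prob_matrix, part_labels, possibility_score, eps=1e-12):
    """Mean negative log of corrected probabilities (assumed loss form).

    prob_matrix[i][k]: probability that partial region i belongs to partition class k.
    part_labels[i]: index of the partition class taken as correct for partial region i.
    possibility_score: detection confidence of the region of interest, in [0, 1].
    """
    total = 0.0
    for row, label in zip(prob_matrix, part_labels):
        # Corrected probability: class probability scaled by the possibility score.
        corrected = possibility_score * row[label]
        total += -math.log(corrected + eps)
    return total / len(part_labels)

# Hypothetical example: two partial regions, two partition classes.
probs = [[0.5, 0.5], [0.25, 0.75]]
loss_confident = part_classification_loss(probs, [0, 1], possibility_score=1.0)
loss_uncertain = part_classification_loss(probs, [0, 1], possibility_score=0.5)
```

Under this assumed form, a lower possibility score lowers every corrected probability for the region of interest, so the same probability matrix yields a larger loss value.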

Calculating the re-identification loss value may include calculating the re-identification loss value further based on the possibility score in addition to the target feature vector.

Calculating the re-identification loss value further based on the possibility score may include calculating the re-identification loss value using, as a corrected similarity between the target feature vector and another feature vector, the product of the similarity between the two vectors and the possibility score.
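
The corrected similarity can be illustrated with a short Python sketch. Cosine similarity is an assumption here, since the text refers only to a "similarity" between feature vectors; the function names `cosine_similarity` and `corrected_similarity` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def corrected_similarity(target, other, possibility_score):
    """Similarity between two feature vectors scaled by the possibility score."""
    return possibility_score * cosine_similarity(target, other)

# Hypothetical example: identical vectors, region of interest detected with score 0.5.
sim = corrected_similarity([1.0, 0.0], [1.0, 0.0], possibility_score=0.5)
```

With this scaling, a region of interest that is unlikely to contain a person contributes a similarity close to zero regardless of how its feature vector compares with others.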

Calculating the re-identification loss value may include, in response to a ground-truth identity class indicating the identity of the person for the region of interest being present, calculating the re-identification loss value based on the difference between the similarity between the target feature vector and another feature vector and the similarity between the target feature vector and a representative feature vector of the ground-truth identity class.

Calculating the re-identification loss value may further include calculating a binary classification loss value for whether the region of interest is classified into the ground-truth identity class.

Calculating the re-identification loss value may include, in response to no ground-truth identity class being present for the region of interest, calculating the re-identification loss value based on the similarity between the target feature vector and the representative feature vector of each identity class.

Calculating the re-identification loss value based on the similarity between the target feature vector and the representative feature vector of each identity class may include: in response to an identity class being present for another region of interest of the image, calculating the similarity between the target feature vector of the other region of interest and the representative feature vector of the identity class of the other region of interest; and calculating the re-identification loss value based on the difference between (i) the similarity between the target feature vector of the region of interest and the representative feature vector of each identity class and (ii) the similarity between the target feature vector of the other region of interest and the representative feature vector of the identity class of the other region of interest.

Updating the parameters of the machine learning model using the calculated objective function value may include: in response to a plurality of regions of interest being obtained from the image, calculating an objective function value for the image based on the objective function values calculated for the plurality of regions of interest; and updating the parameters of the machine learning model using the objective function value calculated for the image.
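
The aggregation of per-region objective values into one image-level value might look like the sketch below. Both the weighted sum of the three loss values per region of interest and the mean over regions are assumptions, since the text says only that the image-level value is "based on" the per-region values; `roi_objective`, `image_objective`, and `weights` are hypothetical names.

```python
def roi_objective(detection_loss, part_loss, reid_loss, weights=(1.0, 1.0, 1.0)):
    """Objective value for one region of interest (assumed: weighted sum of losses)."""
    w_det, w_part, w_reid = weights
    return w_det * detection_loss + w_part * part_loss + w_reid * reid_loss

def image_objective(roi_losses):
    """Objective value for an image (assumed: mean over its regions of interest).

    roi_losses: list of (detection_loss, part_loss, reid_loss) tuples, one per region.
    """
    values = [roi_objective(*losses) for losses in roi_losses]
    return sum(values) / len(values)

# Hypothetical loss values for an image with two regions of interest.
image_value = image_objective([(0.2, 0.3, 0.5), (1.0, 1.0, 1.0)])
```

The image-level value is what an optimizer would then differentiate to update the model parameters.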

An apparatus for training a machine learning model according to one embodiment may include a processor that: obtains a region of interest (ROI) and feature data for the region of interest by applying a backbone network of the machine learning model to an image; obtains, from the obtained feature data, a feature matrix representing part features of a plurality of partial regions of the region of interest; obtains a possibility score indicating the likelihood that the region of interest is a region corresponding to a person by applying a person detection network of the machine learning model to the obtained feature matrix; obtains a probability matrix containing the probability that each partial region belongs to each of the partition classes indicating the plurality of partial regions by applying a part classification network of the machine learning model to the obtained feature matrix; obtains a target feature vector from the region of interest by applying a person re-identification network of the machine learning model to the obtained feature matrix; and updates parameters of the machine learning model using an objective function value calculated based on the possibility score, the probability matrix, and the target feature vector.

The processor may calculate a detection loss value of the person detection network for the region of interest based on a comparison between the possibility score and a ground-truth possibility score, calculate a part classification loss value for the part classification network based on the probability matrix, and calculate a re-identification loss value for the person re-identification network based on the target feature vector.

The processor may calculate the part classification loss value further based on the possibility score in addition to the probability matrix.

The processor may calculate the part classification loss value using the product of a probability in the probability matrix and the possibility score as a corrected probability indicating the likelihood that a partial region belongs to a partition class.

The processor may calculate the re-identification loss value further based on the possibility score in addition to the target feature vector.

The processor may calculate the re-identification loss value using, as a corrected similarity between the target feature vector and another feature vector, the product of the similarity between the two vectors and the possibility score.

In response to a ground-truth identity class indicating the identity of the person for the region of interest being present, the processor may calculate the re-identification loss value based on the difference between the similarity between the target feature vector and another feature vector and the similarity between the target feature vector and a representative feature vector of the ground-truth identity class.

The processor may calculate a binary classification loss value for whether the region of interest is classified into the ground-truth identity class.

In response to no ground-truth identity class being present for the region of interest, the processor may calculate the re-identification loss value based on the similarity between the target feature vector and the representative feature vector of each identity class.

In response to an identity class being present for another region of interest of the image, the processor may calculate the similarity between the target feature vector of the other region of interest and the representative feature vector of the identity class of the other region of interest, and may calculate the re-identification loss value based on the difference between (i) the similarity between the target feature vector of the region of interest and the representative feature vector of each identity class and (ii) the similarity between the target feature vector of the other region of interest and the representative feature vector of the identity class of the other region of interest.

In response to a plurality of regions of interest being obtained from the image, the processor may calculate an objective function value for the image based on the objective function values calculated for the plurality of regions of interest, and may update the parameters of the machine learning model using the objective function value calculated for the image.

The present invention may be designed as an end-to-end technique in view of the unnecessary time consumption and inefficiency that can arise from training a person search model in two steps. Two-step techniques generally tend to achieve higher performance than end-to-end techniques, but the person search model according to the present invention, despite being an end-to-end technique, may achieve higher performance than two-step techniques.

FIG. 1 shows a person search method using a machine learning model according to one embodiment.
FIG. 2 shows training of machine learning models for person search according to a comparative example and one embodiment.
FIG. 3 shows training of a machine learning model for person search according to one embodiment.
FIG. 4 shows partial regions into which a region of interest is divided, and part classification, according to one embodiment.
FIG. 5 shows calculation of the objective function of the machine learning model according to the confidence of person detection.
FIG. 6 shows benchmark data sets for a machine learning model for person search according to one embodiment.
FIG. 7 shows a qualitative comparison of the performance of machine learning models according to a comparative example and one embodiment.

Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only and may be modified and implemented in various forms. Accordingly, actual implementations are not limited to the specific embodiments disclosed, and the scope of this specification includes modifications, equivalents, and substitutes that fall within the technical idea described by the embodiments.

Terms such as "first" and "second" may be used to describe various components, but such terms should be interpreted only as distinguishing one component from another. For example, a first component may be referred to as a second component, and similarly a second component may be referred to as a first component.

When a component is described as being "connected" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may also be present.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to specify the presence of the described features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood as not precluding the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the relevant art. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art and, unless explicitly defined in this specification, are not to be interpreted in an idealized or overly formal sense.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. In the description with reference to the drawings, the same components are given the same reference numerals regardless of the figure number, and redundant descriptions thereof are omitted.

Person search may refer to performing together person detection, which extracts a region of interest (ROI) corresponding to a person from an image, and person re-identification for the extracted region of interest. The image may be an image in which a plurality of people are captured together and may include a plurality of regions of interest for those people. However, the image is not limited thereto and may be an image in which one person is captured and which includes a single region of interest.

Person detection may refer to an operation of extracting, from an image, a region of interest that includes a region corresponding to a person. A plurality of regions of interest may be extracted from one image. A region of interest may be extracted in a predefined shape, for example a rectangle. However, the shape is not limited thereto, and a region of interest may be extracted in another shape (for example, an ellipse) or in a shape determined for each region of interest.

Person re-identification may refer to an operation of determining whether a detected region of interest corresponds to a query person. Whether a region of interest corresponds to the query person may be determined based on the similarity between feature vectors. The feature vector for the query person may be received from an external device. As described later, the target feature vector of a region of interest extracted from an image may be obtained through the machine learning model. When the similarity between the feature vector of the query person and the target feature vector of a region of interest extracted from the image is greater than or equal to a threshold, that region of interest may be determined to be a region of interest for the query person.
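
The threshold test described above can be sketched in Python as follows. Cosine similarity and the threshold value are assumptions, since the text specifies neither the similarity measure nor the threshold; `cosine_similarity` and `find_query_person` are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def find_query_person(query_vector, target_vectors, threshold=0.7):
    """Return indices of regions of interest whose similarity to the query
    feature vector is greater than or equal to the threshold."""
    return [
        index
        for index, target in enumerate(target_vectors)
        if cosine_similarity(query_vector, target) >= threshold
    ]

# Hypothetical target feature vectors for three regions of interest.
matches = find_query_person([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

In this example the first and third regions of interest pass the threshold, with similarities of 1.0 and about 0.707, while the second, with similarity 0, does not.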

FIG. 1 shows a person search method using a machine learning model according to one embodiment.

A person search apparatus may refer to an apparatus that performs person search on an image. The person search apparatus may use a machine learning model.

The machine learning model may output a plurality of regions of interest when applied to the image 110. The regions of interest may be extracted in a predetermined number (for example, N in FIG. 1). The machine learning model may further output a target feature vector and a possibility score for each of the extracted regions of interest. For example, the machine learning model may be implemented based on the Faster R-CNN backbone network, the person detection network, the part classification network, and the person re-identification network of FIG. 3. The structure of the machine learning model according to one embodiment is described in detail with reference to FIG. 3.

The plurality of regions of interest are partial regions of the image, and may include regions of interest corresponding to a person, regions of interest corresponding to the background, and regions that partially overlap a region corresponding to a person. The background may refer to the part of the image other than the regions corresponding to persons. In addition, the plurality of regions of interest may include a plurality of mutually overlapping regions of interest for a single person.

The possibility score is an evaluation index for person detection and may indicate the confidence in the extracted region of interest. The possibility score may indicate the possibility that a region corresponding to a person is included in the region of interest. For example, the possibility score of a first region of interest that includes a region corresponding to a person may have a value close to 1. As another example, the possibility score of a second region of interest that includes a region corresponding to the background may have a value close to 0.

The person search device may determine at least one of the obtained regions of interest as a candidate region of interest 121 based on the possibility score. When the possibility score of a region is greater than or equal to a threshold, the person search device may determine that region as a candidate region of interest 121 estimated to correspond to a person. However, as shown in the image 120 of FIG. 1, a plurality of mutually overlapping regions of interest may exist for one person, so it may be necessary to select one of the overlapping candidate regions of interest.

When candidate regions of interest overlap each other, the person search device may determine one of the overlapping candidate regions of interest as the estimated region of interest 131. For example, when the intersection over union (IoU) between candidate regions of interest is greater than or equal to a threshold, the candidate region of interest having the maximum possibility score may be selected as the estimated region of interest.
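The two selection steps above (score thresholding, then keeping the highest-scoring box among overlapping candidates) can be sketched as a greedy non-maximum suppression. The box format `(x1, y1, x2, y2)`, the function names, and both threshold defaults are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_estimated_rois(boxes, scores, score_thr=0.5, iou_thr=0.4):
    """Return indices of estimated regions of interest.

    1. Keep candidates whose possibility score is >= score_thr.
    2. Visit candidates in descending score order and drop any candidate
       whose IoU with an already kept box is >= iou_thr.
    Both thresholds are hypothetical values.
    """
    candidates = [i for i, s in enumerate(scores) if s >= score_thr]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        if all(iou(boxes[i], boxes[k]) < iou_thr for k in kept):
            kept.append(i)
    return kept
```

For example, two heavily overlapping boxes around the same person collapse to the one with the higher score, and a low-scoring background box is discarded before the overlap test.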

The person search device may determine whether the estimated region of interest is a region of the query person based on the similarity between the target feature vector of the estimated region of interest and the feature vector of the query person.

Hereinafter, a method of training an end-to-end machine learning model for person search according to an embodiment is described in detail with reference to FIGS. 2 to 4.

FIG. 2 shows training of a machine learning model for person search according to a comparative embodiment and according to an embodiment.

The machine learning model 210 according to the comparative embodiment may learn person detection and person re-identification together. Person detection may require finding features that persons have in common. Person re-identification may require finding features that distinguish one person from another. Therefore, in the end-to-end machine learning model 210 according to the comparative embodiment, a conflict may arise between the person detection task and the person re-identification task.

In the machine learning model 220 according to an embodiment, a technique for reducing the task mismatch between person detection and person re-identification may be applied. As shown in FIG. 2, the machine learning model 220 may reflect the result of person detection (for example, the confidence of person detection) in person re-identification. In addition, to extract features of parts of a person, the machine learning model 220 may be trained using part classification, which extracts features from partial regions obtained by dividing the region corresponding to a person. The machine learning model 220 may also reflect the result of person detection (for example, the confidence of person detection) in part classification. Experimentally, the machine learning model 220 according to an embodiment may achieve better performance than the machine learning model 210 according to the comparative embodiment.

FIG. 3 shows training of a machine learning model for person search according to an embodiment.

The training device 300 is a device for training a machine learning model for person search and may include a processor 301.

The machine learning model for person search according to an embodiment may include a backbone network, a person detection network, a part classification network, and a person re-identification network.

In step 310, the processor 301 may extract a region of interest, feature data of the region of interest, and deeper feature data of the region of interest by applying the backbone network to the image. The backbone network may include a Faster R-CNN and a proposal head network. The Faster R-CNN may extract, from the image, a region of interest (for example, the i-th region of interest among a plurality of regions of interest of the image) and feature data of the region of interest (for example, the feature data of the i-th region of interest). The proposal head network may extract feature data that is deeper than the feature data extracted by the Faster R-CNN (for example, feature data of the i-th region of interest having a larger number of channels). As a result, the machine learning model may use both the first feature data and the second feature data of the region of interest, based on multi-scale features.

In step 320, the processor 301 may obtain a feature matrix from the feature data obtained from the backbone network. The feature matrix of the i-th region of interest may be obtained by applying partial adaptive max pooling to the feature data. Another feature matrix of the i-th region of interest may be obtained by applying partial adaptive max pooling to the deeper feature data. The feature matrix may include part features of the region of interest. A part feature may represent the feature data of a partial region obtained by dividing the region of interest. For example, the j-th row of the feature matrix of the i-th region of interest may represent the feature data of the j-th partition of the i-th region of interest.
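One way to read "partial adaptive max pooling" is sketched below: the feature data of a region of interest is split into horizontal stripes and each stripe is max-pooled per channel, so that row j of the result describes partition j. This interpretation, the `[channel][row][column]` nested-list layout, and the stripe boundary rule are assumptions; the description does not specify them.

```python
import math


def partial_adaptive_max_pool(feature, num_parts):
    """Pool a [C][H][W] feature map into a num_parts x C feature matrix.

    Row j holds, for every channel, the maximum activation inside the j-th
    horizontal stripe of the region of interest. Stripe boundaries follow
    the usual adaptive-pooling rule (floor for the start, ceil for the end),
    which is an assumed choice.
    """
    channels = len(feature)
    height = len(feature[0])
    matrix = []
    for j in range(num_parts):
        top = (j * height) // num_parts
        bottom = math.ceil((j + 1) * height / num_parts)
        row = [
            max(max(feature[c][y]) for y in range(top, bottom))
            for c in range(channels)
        ]
        matrix.append(row)
    return matrix
```

With a two-channel map of height 4 and `num_parts=2`, the upper two rows feed the first row of the matrix and the lower two rows feed the second.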

In steps 330, 340, and 350, the processor 301 may apply the person detection network, the part classification network, and the person re-identification network to the obtained feature matrix.

In step 330, the processor 301 may obtain a possibility score by applying the person detection network to the feature matrix. The person detection network may include global adaptive max pooling, an encoder, batch normalization (BN), concatenation (shown as C in FIG. 3), and a foreground and background classifier (shown as FG/BG classifier in FIG. 3). The possibility score may indicate the possibility that the region of interest is a region corresponding to a person. In other words, the possibility score may indicate the person detection confidence that the region of interest obtained by applying the backbone network corresponds to a person.

For reference, a classification loss (shown as FG/BG cls. loss in FIG. 3) and a regression loss (shown as Box reg. loss in FIG. 3) may be calculated for the person detection network. The objective function of the machine learning model may include the classification loss and the regression loss.

In step 340, the processor 301 may obtain a probability matrix by applying the part classification network to the feature matrix. The probability matrix may include the probability that a partial region of the region of interest belongs to a partition class. Partial regions and partition classes are described with reference to FIG. 4. The processor 301 may calculate a part classification loss value (shown as Part cls. loss in FIG. 3) based on the probability matrix. The calculation of the part classification loss value is described with reference to FIG. 5.

In step 350, the processor 301 may obtain a target feature vector by applying the person re-identification network to the feature matrix. The target feature vector is a vector representing the features of the region of interest and may be compared with the feature vector of the query person. Because it is obtained from a feature matrix that holds the feature data of each partial region, the target feature vector may represent the features of all partial regions rather than being biased toward one part of the region of interest (for example, a single partial region).

In step 360, the processor 301 may update the parameters of the machine learning model using the objective function value. The objective function may include the classification loss value (shown as FG/BG cls. loss in FIG. 3) and the regression loss value (shown as Box reg. loss in FIG. 3) calculated for the person detection network, the part classification loss value calculated for the part classification network, and the identification loss value calculated for the person re-identification network. The calculation of the objective function of the machine learning model is described later with reference to FIG. 5.

FIG. 4 shows partial regions obtained by dividing a region of interest, and part classification, according to an embodiment.

The partial regions 410 may be obtained by dividing the region of interest. For example, a partial region may be one of a plurality of partitions obtained by dividing the region of interest horizontally.

A partition class may indicate the position that a partial region occupies within a region of interest corresponding to a person (for example, the first partial region of the region of interest). For reference, a partition class may indicate a specific body part such as the head or the legs, but is not limited thereto, and may indicate a portion of the body that is difficult to define as a single body part.

The partition classes may include a dummy class indicating a partial region of a region of interest corresponding to the background. For example, when a region of interest is divided into a predefined number N_p of partial regions, there may be N_p + 1 partition classes: N_p partition classes for the partial regions of a region of interest corresponding to a person, and one partition class for the partial regions of a region of interest corresponding to the background.
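The labelling scheme with N_p + 1 partition classes can be illustrated as follows. The zero-based class indices and the choice of the last index for the dummy class are assumptions made for this sketch.

```python
def partition_labels(is_person, num_parts):
    """Ground-truth partition class for each of the num_parts partial regions.

    For a region of interest corresponding to a person, the j-th partial
    region gets class j. For a background region of interest, every partial
    region gets the dummy class, assumed here to be index num_parts.
    """
    dummy_class = num_parts
    if is_person:
        return list(range(num_parts))
    return [dummy_class] * num_parts
```

With three partitions, a person region is labelled `[0, 1, 2]` and a background region `[3, 3, 3]`, giving four classes in total.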

The probability matrix obtained by applying the part classification network may include the probability that a partial region belongs to a partition class. The element in row j and column k of the probability matrix for the i-th region of interest of the image may represent the probability that the j-th partial region of the i-th region of interest belongs to the k-th partition class.

FIG. 5 shows the calculation of the objective function of the machine learning model according to the confidence of person detection.

Cross-entropy loss may be used in training a machine learning model for classification. Given a score vector for the i-th region of interest whose j-th element is the score that the i-th region of interest belongs to the j-th class, the cross-entropy loss with softmax probabilities can be defined as follows:

The definition uses the ground-truth class index of the i-th region of interest and a temperature coefficient that controls the sensitivity of the loss function.

The cross-entropy loss can also be expressed as follows:

This form shows that the cross-entropy loss depends on the scores of the other classes relative to the score of the ground-truth class. The loss function may take a high value when the scores predicting that the region of interest belongs to other classes are relatively higher than the score of the ground-truth class.
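The two forms of the loss discussed above can be checked numerically. The sketch below assumes the standard temperature-scaled softmax cross-entropy and its rewriting in terms of score differences; the function and argument names are hypothetical, and the equations of the original publication are not reproduced in this text.

```python
import math


def softmax_cross_entropy(scores, true_idx, tau):
    """-log softmax(scores / tau)[true_idx], computed with a stable log-sum-exp."""
    logits = [s / tau for s in scores]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[true_idx]


def relative_cross_entropy(scores, true_idx, tau):
    """Same loss written with scores relative to the ground-truth class:
    log(1 + sum over other classes of exp((s_j - s_true) / tau))."""
    s_true = scores[true_idx]
    total = sum(
        math.exp((s - s_true) / tau)
        for j, s in enumerate(scores)
        if j != true_idx
    )
    return math.log(1.0 + total)
```

Both functions return the same value for the same inputs, and the loss grows when another class scores higher than the ground-truth class.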

During training, the parameters of the machine learning model may be updated by back-propagated gradients based on the following chain rule:

The chain rule is expressed in terms of the parameters of the machine learning model.

When the derivative of the cross-entropy-based loss function with respect to a score is treated as a separate term, that term can be regarded as a weighting function applied to the gradient of the score with respect to the parameters of the machine learning model. To interpret the behavior of the weighting function, it can be derived as a softmax-like function of the relative scores as follows:
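The derivation described above can be written out under assumed notation, since the symbols of the original equations are not reproduced in this text: $s_{i,j}$ is the score of class $j$ for the i-th region of interest, $c_i$ is the ground-truth class index, $\tau$ is the temperature coefficient, $\theta$ denotes the model parameters, and $w_{i,j}$ names the weighting function. This is the standard softmax cross-entropy result, offered as a sketch rather than as the exact equations of the publication.

```latex
\begin{align}
\mathcal{L}_i
  &= -\log \frac{\exp(s_{i,c_i}/\tau)}{\sum_{j} \exp(s_{i,j}/\tau)}
   = \log\Bigl(1 + \sum_{j \neq c_i} \exp\bigl((s_{i,j} - s_{i,c_i})/\tau\bigr)\Bigr), \\
\frac{\partial \mathcal{L}_i}{\partial \theta}
  &= \sum_{j} \frac{\partial \mathcal{L}_i}{\partial s_{i,j}}\,
     \frac{\partial s_{i,j}}{\partial \theta}, \\
w_{i,j} := \frac{\partial \mathcal{L}_i}{\partial s_{i,j}}
  &= \frac{1}{\tau}\,
     \frac{\exp\bigl((s_{i,j} - s_{i,c_i})/\tau\bigr)}
          {1 + \sum_{k \neq c_i} \exp\bigl((s_{i,k} - s_{i,c_i})/\tau\bigr)},
  \qquad j \neq c_i .
\end{align}
```

In this form the weight is bounded above by $1/\tau$ and tends to 0 when the ground-truth score dominates, which matches the behavior the description attributes to graph 510.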

Graphs 510 and 520 of FIG. 5 may show the shapes of the standard weighting function and the adaptive gradient weighting function for a simple binary classification task in which the ground-truth class is 1 and the other class is 0. With the probability of the ground-truth class fixed at 0.5 and the temperature coefficient fixed at 0.3, graphs 510 and 520 may plot the standard weighting function as the probability of the other class changes, and the adaptive gradient weighting function for various confidence values.

Taking the weighting function above as the standard weighting function that controls the flow of gradients when training the machine learning model, the following adaptive gradient weighting function (AGWF) may be used to adaptively train the person re-identification network and the part classification network:

The adaptive gradient weighting function uses the confidence of person detection for the i-th region of interest, which may represent the probability that the i-th region of interest includes a region corresponding to a person (for example, the possibility score).
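The effect of the detection confidence on the weighting function can be illustrated for the binary case. The sketch assumes that the scores are multiplied by the confidence before the loss is evaluated, as the description states for the calibrated similarities, and differentiates the resulting loss; the exact equation of the publication is not reproduced in this text, and all names are hypothetical.

```python
import math


def standard_weight(s_other, s_true, tau):
    """d/ds_other of log(1 + exp((s_other - s_true) / tau))."""
    z = math.exp((s_other - s_true) / tau)
    return (1.0 / tau) * z / (1.0 + z)


def adaptive_weight(s_other, s_true, tau, confidence):
    """d/ds_other of log(1 + exp(confidence * (s_other - s_true) / tau)).

    Assumed form: the detection confidence scales the relative score, which
    both flattens the curve and caps its output at confidence / tau.
    """
    z = math.exp(confidence * (s_other - s_true) / tau)
    return (confidence / tau) * z / (1.0 + z)
```

Using the setting of FIG. 5 (`s_true = 0.5`, `tau = 0.3`), a confidence of 1 reproduces the standard weight, while lower confidences give a smaller output range and a gentler slope.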

In graph 510, when the probability of the other class is relatively larger than the probability of the ground-truth class, the gradient is propagated at its maximum with a large weight close to the reciprocal of the temperature coefficient, so the parameters of the machine learning model may deviate significantly. In contrast, when the probability of the other class is smaller than the probability of the ground-truth class, a small weight close to 0 may suppress gradient back-propagation.

Graph 520 may show the shapes of the adaptive gradient weighting function for various confidence values. Compared with graph 510 of the standard weighting function, both the sensitivity to the input and the output range may change. As the confidence decreases, the output range of the adaptive gradient weighting function may shrink and its shape may become gentler. The adaptive gradient weighting function may implicitly reflect not only the quality of person detection but also the training state of the shared feature maps. For example, when the network is in an early phase of training and a significant portion of the detected region of interest corresponds to the background, the confidence may be low, and the adaptive gradient weighting function may suppress back-propagation from the identification loss value while becoming insensitive to the similarities between the feature vector of the region of interest and the identity classes.

In person search, person re-identification is performed on the regions of interest obtained by person detection, so the quality of person detection may strongly affect the quality of person re-identification. In particular, in an end-to-end machine learning model, when the feature map shared by person detection and person re-identification is overfitted to person detection, the representational quality of the feature map for person re-identification may be degraded. Training of the machine learning model according to an embodiment may include adaptively training the person re-identification network and the part classification network according to the confidence of person detection (for example, the possibility score obtained by the person detection network). Hereinafter, the calculation of the objective function for this adaptive training is described.

The objective function of the machine learning model may include a detection loss value, a re-identification loss value (identification loss value), and a part classification loss value.

The detection loss value is the loss for the person detection network and, as described above with reference to FIG. 3, may include a classification loss and a regression loss. The classification loss may represent the loss for classifying whether or not the region of interest corresponds to a person. The regression loss may represent the loss for regressing the boundary (for example, the bounding box) of the region of interest. The detection loss value may be calculated based on the possibility score.

The identification loss value is the loss for the person re-identification network and may be calculated based on the target feature vector of the region of interest. Through the adaptive gradient weighting function, the identification loss value may be calculated based on the confidence of person detection for the i-th region of interest. In other words, the identification loss value may be calculated based on the possibility score in addition to the target feature vector.

The identification loss value may be designed to train the machine learning model so that the similarity between the target feature vector of the region of interest and the representative feature vector of the ground-truth identity class is higher than the similarity between the target feature vector and the feature vectors of other identity classes. An identity class may be a class indicating the identity of the person in a region of interest. For training of the machine learning model, the representative feature vector of an identity class may be obtained from an external device or from the memory of the training device. The description so far mainly covers the case where the region of interest has a ground-truth identity class, but, as described later, the identification loss value may also be calculated when the region of interest has no ground-truth identity class.

The identification loss value may use the similarity between the target feature vector and other feature vectors. The similarity between the target feature vector and another feature vector, multiplied by the possibility score, may be used as a calibrated similarity.

When the identity class is known, the other feature vectors may include the representative feature vector mapped to that identity class. Representative feature vectors mapped to identity classes may be stored in a lookup table. The other feature vectors may also include feature vectors whose identity class is unknown (unlabeled). Unlabeled feature vectors may be stored in a circular queue.
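The two storage structures named above can be sketched with standard-library containers. The class name, method names, and the plain overwrite used to set a representative vector are assumptions; the description does not say how representative vectors are updated.

```python
from collections import deque


class FeatureMemory:
    """Lookup table for labeled identities plus a circular queue for unlabeled ones."""

    def __init__(self, num_identities, dim, queue_size):
        # One representative feature vector per known identity class.
        self.lookup = [[0.0] * dim for _ in range(num_identities)]
        # deque(maxlen=...) discards the oldest entry when full,
        # which gives circular-queue behavior.
        self.queue = deque(maxlen=queue_size)

    def set_representative(self, identity, feature):
        """Store the representative vector of an identity class (assumed: overwrite)."""
        self.lookup[identity] = list(feature)

    def push_unlabeled(self, feature):
        """Add the feature vector of a person whose identity class is unknown."""
        self.queue.append(list(feature))
```

Once the queue is full, each new unlabeled vector replaces the oldest one, so the memory footprint stays fixed during training.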

The identification loss value may be calculated for a region of interest, and the region of interest may or may not have a ground-truth identity class.

When the region of interest has a ground-truth identity class, the similarity between the target feature vector of the region of interest obtained from the person re-identification network and the representative feature vector of the ground-truth identity class may be used as the reference similarity. The identification loss value may be calculated based on the difference between the reference similarity and the similarity between the target feature vector and each of the other feature vectors.

The identification loss value that is further based on the possibility score may use, as a calibrated reference similarity, the reference similarity (for example, the similarity between the target feature vector and the representative feature vector of the ground-truth identity class) multiplied by the possibility score. Likewise, it may use, as a calibrated similarity, the similarity between the target feature vector and another feature vector multiplied by the possibility score. For example, the identification loss value for the i-th region of interest having a ground-truth identity class may be calculated as follows:

The calculation uses the index set of the other feature vectors whose identity classes are known (for example, the other feature vectors stored in the lookup table), the index set of the other feature vectors whose identity classes are unknown (for example, the other feature vectors stored in the circular queue), and a temperature coefficient.
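The loss described above can be sketched from its verbal definition: every similarity is multiplied by the possibility score, and the loss depends on the differences between the calibrated similarities of the other feature vectors and the calibrated reference similarity. The log-sum form, the exclusion of the ground-truth class from the "other" sets, and all names are assumptions, since the equation itself is not reproduced in this text.

```python
import math


def reid_loss_labeled(sim_true, sims_lookup, sims_queue, possibility, tau):
    """Identification loss for a region of interest with a ground-truth identity.

    sim_true     -- reference similarity to the ground-truth representative vector
    sims_lookup  -- similarities to other labeled vectors (lookup table)
    sims_queue   -- similarities to unlabeled vectors (circular queue)
    possibility  -- possibility score of the region of interest
    tau          -- temperature coefficient
    """
    others = list(sims_lookup) + list(sims_queue)
    total = sum(math.exp(possibility * (s - sim_true) / tau) for s in others)
    return math.log(1.0 + total)
```

Under these assumptions, a possibility score of 0 makes the loss independent of the similarities, so a region judged to be background contributes no identification gradient, while a score of 1 recovers the usual temperature-scaled cross-entropy.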

The identification loss value of a region of interest having a ground-truth identity class depends on relative similarities, so the reference similarity (for example, the similarity between the target feature vector and the representative feature vector of the ground-truth identity class) may saturate before it reaches a high score on its own. To prevent the reference similarity from saturating before it reaches a high score, the following binary classification loss value may be used:

The constant in this loss may be set to the maximum similarity of 1, and the loss also uses a temperature coefficient.

When the region of interest has no ground-truth identity class, the similarity between the target feature vector of another region of interest and the representative feature vector of the identity class of that other region of interest may be used as the reference similarity. The other region of interest is a region of interest corresponding to a person and may be obtained from an image belonging to the same mini-batch as the image from which the region of interest was obtained. When a plurality of other regions of interest having identity classes are obtained from the images of the mini-batch, the average of the similarities between the target feature vector of each other region of interest and the representative feature vector of its identity class may be used as the reference similarity.

For reference, a mini-batch may denote the unit over which the objective function used to update the parameters of the machine learning model is calculated. A mini-batch may include one or more images. The objective function may be calculated for the regions of interest obtained from one mini-batch, and the parameters of the machine learning model may be updated based on the objective function calculated for that mini-batch.

The identification loss value may be calculated based on the difference between the reference similarity and the similarity between the target feature vector of the region of interest and another feature vector. When the identity class is known, the other feature vector may include a representative feature vector mapped to the identity class. Representative feature vectors mapped to identity classes may be stored in a lookup table.
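The lookup table and the averaged reference similarity described in the preceding paragraphs can be sketched as follows. This is a minimal illustration, not the patented implementation: the use of cosine similarity, the function names, and the `(feature, identity)` data layout are assumptions made here for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def reference_similarity(labeled_rois, lookup_table):
    """Average similarity between each labeled ROI's target feature vector
    and the representative feature vector of its own identity class.

    labeled_rois: list of (target_feature_vector, identity_class) pairs from
                  one mini-batch (hypothetical layout); must not be empty.
    lookup_table: dict mapping identity class -> representative feature vector.
    """
    sims = [cosine_similarity(feat, lookup_table[cls]) for feat, cls in labeled_rois]
    return sum(sims) / len(sims)

# Toy lookup table with two identity classes and two labeled ROIs.
lookup_table = {0: [1.0, 0.0], 1: [0.0, 1.0]}
labeled_rois = [([1.0, 0.0], 0), ([1.0, 1.0], 1)]
ref = reference_similarity(labeled_rois, lookup_table)
```

In this toy batch the first region matches its class vector exactly and the second is 45 degrees away from its class vector, so the reference similarity is the mean of the two values.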

An identification loss value that is further based on the likelihood score may use, as a corrected reference similarity, the reference similarity (for example, the similarity between the target feature vector of another region of interest and the representative feature vector of that region's identity class) multiplied by the likelihood score. It may likewise use, as a corrected similarity, the similarity between the target feature vector and another feature vector multiplied by the likelihood score. For example, the identification loss value for the i-th region of interest, for which no ground-truth identity class exists, may be calculated according to the following equation:

Here, the equation relies on an auxiliary definition and on the index set of the regions of interest in the mini-batch for which an identity class exists.

The identification loss value may be calculated for an image. When a plurality of regions of interest are obtained from the image, the identification loss value for the image may be calculated from the identification loss values calculated for the individual regions of interest, for example by averaging them. Specifically, the identification loss value for the image may be calculated by adding the average of the identification loss values of the regions of interest that have an identity class to the average of the identification loss values of the regions of interest that do not. The identification loss value for the image may be defined by the following equation:

Here, the equation uses the index set of the regions of interest in the mini-batch for which an identity class exists, the index set of the regions of interest in the mini-batch for which no identity class exists, and a coefficient that may be experimentally set to 0.1.
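The aggregation described above (average over regions with an identity class plus average over regions without one) can be sketched as below. The function name and the `unlabeled_weight` parameter are hypothetical; in particular, the text does not make clear which term the 0.1 coefficient applies to, so applying it to the unlabeled term in the usage line is an assumption for illustration only.

```python
def mean(values):
    """Arithmetic mean; returns 0.0 for an empty list so a missing group adds nothing."""
    return sum(values) / len(values) if values else 0.0

def image_identification_loss(roi_losses, has_identity, unlabeled_weight=1.0):
    """Combine per-ROI identification losses into one value.

    roi_losses: per-ROI identification loss values.
    has_identity: parallel list of booleans, True if the ROI has an identity class.
    unlabeled_weight: hypothetical weight on the unlabeled average (assumption).
    """
    labeled = [loss for loss, flag in zip(roi_losses, has_identity) if flag]
    unlabeled = [loss for loss, flag in zip(roi_losses, has_identity) if not flag]
    return mean(labeled) + unlabeled_weight * mean(unlabeled)

# Two labeled ROIs and one unlabeled ROI.
loss = image_identification_loss([0.2, 0.4, 1.0], [True, True, False], unlabeled_weight=0.1)
```

With these toy values the labeled average is 0.3 and the weighted unlabeled average is 0.1, so the combined value is about 0.4.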

As a result, when the result of person detection is reliable (for example, when the weight score is large), the adaptive gradient weight function may increase the contribution of the identification loss value to the update of the parameters of the machine learning model. As shown in FIG. 3, the skip connection between the person detection network and the person identification network may represent the application of the adaptive gradient weight function to the identification loss value.
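Why a larger likelihood score increases the contribution of the identification loss can be seen from the chain rule. As a worked sketch, let $s_i$ be the likelihood score of the i-th region of interest, $x_i$ a similarity, and $\tilde{x}_i = s_i x_i$ the corrected similarity; these symbols are chosen here for illustration and are not taken from the original equations.

```latex
\tilde{x}_i = s_i\,x_i
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x_i}
  = \frac{\partial \mathcal{L}}{\partial \tilde{x}_i}\,
    \frac{\partial \tilde{x}_i}{\partial x_i}
  = s_i\,\frac{\partial \mathcal{L}}{\partial \tilde{x}_i},
\qquad
\frac{\partial \mathcal{L}}{\partial s_i}
  = x_i\,\frac{\partial \mathcal{L}}{\partial \tilde{x}_i}
```

Under this reading, the gradient that reaches the identification features is scaled by $s_i$, so regions that the detector scores highly contribute more, while the second derivative suggests how the identification loss could also propagate to the detection branch through the score.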

The partial classification loss value is the loss for the partial classification network and may be calculated based on the probability matrix. It may be designed to train the machine learning model so that the probability that a partial region belongs to its ground-truth partition class is greater than the probability that it belongs to any other partition class. The partial classification loss value may be calculated based on the difference between a reference probability and the probability that the partial region belongs to another partition class. The reference probability may denote the probability that the partial region belongs to the ground-truth partition class.

The partial classification loss value may be calculated further based on the likelihood score, together with the probability matrix. In that case it may use, as a corrected reference probability, the reference probability (for example, the probability that the partial region belongs to the ground-truth partition class) multiplied by the likelihood score, and, as a corrected probability, the probability that the partial region belongs to another partition class multiplied by the likelihood score. The partial classification loss value may use a cross-entropy-based classification loss. For example, the partial classification loss value based on the probability matrix and the likelihood score may be calculated as follows:

Here, the equation uses the element in row j, column k of the probability matrix of the i-th region of interest (for example, the probability that the j-th partial region of the i-th region of interest belongs to the k-th partition class), the set of regions of interest obtained from the mini-batch, a temperature coefficient, the total number of partitions in the mini-batch, and the number of partial regions per region of interest.
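One way to read the description above is as a softmax cross-entropy in which each probability is first multiplied by the likelihood score and divided by the temperature coefficient. The sketch below follows that reading for a single region of interest; the exact form of the original equation is not available here, so the function name, the argument layout, and the way the score and temperature enter are all assumptions.

```python
import math

def part_classification_loss(prob_matrix, true_partitions, likelihood_score, temperature=1.0):
    """Cross-entropy over likelihood-scaled partition probabilities for one ROI.

    prob_matrix[j][k]: probability that partial region j belongs to partition class k.
    true_partitions[j]: ground-truth partition class index of partial region j.
    likelihood_score: detection likelihood score of the ROI (scales every entry).
    temperature: temperature coefficient (assumed to divide the scaled entries).
    """
    total = 0.0
    for row, target in zip(prob_matrix, true_partitions):
        logits = [likelihood_score * p / temperature for p in row]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_norm - logits[target]
    return total / len(prob_matrix)

# A uniform row carries no information; a confident, correct row gives a small loss.
loss_uniform = part_classification_loss([[0.5, 0.5]], [0], likelihood_score=1.0)
loss_confident = part_classification_loss([[0.9, 0.1]], [0], likelihood_score=1.0, temperature=0.1)
loss_unreliable = part_classification_loss([[0.9, 0.1]], [0], likelihood_score=0.0, temperature=0.1)
```

In this sketch a likelihood score of zero flattens the scaled entries, so the loss no longer depends on the probability matrix, which matches the intent that unreliable detections contribute little to the partial classification task.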

FIG. 6 shows benchmark data sets for a machine learning model for person search according to an embodiment.

The data set 610 is the CUHK-SYSU data set and may include images from movies and city streets. It may include a total of 18,184 images, with 96,143 regions of interest (for example, bounding boxes) corresponding to people and 8,432 identities (IDs). Of these, 11,206 images and 5,532 identities may be used to train the machine learning model, and 6,978 images and 2,900 query persons may be used to evaluate it.

The data set 620 is the PRW data set and may include images captured from six camera angles. It may include a total of 11,816 images, with 43,110 regions of interest (for example, bounding boxes) corresponding to people and 932 identities. Of these, 5,704 images and 482 identities may be used to train the machine learning model, and 6,112 images and 2,057 query persons may be used to evaluate it.

Unlike the data set 610, the data set 620 may provide evaluation data that includes a regular gallery and a multi-view gallery. The regular gallery may include images captured from the same camera angle as the query person, and regions of interest corresponding to unlabeled persons in the images (for example, regions of interest without a ground-truth identity class) may be excluded from the evaluation. The multi-view gallery may exclude images captured from the same camera angle as the query person, and regions of interest corresponding to unlabeled persons may be included in the evaluation. For evaluating a machine learning model, the multi-view gallery may be a more challenging setting than the regular gallery.
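The two gallery settings can be sketched as a simple filter. The `Roi` and `GalleryImage` types and the `build_gallery` function are hypothetical names introduced for this illustration; they are not part of the PRW tooling.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Roi:
    identity: Optional[int]  # None means the person is unlabeled

@dataclass
class GalleryImage:
    camera_id: int
    rois: List[Roi]

def build_gallery(images, query_camera_id, multiview):
    """Return the gallery images (with the ROIs to evaluate) for one query."""
    gallery = []
    for image in images:
        if multiview and image.camera_id == query_camera_id:
            continue  # multi-view: drop images taken by the query's camera
        if multiview:
            rois = list(image.rois)  # multi-view: unlabeled persons are kept
        else:
            # regular: unlabeled persons are excluded from the evaluation
            rois = [roi for roi in image.rois if roi.identity is not None]
        gallery.append(GalleryImage(image.camera_id, rois))
    return gallery

images = [
    GalleryImage(1, [Roi(7), Roi(None)]),
    GalleryImage(2, [Roi(7), Roi(None)]),
]
regular = build_gallery(images, query_camera_id=1, multiview=False)
multi = build_gallery(images, query_camera_id=1, multiview=True)
```

Here the regular gallery keeps both images but only their labeled regions, while the multi-view gallery keeps only the image from the other camera together with its unlabeled region.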

| Model | Training method | CUHK-SYSU mAP | CUHK-SYSU Rank-1 | PRW mAP | PRW Rank-1 |
|---|---|---|---|---|---|
| Comparative Example 1 | Two-step | - | - | 20.5 | 48.3 |
| Comparative Example 2 | Two-step | 83.0 | 83.7 | 32.6 | 72.1 |
| Comparative Example 3 | Two-step | 87.2 | 88.5 | 38.7 | 65.0 |
| Comparative Example 4 | Two-step | 93.0 | 94.2 | 42.9 | 70.2 |
| Comparative Example 5 | Two-step | 93.9 | 95.1 | 46.8 | 87.5 |
| Comparative Example 6 | Two-step | 90.3 | 91.4 | 47.2 | 87.0 |
| Comparative Example 7 | End-to-end | 75.5 | 78.7 | 21.3 | 49.9 |
| Comparative Example 8 | End-to-end | 76.3 | 80.1 | 23.0 | 61.9 |
| Comparative Example 9 | End-to-end | 77.9 | 81.2 | 24.2 | 53.1 |
| Comparative Example 10 | End-to-end | 79.3 | 81.3 | - | - |
| Comparative Example 11 | End-to-end | 84.1 | 86.5 | 33.4 | 73.6 |
| Comparative Example 12 | End-to-end | 90.0 | 90.7 | 45.3 | 81.7 |
| Comparative Example 13 | End-to-end | 90.5 | 88.4 | 48.5 | 87.9 |
| Comparative Example 14 | End-to-end | 88.7 | 89.6 | 36.0 | 76.1 |
| Comparative Example 15 | End-to-end | 88.9 | 89.3 | 41.2 | 81.4 |
| Comparative Example 16 | End-to-end | 92.1 | 92.9 | 44.0 | 81.1 |
| Comparative Example 17 | End-to-end | 89.7 | 90.8 | 39.8 | 80.4 |
| Embodiment | End-to-end | 93.3 | 94.2 | 53.3 | 87.7 |
| Comparative Example 16* | End-to-end | - | - | 40.0 | 67.5 |
| Comparative Example 17* | End-to-end | - | - | 36.5 | 65.0 |
| Embodiment* | End-to-end | - | - | 48.0 | 73.2 |

Table 1 compares, on the data sets, the performance of the machine learning models for person search according to the comparative examples and the embodiment.

The models according to Comparative Examples 1 to 6 are two-step models that learn person detection and person identification separately, and may respectively denote the models disclosed in the following papers. Comparative Example 1 may denote the model disclosed in Tong Xiao, Shuang Li, Bochao Wang, Liang Lin, and Xiaogang Wang, 'Joint detection and identification feature learning for person search,' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. Comparative Example 2 may denote the model disclosed in Di Chen, Shanshan Zhang, Jian Yang, and Ying Tai, 'Person search via a mask-guided two-stream CNN model,' in Proceedings of the European Conference on Computer Vision, 2018. Comparative Example 3 may denote the model disclosed in Xu Lan, Xiatian Zhu, and Shaogang Gong, 'Person search by multi-scale matching,' in Proceedings of the European Conference on Computer Vision, 2018. Comparative Example 4 may denote the model disclosed in Chuchu Han, Jiacheng Ye, Yunshan Zhong, Xin Tan, Chi Zhang, Changxin Gao, and Nong Sang, 'Re-ID driven localization refinement for person search,' in Proceedings of the IEEE International Conference on Computer Vision, 2019. Comparative Example 5 may denote the model disclosed in Cheng Wang, Bingpeng Ma, Hong Chang, Shiguang Shan, and Xilin Chen, 'TCTS: A task-consistent two-stage framework for person search,' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020. Comparative Example 6 may denote the model disclosed in Wenkai Dong, Zhaoxiang Zhang, Chunfeng Song, and Tieniu Tan, 'Instance guided proposal network for person search,' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.

The models according to Comparative Examples 7 to 17 are end-to-end models that learn person detection and person identification at once, and may respectively denote the models disclosed in the following papers. Comparative Example 7 may denote the model disclosed in Tong Xiao, Shuang Li, Bochao Wang, Liang Lin, and Xiaogang Wang, 'Joint detection and identification feature learning for person search,' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. Comparative Example 8 may denote the model disclosed in Jimin Xiao, Yanchun Xie, Tammam Tillo, Kaizhu Huang, Yunchao Wei, and Jiashi Feng, 'IAN: the individual aggregation network for person search,' Pattern Recognition, 87:332-340, 2019. Comparative Example 9 may denote the model disclosed in Hao Liu, Jiashi Feng, Zequn Jie, Karlekar Jayashree, Bo Zhao, Meibin Qi, Jianguo Jiang, and Shuicheng Yan, 'Neural person search machines,' in Proceedings of the IEEE International Conference on Computer Vision, 2017. Comparative Example 10 may denote the model disclosed in Xiaojun Chang, Po-Yao Huang, Yi-Dong Shen, Xiaodan Liang, Yi Yang, and Alexander G. Hauptmann, 'RCAA: Relational context-aware agents for person search,' in Proceedings of the European Conference on Computer Vision, 2018. Comparative Example 11 may denote the model disclosed in Yichao Yan, Qiang Zhang, Bingbing Ni, Wendong Zhang, Minghao Xu, and Xiaokang Yang, 'Learning context graph for person search,' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. Comparative Example 12 may denote the model disclosed in Wenkai Dong, Zhaoxiang Zhang, Chunfeng Song, and Tieniu Tan, 'Bi-directional interaction network for person search,' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020. Comparative Example 13 may denote the model disclosed in Kun Tian, Houjing Huang, Yun Ye, Shiyu Li, Jinbin Lin, and Guan Huang, 'End-to-end thorough body perception for person search,' in Proceedings of the AAAI Conference on Artificial Intelligence, 2020. Comparative Example 14 may denote the model disclosed in Ju Dai, Pingping Zhang, Huchuan Lu, and Hongyu Wang, 'Dynamic imposter based online instance matching for person search,' Pattern Recognition, 100:107120, 2020. Comparative Example 15 may denote the model disclosed in Yingji Zhong, Xiaoyu Wang, and Shiliang Zhang, 'Robust partial matching for person search in the wild,' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020. Comparative Example 16 may denote the model disclosed in Di Chen, Shanshan Zhang, Jian Yang, and Bernt Schiele, 'Norm-aware embedding for efficient person search,' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020. Comparative Example 17 may denote the model disclosed in Di Chen, Shanshan Zhang, Wanli Ouyang, Jian Yang, and Bernt Schiele, 'Hierarchical online instance matching for person search,' in Proceedings of the AAAI Conference on Artificial Intelligence, 2020.

As shown in Table 1, the evaluation performance of the machine learning models according to the comparative examples and the embodiment may be measured on the data sets 610 and 620 and compared quantitatively. The comparative examples, grouped into two-step and end-to-end techniques, may be compared with the embodiment.

On the CUHK-SYSU data set 610, the machine learning model according to the embodiment may have the highest performance among the end-to-end machine learning models. It may also have performance similar to the best of the two-step machine learning models. A two-step machine learning model has the inconvenience of training the person detection network and the person identification network separately. Accordingly, the machine learning model according to the embodiment, being based on an end-to-end technique, may achieve excellent results in both efficiency and performance when training a machine learning model for person search.

In addition, on the PRW data set 620, the machine learning model according to the embodiment may have the highest performance among the machine learning models according to the embodiment and the comparative examples. In Table 1, performance may be compared on both the regular gallery and the multi-view gallery (marked with * in Table 1) of the PRW data set 620. Evaluation performance on the multi-view gallery, which is more challenging than the regular gallery, may be lower than on the regular gallery. As shown in Table 1, the machine learning model according to the embodiment may also outperform the machine learning models according to the comparative examples on the multi-view gallery. As with CUHK-SYSU, the proposed end-to-end technique records the best performance on the PRW data set and may therefore be judged to achieve excellent results in person search.

[Table 3]

Table 3 shows the effect of the partial classification task and the adaptive gradient weight function on person detection and person identification.

Person search performance may be evaluated by mean Average Precision (mAP) and the Top-k score. Here, mAP may measure the area under the precision-recall curve of the matching results, and the Top-k score may denote the proportion of queries for which identity matching succeeds within the k regions of interest most similar to the query person. A region of interest whose Intersection over Union (IoU) with the ground-truth bounding box is less than 0.5 may not be counted toward the Top-k score.
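The metrics named above can be sketched for a single query as follows. This is a minimal illustration under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, average precision is computed in its common non-interpolated form, and a Top-k hit is taken to mean that at least one of the k highest-ranked results is a correct match. All function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def top_k_hit(ranked_matches, k):
    """True if any of the k highest-ranked results is a correct match."""
    return any(ranked_matches[:k])

def average_precision(ranked_matches, num_relevant):
    """Non-interpolated average precision for one query.

    ranked_matches: booleans ordered by decreasing similarity; True marks a
                    result with the right identity and IoU >= 0.5.
    num_relevant: number of ground-truth instances of the query person.
    """
    hits, precision_sum = 0, 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / num_relevant if num_relevant else 0.0

ranked = [False, True, True]
ap = average_precision(ranked, num_relevant=2)
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Averaging `average_precision` over all queries gives mAP, and averaging `top_k_hit` over all queries gives the Top-k score. In the toy ranking the first result is wrong, so the query misses at k = 1 but hits at k = 2.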

FIG. 7 shows a qualitative comparison of the performance of the machine learning models according to a comparative example and an embodiment.

The machine learning model according to the comparative example may denote a machine learning model that includes a backbone network, a person detection network, and a person identification network but does not implement a partial classification network. The machine learning model according to the embodiment may denote a machine learning model that includes a partial classification network together with the backbone network, the person detection network, and the person identification network.

In Table 3, person detection performance may be measured by mAP and recall, and person identification performance may be measured by mAP and Top-1 score. Compared with the machine learning model according to the comparative example, the machine learning model according to the embodiment may improve person detection mAP and recall by 1.32% and 0.47%, respectively, and person identification mAP and Top-1 score by 1.52% and 1.02%, respectively.

Each column of FIG. 7 shows a query person and the regions of interest with the highest similarity as estimated by the machine learning model according to the comparative example and by the machine learning model according to the embodiment. In FIG. 7, a region of interest with a green outline may indicate that the region of interest corresponding to the query person was successfully searched, that is, detected and identified. A region of interest with a red outline may indicate that the region of interest corresponding to the query person was not found.

With the partial classification network, the machine learning model according to the embodiment may be trained so that the shared feature map becomes more descriptive. In FIG. 7, the machine learning model according to the embodiment may accurately find the region of interest for a query person who has distinct partial features such as clothing patterns and accessories (for example, a bag and/or an umbrella). In contrast, the machine learning model according to the comparative example, which does not include the partial classification network, may return a region of interest for a wrong person whose clothing color is similar to that of the query person.

Table 3 also shows the effect of the adaptive gradient weight function (AGWF). The adaptive gradient weight function may be effective for both person detection and person identification.

When the adaptive gradient weight function is deactivated (for example, when a general weight function is applied), person detection performance may decrease by 2.24% in mAP and 2.29% in recall, and person identification performance may decrease by 1.17% in mAP and 0.29% in Top-1 score, compared with when the adaptive gradient weight function is activated.

In addition, when the adaptive gradient weight function is deactivated, a machine learning model without the partial classification network degrades more than a machine learning model with the partial classification network, and may therefore have lower person identification performance. However, as the number of tasks in an end-to-end machine learning model increases, a risk of conflict between tasks may arise. Therefore, with the adaptive gradient weight function deactivated, a machine learning model that includes the partial classification network may have lower person detection performance than a machine learning model that does not include it. The adaptive gradient weight function may be applied not only to the machine learning model according to the embodiment but also to conventional networks for the two tasks of person detection and person identification.

[Table 4]

Table 4 shows how the binary classification loss value for regions of interest having a ground-truth identity class and the identification loss value for regions of interest having no ground-truth identity class affect the performance of the machine learning model for person search.

The binary classification loss value for a region of interest having a ground-truth identity class may be designed to increase the similarity between the target feature vector and the representative feature vector of the ground-truth identity class (for example, the positive class), regardless of the similarity between the target feature vector and the representative feature vectors of the other identity classes (for example, the negative classes). In contrast, the identification loss value for a region of interest having no ground-truth identity class may be designed to reduce the similarity to the representative feature vectors of the other identity classes (for example, the negative classes), because no ground-truth identity class exists. Therefore, it may be more reasonable to use the binary classification loss value and the identification loss value together as complementary values in order to avoid undesired bias. As shown in Table 4, a machine learning model trained without the binary classification loss value and the identification loss value may perform 0.6% and 1.36% lower in mAP and Top-1 scores, respectively, than the machine learning model according to an embodiment.
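The complementary roles of the two loss values described above can be illustrated with a minimal sketch. The loss formulas themselves are not reproduced in this passage, so everything below is an assumption made for illustration: cosine similarity, a logistic (softplus) form for both terms, the `scale` parameter, and the function names `labeled_binary_loss` and `unlabeled_loss` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def labeled_binary_loss(target, positive_rep, scale=10.0):
    # ROI with a ground-truth identity class (assumed logistic form):
    # depends only on the similarity to the positive class, so the loss
    # falls as that similarity rises, whatever the negatives look like.
    return softplus(-scale * cosine(target, positive_rep))


def unlabeled_loss(target, known_reps, scale=10.0):
    # ROI without a ground-truth identity class (assumed logistic form):
    # every known identity acts as a negative, so the loss falls as the
    # similarities to their representative vectors fall.
    terms = [softplus(scale * cosine(target, rep)) for rep in known_reps]
    return sum(terms) / len(terms)
```

In this sketch the first term only pulls a labeled target feature vector toward its own class, and the second term only pushes an unlabeled one away from known classes, which is why the two are treated as complementary.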

The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware components and software components. For example, the devices, methods, and components described in the embodiments may be implemented using a general-purpose computer or a special-purpose computer, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, a single processing device is sometimes described as being used, but those skilled in the art will understand that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on a computer-readable recording medium.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination, and the program instructions recorded on the medium may be specially designed and configured for the embodiment or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine language code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to a limited set of drawings, those skilled in the art may apply various technical modifications and variations based on them. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or if components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

A method for training a machine learning model, the method comprising:
obtaining a region of interest (ROI) and feature data for the region of interest by applying a backbone network of the machine learning model to an image;
obtaining, from the obtained feature data, a feature matrix representing partial features of a plurality of partial regions of the region of interest;
obtaining a likelihood score indicating a likelihood that the region of interest corresponds to a person by applying a person detection network of the machine learning model to the obtained feature matrix;
obtaining a probability matrix including probabilities that each partial region belongs to partition classes indicating the plurality of partial regions by applying a part classification network of the machine learning model to the obtained feature matrix;
obtaining a target feature vector from the region of interest by applying a person re-identification network of the machine learning model to the obtained feature matrix; and
updating parameters of the machine learning model using an objective function value calculated based on the likelihood score, the probability matrix, and the target feature vector.

The method of claim 1, wherein updating the parameters of the machine learning model comprises:
calculating a detection loss value for the person detection network for the region of interest based on the likelihood score;
calculating a part classification loss value for the part classification network based on the probability matrix; and
calculating a re-identification loss value for the person re-identification network based on the target feature vector.

The method of claim 2, wherein calculating the part classification loss value comprises
calculating the part classification loss value further based on the likelihood score together with the probability matrix.

The method of claim 3, wherein calculating the part classification loss value further based on the likelihood score comprises
calculating the part classification loss value by using, as a corrected probability that a partial region belongs to a partition class, a value obtained by multiplying a probability of the probability matrix by the likelihood score.
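The corrected probability recited in the preceding claim can be sketched as follows. This is only an illustration under stated assumptions: the negative-log (cross-entropy) form of the loss, the convention that partial region `i` has partition class `i` as its correct class, the `eps` guard, and the function name `part_classification_loss` are all hypothetical and are not taken from the claims.

```python
import math


def part_classification_loss(prob_matrix, likelihood_score, eps=1e-12):
    """Mean negative log of the corrected probabilities.

    prob_matrix[i][j] is the probability that partial region i belongs to
    partition class j. Assumption: the correct class of region i is i.
    """
    total = 0.0
    for i, row in enumerate(prob_matrix):
        # Corrected probability: class probability times the ROI's
        # likelihood score of corresponding to a person.
        corrected = row[i] * likelihood_score
        total += -math.log(corrected + eps)
    return total / len(prob_matrix)
```

With a likelihood score of 1.0 the sketch reduces to ordinary cross-entropy over the diagonal entries, and a lower likelihood score scales every corrected probability down by the same factor.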

The method of claim 2, wherein calculating the re-identification loss value comprises
calculating the re-identification loss value further based on the likelihood score together with the target feature vector.

The method of claim 5, wherein calculating the re-identification loss value further based on the likelihood score comprises
calculating the re-identification loss value by using, as a corrected similarity between the target feature vector and another feature vector, a value obtained by multiplying a similarity between the target feature vector and the other feature vector by the likelihood score.
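The corrected similarity recited in the preceding claim amounts to scaling each similarity by the likelihood score before it enters the loss. A minimal sketch follows; the use of cosine similarity, the softmax cross-entropy form of the loss, the `temperature` parameter, and the function names are assumptions made for illustration and are not specified by the claims.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def corrected_similarities(target, others, likelihood_score):
    # Corrected similarity: raw similarity times the likelihood score.
    return [likelihood_score * cosine(target, o) for o in others]


def softmax_identification_loss(corrected, positive_index, temperature=0.1):
    # Assumed loss form: cross-entropy over temperature-scaled similarities,
    # computed with the log-sum-exp trick for numerical stability.
    logits = [s / temperature for s in corrected]
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[positive_index]
```

In this sketch a low likelihood score flattens the similarity distribution, so a region of interest that is unlikely to contain a person contributes a less sharply peaked term to the loss.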

The method of claim 2, wherein calculating the re-identification loss value comprises
calculating, in response to a ground-truth identity class indicating the identity of a person existing for the region of interest, the re-identification loss value based on a difference between a similarity between the target feature vector and another feature vector and a similarity between the target feature vector and a representative feature vector of the ground-truth identity class.

The method of claim 7, wherein calculating the re-identification loss value further comprises
calculating a binary classification loss value for whether the region of interest is classified into the ground-truth identity class.

The method of claim 2, wherein calculating the re-identification loss value comprises
calculating, in response to no ground-truth identity class existing for the region of interest, the re-identification loss value based on a similarity between the target feature vector and a representative feature vector of each identity class.

The method of claim 9, wherein calculating the re-identification loss value based on the similarity between the target feature vector and the representative feature vector of each identity class comprises:
calculating, in response to another region of interest of the image having an identity class, a similarity between a target feature vector of the other region of interest and a representative feature vector of the identity class of the other region of interest; and
calculating the re-identification loss value based on a difference between the similarity between the target feature vector of the region of interest and the representative feature vector of each identity class and the similarity between the target feature vector of the other region of interest and the representative feature vector of the identity class of the other region of interest.

The method of claim 1, wherein updating the parameters of the machine learning model using the calculated objective function value comprises:
calculating, in response to a plurality of regions of interest being obtained from the image, an objective function value for the image based on objective function values calculated for the plurality of regions of interest; and
updating the parameters of the machine learning model using the objective function value calculated for the image.
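The per-image aggregation recited in the preceding claim can be sketched as below. The claim does not state how the per-region values are combined, so the weighted sum of the three loss values per region of interest, the averaging across regions, the default weights, and the names `RoiLosses`, `roi_objective`, and `image_objective` are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class RoiLosses:
    """Loss values computed for one region of interest (hypothetical container)."""
    detection: float
    part_classification: float
    re_identification: float


def roi_objective(losses, w_det=1.0, w_part=1.0, w_reid=1.0):
    # Assumed combination: weighted sum of the three per-ROI loss values.
    return (w_det * losses.detection
            + w_part * losses.part_classification
            + w_reid * losses.re_identification)


def image_objective(rois):
    # Assumed aggregation: mean of the per-ROI objective values.
    if not rois:
        raise ValueError("at least one region of interest is required")
    return sum(roi_objective(r) for r in rois) / len(rois)
```

A single scalar per image is convenient because one backward pass on it updates the shared backbone and all three task networks together.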

A computer program stored in a computer-readable recording medium to execute, in combination with hardware, the method of any one of claims 1 to 11.

A device for training a machine learning model, the device comprising a processor configured to:
obtain a region of interest (ROI) and feature data for the region of interest by applying a backbone network of the machine learning model to an image;
obtain, from the obtained feature data, a feature matrix representing partial features of a plurality of partial regions of the region of interest;
obtain a likelihood score indicating a likelihood that the region of interest corresponds to a person by applying a person detection network of the machine learning model to the obtained feature matrix;
obtain a probability matrix including probabilities that each partial region belongs to partition classes indicating the plurality of partial regions by applying a part classification network of the machine learning model to the obtained feature matrix;
obtain a target feature vector from the region of interest by applying a person re-identification network of the machine learning model to the obtained feature matrix; and
update parameters of the machine learning model using an objective function value calculated based on the likelihood score, the probability matrix, and the target feature vector.

The device of claim 13, wherein the processor is configured to
calculate a detection loss value for the person detection network for the region of interest based on a comparison of the likelihood score and a ground-truth likelihood score, calculate a part classification loss value for the part classification network based on the probability matrix, and calculate a re-identification loss value for the person re-identification network based on the target feature vector.

The device of claim 14, wherein the processor is configured to
calculate the part classification loss value further based on the likelihood score together with the probability matrix.

The device of claim 15, wherein the processor is configured to
calculate the part classification loss value by using, as a corrected probability that a partial region belongs to a partition class, a value obtained by multiplying a probability of the probability matrix by the likelihood score.

The device of claim 14, wherein the processor is configured to
calculate the re-identification loss value further based on the likelihood score together with the target feature vector.

The device of claim 17, wherein the processor is configured to
calculate the re-identification loss value by using, as a corrected similarity between the target feature vector and another feature vector, a value obtained by multiplying a similarity between the target feature vector and the other feature vector by the likelihood score.

The device of claim 14, wherein the processor is configured to
calculate, in response to a ground-truth identity class indicating the identity of a person existing for the region of interest, the re-identification loss value based on a difference between a similarity between the target feature vector and another feature vector and a similarity between the target feature vector and a representative feature vector of the ground-truth identity class.

The device of claim 19, wherein the processor is configured to
calculate a binary classification loss value for whether the region of interest is classified into the ground-truth identity class.

The device of claim 14, wherein the processor is configured to
calculate, in response to no ground-truth identity class existing for the region of interest, the re-identification loss value based on a similarity between the target feature vector and a representative feature vector of each identity class.

The device of claim 18, wherein the processor is configured to
calculate, in response to another region of interest of the image having an identity class, a similarity between a target feature vector of the other region of interest and a representative feature vector of the identity class of the other region of interest, and calculate the re-identification loss value based on a difference between a similarity between the target feature vector of the region of interest and a representative feature vector of each identity class and the similarity between the target feature vector of the other region of interest and the representative feature vector of the identity class of the other region of interest.

The device of claim 13, wherein the processor is configured to
calculate, in response to a plurality of regions of interest being obtained from the image, an objective function value for the image based on objective function values calculated for the plurality of regions of interest, and update the parameters of the machine learning model using the objective function value calculated for the image.