KR102702069B1

KR102702069B1 - Method of controlling sports activity classification learning apparatus, computer readable medium and apparatus for performing the method

Info

Publication number: KR102702069B1
Application number: KR1020210104123A
Authority: KR
Inventors: 이수원; 류광현
Original assignee: 숭실대학교 산학협력단
Priority date: 2021-08-06
Filing date: 2021-08-06
Publication date: 2024-09-03
Also published as: KR20230022010A; WO2023013809A1

Abstract

The present invention provides a sports activity classification learning device including: a database module that matches sports activity video information with sports activity classification information and stores them as learning data; a frame extraction module that extracts frames; an important frame extraction module that extracts important frames from the extracted frames; a spatial information extraction module that extracts spatial information of the important frames; a feature vector acquisition module that acquires feature vectors of the important frames; and a sports activity classification module that classifies the frames into sports activities.

Description

METHOD OF CONTROLLING SPORTS ACTIVITY CLASSIFICATION LEARNING APPARATUS, COMPUTER READABLE MEDIUM AND APPARATUS FOR PERFORMING THE METHOD

The present invention relates to a control method for a sports activity classification learning device, and to a recording medium and a device for performing the method. More specifically, it relates to a control method for a sports activity classification learning device that classifies sports activities from at least one frame taken from sports activity video information, and to a recording medium and a device for performing the method.

Activity recognition technology recognizes what activity is occurring and classifies the recognized activity.

Activity recognition can be performed with a sensor-based method, which uses sensors such as accelerometers and gyroscopes. In sports games, however, athletes are prohibited from wearing any electronic equipment, including sensors such as accelerometers and gyroscopes. Sensor-based activity recognition is therefore difficult to use for activity recognition in sports games.

As a result, there is a need to apply deep learning techniques such as deep neural networks (DNNs), convolutional neural networks (CNNs), deep belief networks (DBNs), and recurrent neural networks (RNNs) to video information of sports games in order to recognize athletes' sports activities and classify the recognized activities.

The technical problem to be solved by the present invention is to provide a control method for a sports activity classification learning device that extracts frames from video information of a sports game, obtains a temporal attention score and a spatial attention map, and classifies sports activities, as well as a recording medium and a device for performing the method.

One aspect of the present invention is a sports activity classification learning device that extracts at least one frame from sports activity video information and classifies a sports activity. The device may include: a database module that matches the sports activity video information with sports activity classification information and stores them as learning data; a frame extraction module that extracts at least one frame from the video information stored in the database module; an important frame extraction module that extracts an important frame from the extracted frames; a spatial information extraction module that extracts spatial information of the important frame; a feature vector acquisition module in which the spatial information of the important frame is input to an LSTM (Long Short Term Memory) model and a feature vector of the important frame is acquired; and a sports activity classification module that obtains a probability value from the feature vector of the important frame and, only when the probability value is greater than or equal to a preset probability value, classifies the frame as the sports activity matched with the video information constituting the important frame.

In addition, the important frame extraction module may include: a pixel extraction unit that extracts the pixels of the R channel, the G channel, and the B channel from the frame; a pixel summing unit that sums the extracted pixels; a pixel pooling operation unit that performs MAX-POOLING and AVERAGE-POOLING operations on the summed pixel values; a weight setting unit that sets a weight for each of the MAX-POOLING operation value and the AVERAGE-POOLING operation value such that the two weights sum to 1; and a temporal attention score acquisition unit that obtains a temporal attention score from the sum of the MAX-POOLING operation value and the AVERAGE-POOLING operation value.

In addition, the important frame extraction module may extract, as an important frame, a frame whose value after multiplication by the temporal attention score obtained by the temporal attention score acquisition unit is relatively large.

In addition, the temporal attention score acquisition unit may include: a first Fully Connected Layer that receives the sum of the MAX-POOLING operation value and the AVERAGE-POOLING operation value; a ReLU activation function execution unit that receives the output of the first Fully Connected Layer; a second Fully Connected Layer that receives the output of the ReLU activation function execution unit; and a Sigmoid activation function execution unit that receives the output of the second Fully Connected Layer.

In addition, the spatial information extraction module may include: a first Convolution Layer that receives the important frame and outputs a Feature Map; a Feature Map pooling operation unit that performs a MAX-POOLING operation and an AVERAGE-POOLING operation on the Feature Map; a channel attention map acquisition unit that obtains a Channel Attention Map from the output of the Feature Map pooling operation unit; and a spatial attention map acquisition unit that obtains a Spatial Attention Map from the Channel Attention Map.

Another aspect of the present invention is a control method of a sports activity classification learning device that extracts at least one frame from sports activity video information and classifies a sports activity. The control method may include: matching the sports activity video information with sports activity classification information and storing them as learning data; extracting at least one frame from the sports activity video information stored as the learning data; extracting an important frame from the extracted frames; extracting spatial information of the important frame; inputting the spatial information of the important frame to an LSTM (Long Short Term Memory) model and obtaining a feature vector of the important frame; obtaining a probability value from the feature vector of the frame; and, only when the probability value is greater than or equal to a preset probability value, classifying the frame as the sports activity matched with the video information constituting the important frame.

In addition, extracting an important frame from the extracted frames may include: extracting the pixels of the R channel, the G channel, and the B channel from the frame; summing the extracted pixels; performing MAX-POOLING and AVERAGE-POOLING operations on the summed pixel values; setting a weight for each of the MAX-POOLING operation value and the AVERAGE-POOLING operation value such that the two weights sum to 1; and obtaining a temporal attention score from the sum of the MAX-POOLING operation value and the AVERAGE-POOLING operation value.

In addition, obtaining the temporal attention score from the sum of the MAX-POOLING operation value and the AVERAGE-POOLING operation value may include: inputting that sum to a first Fully Connected Layer; inputting the output of the first Fully Connected Layer to a ReLU activation function execution unit; inputting the output of the ReLU activation function execution unit to a second Fully Connected Layer; and inputting the output of the second Fully Connected Layer to a Sigmoid activation function execution unit.

In addition, extracting the spatial information of the important frame may include: inputting the frame multiplied by the temporal attention score to a first Convolution Layer, which outputs a Feature Map; performing a MAX-POOLING operation and an AVERAGE-POOLING operation on the Feature Map; obtaining a Channel Attention Map from the results of the MAX-POOLING and AVERAGE-POOLING operations on the Feature Map; and obtaining a Spatial Attention Map from the Channel Attention Map.

In yet another aspect of the present invention, a computer program for performing the control method of the sports activity classification learning device may be recorded on a computer-readable storage medium.

According to the aspects of the present invention described above, providing a control method of a sports activity classification learning device, and a recording medium and a device for performing the method, makes it possible to classify sports activities by considering both a temporal attention score and a spatial attention map.

Figure 1 is a conceptual diagram illustrating a sports activity classification learning device according to one embodiment of the present invention.
Figure 2 is a conceptual diagram showing the important frame extraction module.
Figure 3 is a conceptual diagram showing multiple frames multiplied by temporal attention scores through the TAM calculation process.
Figure 4 is a conceptual diagram showing the process of obtaining a spatial attention map through a 2D CNN model.
Figure 5 is a conceptual diagram showing the process of classifying sports activities after passing through the 2D CNN model.
Figure 6 is a diagram showing a table evaluating the sports activity classification learning device of the present invention by mAP.
Figure 7 is a flowchart showing a control method of a sports activity classification learning device according to one embodiment of the present invention.
Figures 8 and 9 are flowcharts showing the process of obtaining temporal attention scores.
Figure 10 is a flowchart showing the process of obtaining a spatial attention map.

The detailed description of the invention below refers to the accompanying drawings, which illustrate, by way of example, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the invention, while different from one another, are not necessarily mutually exclusive. For example, a specific shape, structure, or feature described herein in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the invention. It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. Accordingly, the detailed description below is not to be taken in a limiting sense, and the scope of the invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. Like reference numerals in the drawings designate the same or similar functions throughout the several aspects.

Hereinafter, preferred embodiments of the present invention will be described in more detail with reference to the drawings.

Figure 1 is a conceptual diagram illustrating a sports activity classification learning device according to one embodiment of the present invention.

The sports activity classification learning device (1) may include a database module (10), a frame extraction module (20), an important frame extraction module (30), a spatial information extraction module (40), a feature vector acquisition module (50), and a sports activity classification module (60).

The sports activity classification learning device (1) may be a device that extracts at least one frame from sports activity video information and learns to classify which sports activity is occurring in the extracted frame. Here, a sports activity may be, in a baseball game, a batter swinging or a batter hitting a home run or a strike, and, in a soccer game, a player scoring a goal, shooting, taking a corner kick or a free kick, committing a foul, or taking a penalty kick. These are examples only, and sports activities are not limited to them.

The database module (10) may match sports activity video information with sports activity classification information and store them as learning data. For example, if the sports activity video information shows a batter swinging in a baseball game, it may be matched with the classification 'swing' and stored; if it shows a batter hitting a home run, it may be matched with the classification 'home run' and stored.

The frame extraction module (20) may extract at least one frame from the video information stored in the database module (10). For example, from video information of a batter hitting a home run in a baseball game, it may extract one or more consecutive frames at a preset time interval, or one or more non-consecutive frames.
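As an illustration of extracting frames at a preset time interval, the following is a minimal sketch. The function name `sample_frame_indices` and the parameters `fps` and `interval_sec` are hypothetical and do not appear in the specification; the sketch only computes which frame indices would be taken.

```python
def sample_frame_indices(total_frames: int, fps: float, interval_sec: float) -> list[int]:
    """Return the indices of frames taken every `interval_sec` seconds.

    Hypothetical helper: the specification only states that frames are
    extracted at a preset time interval.
    """
    if total_frames <= 0 or fps <= 0 or interval_sec <= 0:
        raise ValueError("total_frames, fps and interval_sec must be positive")
    # Number of source frames between two sampled frames (at least 1).
    step = max(1, round(fps * interval_sec))
    return list(range(0, total_frames, step))


# A 10-second clip at 30 fps, sampled every 0.5 seconds.
indices = sample_frame_indices(total_frames=300, fps=30.0, interval_sec=0.5)
```

With these values every 15th frame is selected, giving 20 frames for the clip.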

The important frame extraction module (30) may calculate a temporal attention score from the pixel values of a frame and multiply the frame by that score. The temporal attention score may be calculated by a TAM (Temporal Attention Module), a method that computes the relative importance of multiple frames so that important frames and non-important frames can be extracted.

For example, if the value of a second frame multiplied by its temporal attention score is greater than that of a first frame multiplied by its temporal attention score, the second frame may be extracted as an important frame and the first frame as a non-important frame.
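The comparison in this example can be sketched as follows. The specification does not state how the "value" of a score-multiplied frame is measured, so this sketch assumes the mean of the scaled pixel values; all function names are hypothetical.

```python
def scale_frame(frame, score):
    """Multiply every pixel of a 2D frame by its temporal attention score."""
    return [[score * p for p in row] for row in frame]


def frame_magnitude(frame):
    """Assumed measure of a frame's value: the mean of its pixel values."""
    values = [p for row in frame for p in row]
    return sum(values) / len(values)


def split_by_importance(frames, scores):
    """Return (important, non_important) frame indices and the scaled frames."""
    scaled = [scale_frame(f, s) for f, s in zip(frames, scores)]
    magnitudes = [frame_magnitude(f) for f in scaled]
    best = max(range(len(scaled)), key=magnitudes.__getitem__)
    non_important = [i for i in range(len(scaled)) if i != best]
    return [best], non_important, scaled


# Two identical toy frames; only the temporal attention scores differ.
frames = [[[0.2, 0.4], [0.6, 0.8]], [[0.2, 0.4], [0.6, 0.8]]]
scores = [0.3, 0.9]
important, non_important, scaled = split_by_importance(frames, scores)
```

Here the second frame (index 1) ends up as the important frame because its scaled values are larger. Both scaled frames are kept, since the later modules process important and non-important frames alike.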

The TAM (Temporal Attention Module) is described in detail below with reference to Figures 2 and 3.

The spatial information extraction module (40) may obtain a Spatial Attention Map for each important frame and non-important frame that has been multiplied by its temporal attention score.

In the present invention, the Spatial Attention Map is obtained through a 2D CNN model, where the 2D CNN model means the ResNet50v2 model with a CBAM (Convolutional Block Attention Module) added.

The Spatial Attention Map of each important frame and non-important frame can be obtained through the CBAM calculation.

The Spatial Attention Map is a value produced by the CBAM (Convolutional Block Attention Module), a technique that improves the performance of a network model by focusing its operations on specific vectors. More specifically, the attention technique imitates the human recognition process and adds weight to the important data among data with a hierarchical structure, and it is used to improve the classification performance of a convolutional neural network. By distinguishing data that contains relatively important information from data that does not, and learning accordingly, the attention technique facilitates the flow of information within the convolutional neural network.

For example, a frame in which a batter is swinging may contain both spectator information and batter information. Obtaining the Spatial Attention Map may then mean obtaining a map in which the pixel values of the area where the batter is located are maximized and the pixel values of the area where the spectators are located are minimized.

The CBAM (Convolutional Block Attention Module) is described in detail below with reference to Figure 4.

The feature vector acquisition module (50) may acquire a feature vector for each important frame and non-important frame. The Spatial Attention Map of each important frame and non-important frame is input to an LSTM (Long Short Term Memory) model, and the output of the LSTM model may be the feature vector of the important frame or of the non-important frame. LSTM is a type of recurrent neural network (RNN) whose unit consists of a cell, an input gate, an output gate, and a forget gate; an RNN built from such LSTM units is called an LSTM model.
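To make the roles of the cell and the three gates concrete, the following is a minimal sketch of one LSTM step using the standard LSTM update equations. It is deliberately reduced to scalar inputs and states, and the parameter values are arbitrary; the actual model in the specification takes Spatial Attention Maps as input and its dimensions are not given here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step.

    `params` maps a gate name to (input weight, recurrent weight, bias).
    All values are illustrative, not taken from the specification.
    """
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # forget gate: how much of the old cell to keep
    i = sigmoid(pre("input"))        # input gate: how much new content to write
    o = sigmoid(pre("output"))       # output gate: how much of the cell to expose
    g = math.tanh(pre("candidate"))  # candidate cell content
    c = f * c_prev + i * g           # updated cell state
    h = o * math.tanh(c)             # updated hidden state (the feature)
    return h, c


params = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "output", "candidate")}
h, c = 0.0, 0.0
for x in [0.1, 0.7, 0.3]:  # one toy scalar feature per frame
    h, c = lstm_step(x, h, c, params)
```

After the loop, `h` plays the role of the feature passed on to classification. A practical implementation would use vector-valued states and a deep learning library rather than this scalar form.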

The sports activity classification module (60) may obtain a probability value from the feature vector of an important frame or a non-important frame and, only when the probability value is greater than or equal to a preset probability value, classify the frame as the sports activity matched with the video information constituting the frame. When the probability value is less than the preset probability value, the LSTM (Long Short Term Memory) model may be run again repeatedly.
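The thresholded decision can be sketched as below. The specification does not say how the probability value is derived from the feature vector, so this sketch assumes a softmax over per-class scores; `classify`, `labels`, and the threshold of 0.8 are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels, threshold):
    """Return (label, probability); label is None when below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:
        return labels[best], probs[best]
    # Below the preset probability: no classification is made at this point.
    return None, probs[best]


labels = ["swing", "home run", "strike"]
confident = classify([4.0, 0.0, 0.0], labels, threshold=0.8)
uncertain = classify([0.1, 0.0, 0.0], labels, threshold=0.8)
```

In the first call the top class clears the threshold and is returned; in the second no label is returned, which corresponds to the case where the model is run again.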

Figure 2 is a conceptual diagram showing the important frame extraction module, and Figure 3 is a conceptual diagram showing multiple frames multiplied by temporal attention scores through the TAM calculation process.

The important frame extraction module (30) may include a pixel extraction unit (31), a pixel summing unit (32), a pixel pooling operation unit (33), a weight setting unit (34), and a temporal attention score acquisition unit (35).

The pixel extraction unit (31) may extract the pixels of the R channel, the G channel, and the B channel from a frame, and the pixel summing unit (32) may sum the extracted pixels.

The pixel pooling operation unit (33) may perform MAX-POOLING and AVERAGE-POOLING operations on the summed pixel values. Performing these pooling operations reduces the size of the pixel data.

The weight setting unit (34) sets a weight for each of the MAX-POOLING operation value and the AVERAGE-POOLING operation value, and the two weights may be set so that they sum to 1. The user may assign the relatively larger weight to the MAX-POOLING operation value or to the AVERAGE-POOLING operation value.
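The channel summation, pooling, and weighting steps can be sketched together as follows. This sketch assumes global pooling, so that each frame is reduced to a single scalar; the specification only states that pooling reduces the size of the pixel data. The names `a_avg` and `a_max` are hypothetical stand-ins for the two weights.

```python
def channel_sum(r, g, b):
    """Element-wise sum of the R, G and B planes of one frame."""
    return [
        [rv + gv + bv for rv, gv, bv in zip(r_row, g_row, b_row)]
        for r_row, g_row, b_row in zip(r, g, b)
    ]


def global_avg_pool(plane):
    values = [v for row in plane for v in row]
    return sum(values) / len(values)


def global_max_pool(plane):
    return max(v for row in plane for v in row)


def weighted_pool(plane, a_avg, a_max):
    """Combine both pooled values with weights that must sum to 1."""
    if abs(a_avg + a_max - 1.0) > 1e-9:
        raise ValueError("the two pooling weights must sum to 1")
    return a_avg * global_avg_pool(plane) + a_max * global_max_pool(plane)


# Toy 2x2 frame split into its three channels.
r = [[1, 2], [3, 4]]
g = [[1, 1], [1, 1]]
b = [[0, 0], [0, 2]]
summed = channel_sum(r, g, b)
pooled = weighted_pool(summed, a_avg=0.5, a_max=0.5)
```

For this toy frame the summed plane is `[[2, 3], [4, 7]]`, the average is 4.0, the maximum is 7, and equal weights give 5.5. Shifting weight toward `a_max` emphasizes the brightest location, while shifting it toward `a_avg` emphasizes overall intensity.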

The temporal attention score acquisition unit (35) may obtain the temporal attention score from the sum of the MAX-POOLING operation value and the AVERAGE-POOLING operation value.

The temporal attention score acquisition unit (35) may consist of a first Fully Connected Layer, a ReLU activation function execution unit, a second Fully Connected Layer, and a Sigmoid activation function execution unit.

The first Fully Connected Layer may receive the sum of the MAX-POOLING operation value and the AVERAGE-POOLING operation value produced by the pixel pooling operation unit (33).

The ReLU activation function execution unit may receive the output of the first Fully Connected Layer.

The second Fully Connected Layer may receive the output of the ReLU activation function execution unit.

The Sigmoid activation function execution unit may receive the output of the second Fully Connected Layer.
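The four-stage chain described above (Fully Connected Layer, ReLU, Fully Connected Layer, Sigmoid) can be sketched in plain Python as follows. The weight values are arbitrary, bias terms are omitted, and the layout of one pooled value per frame with a bottleneck of half the width is an assumption made for illustration.

```python
import math


def matvec(weights, x):
    """Fully connected layer without bias: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


def temporal_attention_scores(pooled, w1, w2):
    """FC -> ReLU -> FC -> Sigmoid, yielding one score in (0, 1) per input."""
    return sigmoid(matvec(w2, relu(matvec(w1, pooled))))


# Assumed layout: one pooled value per frame (4 frames), bottleneck width 2.
pooled = [5.5, 2.0, 4.0, 1.0]
w1 = [[0.1, -0.2, 0.3, 0.0],
      [0.0, 0.1, -0.1, 0.2]]          # first layer: 4 inputs -> 2 outputs
w2 = [[0.5, -0.5], [0.2, 0.1],
      [-0.3, 0.4], [0.1, 0.1]]        # second layer: 2 inputs -> 4 outputs
scores = temporal_attention_scores(pooled, w1, w2)
```

Because the final stage is a sigmoid, each score lies strictly between 0 and 1, which makes it usable as a multiplicative weight on its frame.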

The process of calculating the temporal attention score according to one embodiment of the present invention can be expressed by Mathematical Formula 1 below, which represents the TAM (Temporal Attention Module) calculation process.

(Mathematical Formula 1, comprising expressions (1) to (5))

In Equation (1), I denotes a specific frame extracted by the frame extraction module (20), and the equation defines the sum of the R-channel, G-channel, and B-channel pixel values of that frame as extracted by the pixel extraction unit (31).

Equation (2) gives the value obtained when the pixel pooling operation unit (33) performs the AVERAGE-POOLING operation.

Equation (3) gives the value obtained when the pixel pooling operation unit (33) performs the MAX-POOLING operation.

Equation (4) gives the weighted sum of the pooled values from Equations (2) and (3). a1 and a2 are optimal pooling parameters that act as the weights. One pooled value is weighted by a1 and the other by a2: a1 > a2 when the user assigns the larger weight to the former, and a2 > a1 when the user assigns the larger weight to the latter. In either case, a1 and a2 may be set to sum to 1.

Equation (5) expresses the process by which the temporal attention score acquisition unit (35) obtains the temporal attention score from the sum of the MAX-POOLING operation value and the AVERAGE-POOLING operation value.

은 제1 Fully Connected Layer에 최대풀링(MAX-POOLING) 연산값과 평균풀링(AVERAGE-POOLING) 연산값을 합산한 값이 입력으로 입력되는 것을 의미하며, 이는 다층 신경망 구조의 가중치 벡터를 의미한다. 이 때, Fully Connected Layer는 픽셀 값을 연결하는 완전 결합 층일 수 있으며, 복수개로 마련될 수 있다. 이 때, W1은 의 벡터 형태이고, C는Fully Connected Layer의 채널을 의미하며, r은 Reduction Ratio를 의미한다. It means that the value of the sum of the MAX-POOLING operation value and the AVERAGE-POOLING operation value is input to the first Fully Connected Layer, which means the weight vector of the multilayer neural network structure. At this time, the Fully Connected Layer can be a fully connected layer that connects pixel values, and can be provided in multiple layers. At this time, W1 is is a vector form, and C isIt refers to the channel of the Fully Connected Layer, and r refers to the Reduction Ratio.

는 를 ReLU의 활성화 함수에 입력한 값을 의미할 수 있다. 이 때, ReLU의 활성화 함수는 입력값이 양수인 경우 입력값이 출력값이 되며, 입력값이 음수인 경우 출력값이 0이되는 함수를 의미한다. Is It can mean the value input to the activation function of ReLU. At this time, the activation function of ReLU means a function in which the input value becomes the output value when the input value is positive, and the output value becomes 0 when the input value is negative.

는 의 출력 값이 제2 Fully Connected Layer에 입력되는 것을 의미하며, 이는 다층 신경망 구조의 가중치 벡터를 의미한다. 이 때, 의 벡터 형태이고, C는Fully Connected Layer의 채널을 의미하며, r은 Reduction Ratio를 의미한다. Is This means that the output value is input to the second fully connected layer, which means the weight vector of the multilayer neural network structure. At this time, is a vector form, and C isIt refers to the channel of the Fully Connected Layer, and r refers to the Reduction Ratio.

는 Sigmoid 함수에 입력한 것을 의미하는 식이며, 로 표현된 Sigmoid 함수는 0에서 1사이값으로 출력된다. 이 때, 출력된 값을 시간적 주의집중 점수라고 한다. is an expression that means what is input to the Sigmoid function, The Sigmoid function, expressed as , outputs a value between 0 and 1. At this time, the output value is called the temporal attention score.
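The TAM calculation described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `temporal_attention_scores`, the choice of one globally pooled value per frame (so that the channel count C equals the number of frames), the pairing of `a1` with average pooling and `a2` with max pooling, and the hand-set weight matrices `w1` and `w2` are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function with output in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def temporal_attention_scores(frames, w1, w2, a1=0.5, a2=0.5):
    """Return one temporal attention score per frame.

    frames: list of (R, G, B) tuples, each channel a 2D list of pixel values.
    w1: (C/r) x C weight matrix and w2: C x (C/r) weight matrix,
        where C = len(frames) in this sketch (an assumption).
    """
    assert abs(a1 + a2 - 1.0) < 1e-9, "a1 + a2 must equal 1"
    pooled = []
    for r_ch, g_ch, b_ch in frames:
        # Sum the R, G and B pixel values position by position.
        summed = [r + g + b
                  for r_row, g_row, b_row in zip(r_ch, g_ch, b_ch)
                  for r, g, b in zip(r_row, g_row, b_row)]
        f_avg = sum(summed) / len(summed)  # average pooling
        f_max = max(summed)                # max pooling
        pooled.append(a1 * f_avg + a2 * f_max)  # weighted sum, a1 + a2 = 1
    # First Fully Connected Layer followed by ReLU.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, pooled))) for row in w1]
    # Second Fully Connected Layer followed by Sigmoid.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return [sigmoid(z) for z in logits]


# Two hypothetical 2x2 frames: one bright, one black.
bright = ([[100, 100], [100, 100]],) * 3
black = ([[0, 0], [0, 0]],) * 3
w1 = [[0.01, -0.01]]   # C = 2, r = 2, so w1 is 1 x 2
w2 = [[1.0], [-1.0]]   # w2 is 2 x 1
scores = temporal_attention_scores([bright, black], w1, w2)
```

With these hand-set weights the bright frame receives the higher score. In a trained model the weights would be learned rather than chosen by hand.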

The above is described in detail in "S. Liu, X. Ma, H. Wu, Y. Li, An End to End Framework with Adaptive Spatio-Temporal Attention Module for Human Action Recognition, Digital Object Identifier, Vol.8 (2020) 47220-47231".

Fig. 3 shows the result of obtaining a temporal attention score for each frame through the TAM (Temporal Attention Module) calculation and multiplying each frame by its temporal attention score.

For example, when the vector value obtained by multiplying the first frame by its temporal attention score is greater than the vector value obtained by multiplying the second frame by its temporal attention score, the first frame is set as an important frame and the second frame is set as a non-important frame. In Fig. 3, a frame with a solid border may denote a frame set as an important frame, and a frame with a dotted border may denote a frame set as a non-important frame.
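The comparison described above can be illustrated with a short Python sketch. The source does not specify how the weighted vector values are compared or how many frames are kept, so the function name `split_important_frames`, the use of the sum of weighted pixel values as the magnitude, and the `top_k` parameter are assumptions made only for this illustration.

```python
def split_important_frames(frames, scores, top_k):
    """Weight each frame by its temporal attention score and rank the results.

    frames: list of flat pixel lists; scores: one score per frame.
    Returns the weighted frames plus the indices of important and
    non-important frames.
    """
    weighted = [[s * p for p in frame] for frame, s in zip(frames, scores)]
    # Hypothetical magnitude: sum of the weighted pixel values of a frame.
    magnitude = [sum(w) for w in weighted]
    ranked = sorted(range(len(frames)), key=lambda i: magnitude[i], reverse=True)
    important = sorted(ranked[:top_k])
    non_important = sorted(ranked[top_k:])
    return weighted, important, non_important


frames = [[10, 20], [30, 40], [5, 5]]
scores = [0.9, 0.1, 0.5]
weighted, important, non_important = split_important_frames(frames, scores, top_k=1)
```

Here the first frame has the largest weighted magnitude and is the only one marked important. The second frame has larger raw pixel values but falls behind once its low score is applied.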

Fig. 4 is a conceptual diagram showing the process of obtaining a spatial attention map through a 2D CNN.

The important frames and non-important frames multiplied by the temporal attention score may be input to a 2D CNN model. The 2D CNN model may be composed of a plurality of Convolution Layers and a CBAM (Convolutional Block Attention Module).

The spatial information extraction module (40) may extract spatial information of the important frames and the non-important frames. Here, the spatial information may mean a spatial attention map (Spatial Attention Map).

The spatial information extraction module (40) may include a Convolution Layer, a Feature Map pooling operation unit, a channel attention map acquisition unit, and a spatial attention map acquisition unit.

When a frame multiplied by the temporal attention score is input to the Convolution Layer, a Feature Map may be output. The Feature Map may be input to the Feature Map pooling operation unit, where the max pooling (MAX-POOLING) operation and the average pooling (AVERAGE-POOLING) operation may be performed.

The channel attention map acquisition unit may acquire a channel attention map (Channel Attention Map) from the output value of the Feature Map pooling operation unit.

The spatial attention map acquisition unit may acquire a spatial attention map (Spatial Attention Map) from the channel attention map (Channel Attention Map).

The process of acquiring the channel attention map (Channel Attention Map) and the spatial attention map (Spatial Attention Map) according to an embodiment of the present invention may be expressed by Mathematical Formula 2 below, which represents the CBAM (Convolutional Block Attention Module) calculation.

(Mathematical Formula 2)

$$M_c(F) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big) \tag{1}$$

$$F' = M_c(F) \otimes F \tag{2}$$

$$M_s(F') = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F');\ \mathrm{MaxPool}(F')])\big) \tag{3}$$

$$F'' = M_s(F') \otimes F' \tag{4}$$

In equation (1), when a frame multiplied by the temporal attention score is input to the Convolution Layer, a Feature Map is obtained, which is denoted $F$. $F^c$ denotes the vector obtained by performing a pooling operation on $F$ along the channel axis. $F^c_{avg}$ denotes the vector obtained by performing the average pooling (AVERAGE-POOLING) operation on $F$ along the channel axis, and $F^c_{max}$ denotes the vector obtained by performing the max pooling (MAX-POOLING) operation on $F$ along the channel axis. $W_0$ and $W_1$ are the weights of the shared multilayer perceptron applied to both vectors, $\sigma$ denotes the Sigmoid function, and $M_c(F)$ denotes the channel attention map.

In equation (2), $\otimes$ denotes element-wise multiplication, and $F'$ denotes the Feature Map obtained by refining $F$.

In equation (3), $\mathrm{AvgPool}(F')$ is the value obtained by performing the average pooling (AVERAGE-POOLING) operation on $F'$, and $\mathrm{MaxPool}(F')$ is the value obtained by performing the max pooling (MAX-POOLING) operation on $F'$. $f$ denotes a convolution operation, $f^{7 \times 7}$ denotes a 7x7 convolution operation, and $M_s(F')$ denotes the spatial attention map (Spatial Attention Map).

Although Fig. 4 shows only two Convolution Layers, a plurality of Convolution Layers may be provided. $F''$ obtained in equation (4) denotes the Feature Map obtained by refining $F'$. $F''$ may be input to a Convolution Layer, and the process of equation (3) may then be performed. That is, depending on the structure of the convolutional neural network, the process of inputting the Feature Map $F''$ to a new Convolution Layer may be repeated.
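The channel attention step of the CBAM calculation can be sketched in plain Python as follows. This is a hedged illustration that follows the cited CBAM formulation, in which each channel is pooled over its spatial positions to form a per-channel descriptor and both descriptors pass through a shared two-layer perceptron with a ReLU in between. The function names and the hand-set weights `w0` and `w1` are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function with output in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def shared_mlp(w0, w1, descriptor):
    """Two-layer perceptron shared by the average and max descriptors."""
    hidden = [max(0.0, h) for h in matvec(w0, descriptor)]  # ReLU
    return matvec(w1, hidden)


def channel_attention(fmap, w0, w1):
    """fmap: C x H x W nested lists. Returns one weight in (0, 1) per channel."""
    avg_desc = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    max_desc = [max(max(row) for row in ch) for ch in fmap]
    avg_out = shared_mlp(w0, w1, avg_desc)
    max_out = shared_mlp(w0, w1, max_desc)
    return [sigmoid(a + m) for a, m in zip(avg_out, max_out)]


def refine_by_channel(fmap, mc):
    """Element-wise product of the feature map with its channel attention map."""
    return [[[w * v for v in row] for row in ch] for ch, w in zip(fmap, mc)]


fmap = [[[1, 2], [3, 4]],   # channel 0: average 2.5, max 4
        [[0, 0], [0, 0]]]   # channel 1: all zeros
w0 = [[1.0, 0.0]]           # C = 2 reduced to C/r = 1
w1 = [[1.0], [0.0]]
mc = channel_attention(fmap, w0, w1)
refined = refine_by_channel(fmap, mc)
```

In this toy input the first channel receives a weight close to 1 and the second a weight of exactly 0.5, since its pre-activation is zero.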

The calculation of the channel attention map and the spatial attention map may be performed by a software module inserted into one or more of the data connection sections between the plurality of Convolution Layers of the convolutional neural network and the Pooling Layers that perform the pooling operation. The data connection section between a Convolution Layer and a Pooling Layer in which the calculation is performed is not limited to a specific section.
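The spatial attention step of the CBAM calculation (average and max pooling of the refined Feature Map along the channel axis, a 7x7 convolution over the two stacked maps, then a Sigmoid) can be sketched as below. This is an illustrative sketch only: the function name, the zero padding at the borders, and the hand-set kernel are assumptions, and a trained model would learn the kernel weights.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function with output in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def spatial_attention(fmap, kernel, bias=0.0):
    """fmap: C x H x W nested lists; kernel: 2 x k x k weights with k odd.

    Returns an H x W spatial attention map with values in (0, 1).
    """
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    # Pool along the channel axis to get two H x W maps.
    avg_map = [[sum(fmap[c][y][x] for c in range(channels)) / channels
                for x in range(width)] for y in range(height)]
    max_map = [[max(fmap[c][y][x] for c in range(channels))
                for x in range(width)] for y in range(height)]
    stacked = [avg_map, max_map]
    k = len(kernel[0])
    pad = k // 2
    attention = []
    for y in range(height):
        row = []
        for x in range(width):
            acc = bias
            for ch in range(2):
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y + dy - pad, x + dx - pad
                        # Positions outside the map count as zero (zero padding).
                        if 0 <= yy < height and 0 <= xx < width:
                            acc += kernel[ch][dy][dx] * stacked[ch][yy][xx]
            row.append(sigmoid(acc))
        attention.append(row)
    return attention


# Hypothetical 7x7 kernel that only looks at the centre position of both maps.
kernel = [[[0.0] * 7 for _ in range(7)] for _ in range(2)]
kernel[0][3][3] = 1.0
kernel[1][3][3] = 1.0
fmap = [[[1, 0], [0, 0]],
        [[3, 0], [0, 0]]]
ms = spatial_attention(fmap, kernel)
refined = [[[ms[y][x] * ch[y][x] for x in range(2)] for y in range(2)] for ch in fmap]
```

With this kernel, the attention value at each position depends only on the average and maximum across channels at that same position, so the top-left position stands out and all others sit at 0.5.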

The above is described in detail in "S. Woo, J. Park, J. Lee, I. Kweon, CBAM: Convolutional Block Attention Module, Proceedings of the European Conference on Computer Vision (ECCV) (2018) 3-19".

Fig. 5 is a conceptual diagram showing the process of classifying a sports activity after the frames have passed through the 2D CNN model.

The refined Feature Map output by the 2D CNN model may be input to an LSTM (Long Short Term Memory) model as an input value.

The output value of the LSTM (Long Short Term Memory) model may be input to an MLP (Multi-Layer Perceptron). The MLP is a type of feed-forward artificial neural network composed of at least three layers: an input layer, an output layer, and a hidden layer. It is trained using a supervised learning technique called back-propagation, and it can distinguish data that are not linearly separable. The output value of the MLP may be a feature vector.
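The feed-forward structure of an MLP can be illustrated with a small Python sketch. The example below is not the network of the invention: the function name `mlp_forward` and the hand-set weights are hypothetical, chosen so that a single hidden layer with ReLU reproduces XOR, a classic case of data that are not linearly separable. In practice the weights would be learned by back-propagation.

```python
def mlp_forward(x, layers):
    """Feed x through a list of (weights, biases) layers.

    ReLU is applied after every layer except the last (output) layer.
    """
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


# Hand-set weights (an assumption for illustration): 2 inputs, 2 hidden units, 1 output.
xor_layers = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]),  # hidden layer
    ([[1.0, -2.0]], [0.0]),                   # output layer
]
xor_outputs = [mlp_forward([a, b], xor_layers)[0]
               for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

The four outputs are 0, 1, 1, 0, which no single linear layer could produce for these inputs.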

When the output feature vector of the important frame and the output feature vector of the non-important frame are input to the Sigmoid activation function, a probability value may be output for each. Only when the output probability value is equal to or greater than a preset probability value may the frame be classified as the sports activity stored in the database module (10) in association with the video information. If the output probability value is less than the preset probability value, the LSTM process may be repeated.
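The thresholding step described above can be sketched as follows. The sketch assumes that the feature vector holds one element per activity label and uses a hypothetical preset probability value of 0.5. The label list is taken from the dataset description in this document, while the function name `classify` and the example feature vector are illustrative only.

```python
import math

ACTIVITIES = ["Ball", "Strike", "Swing", "Bunt", "Foul", "Hit", "Hit By Pitch", "In Play"]


def sigmoid(z):
    """Numerically stable logistic function with output in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(feature_vector, labels, threshold=0.5):
    """Return (label, probability) pairs whose probability meets the threshold."""
    probabilities = [sigmoid(v) for v in feature_vector]
    return [(label, p) for label, p in zip(labels, probabilities) if p >= threshold]


# Hypothetical feature vector with one element per activity.
feature_vector = [2.0, -1.0, 0.0, -3.0, -2.0, -0.5, -4.0, -1.5]
selected = classify(feature_vector, ACTIVITIES)
selected_labels = [label for label, _ in selected]
```

Because the comparison is "equal to or greater than", an element whose probability is exactly 0.5 is still accepted. If no label meets the threshold, the returned list is empty, which corresponds to the case where the source repeats the LSTM process.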

Fig. 6 shows a table evaluating the sports activity classification learning device of the present invention using mAP.

To evaluate the activity classification performance of the sports activity classification learning device (1) of the present invention, an experiment was conducted using the MLB Youtube Dataset. The MLB Youtube Dataset was created from videos of 20 games of the 2017 MLB Post Season. The dataset contains a total of 5,846 videos, and each video is labeled with one or more of eight sports activities according to its content: Ball, Strike, Swing, Bunt, Foul, Hit, Hit By Pitch, and In Play.

In the present invention, to search for appropriate hyperparameters, 463 videos were extracted from the dataset through stratified sampling and used as the validation set, the remaining 4,200 videos were used for training, and 1,183 videos were used for evaluation.

mAP (mean Average Precision) was used as the evaluation metric. mAP is a measure frequently used in fields such as object detection. The Average Precision is computed for each activity, and these values are summed and divided by the total number of classes, so that mAP indicates how accurate the predictions are on average across all activities.
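The metric can be made concrete with a short Python sketch. It uses the common ranking-based definition of Average Precision, which averages the precision at each rank where a positive sample appears. The source does not state which variant was used in the experiment, so this choice, the function names, and the toy scores are assumptions.

```python
def average_precision(scores, labels):
    """Average Precision for one class.

    scores: predicted confidence per video; labels: 1 if the video has the class.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, index in enumerate(order, start=1):
        if labels[index]:
            hits += 1
            precisions.append(hits / rank)  # precision at this positive's rank
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision(scores_by_class, labels_by_class):
    """Mean of the per-class Average Precision values."""
    aps = [average_precision(s, l) for s, l in zip(scores_by_class, labels_by_class)]
    return sum(aps) / len(aps)


# Toy example with two classes and three videos.
scores_by_class = [[0.9, 0.8, 0.3], [0.2, 0.7, 0.6]]
labels_by_class = [[1, 0, 1], [0, 1, 1]]
map_value = mean_average_precision(scores_by_class, labels_by_class)
```

For the first class the positives appear at ranks 1 and 3, giving an Average Precision of (1/1 + 2/3) / 2 = 5/6. For the second class both positives are ranked first, giving 1. The mean over the two classes is 11/12.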

In Fig. 6, RGB means that only RGB frames were used as input, Flow means that only optical flow data were used as input, and Two-stream means that both RGB frames and optical flow were used as input.

The Proposed Method, which is the present invention, achieves higher values than the existing models even though it uses only RGB frames as input.

The control method of the sports activity classification learning device according to an embodiment of the present invention is carried out on substantially the same configuration as the sports activity classification learning device (1) shown in Fig. 1. Accordingly, the same reference numerals are given to the same components as those of the sports activity classification learning device (1) of Fig. 1, and repeated descriptions are omitted.

Fig. 7 is a flowchart showing the control method of the sports activity classification learning device according to an embodiment of the present invention.

The control method of the sports activity classification learning device, which classifies a sports activity by extracting at least one frame from sports activity video information, may include: a step (100) of matching the sports activity video information with sports activity classification information and storing them as learning data; a step (110) of extracting at least one frame from the sports activity video information stored as the learning data; a step (120) of extracting an important frame from the extracted frames; a step (130) of extracting spatial information of the important frame; a step (140) of inputting the spatial information of the important frame to an LSTM (Long Short Term Memory) model as an input value; a step (150) of acquiring a feature vector of the important frame; a step (160) of obtaining a probability value from the feature vector of the important frame; and a step (170) of classifying the frame as the sports activity matched with the video information constituting the important frame only when the probability value is equal to or greater than a preset probability value.

Figs. 8 and 9 are flowcharts showing the process of acquiring the temporal attention score.

Referring to Fig. 8, the step (120) of extracting an important frame from the extracted frames may include: a step (200) of extracting R-channel pixels, G-channel pixels, and B-channel pixels from the frame; a step (210) of summing the extracted pixels; a step (220) of performing the max pooling (MAX-POOLING) and average pooling (AVERAGE-POOLING) operations on the summed pixel value; a step (230) of setting a weight for each of the MAX-POOLING operation value and the AVERAGE-POOLING operation value such that the sum of the weight set for MAX-POOLING and the weight set for AVERAGE-POOLING is 1; and a step (240) of acquiring the temporal attention score using the sum of the MAX-POOLING operation value and the AVERAGE-POOLING operation value.

Referring to Fig. 9, the step (240) of acquiring the temporal attention score using the sum of the MAX-POOLING operation value and the AVERAGE-POOLING operation value may include: a step (300) of inputting the sum of the MAX-POOLING operation value and the AVERAGE-POOLING operation value to the first Fully Connected Layer as an input value; a step (310) of inputting the output value of the first Fully Connected Layer to the ReLU activation function execution unit as an input value; a step (320) of inputting the output value of the ReLU activation function execution unit to the second Fully Connected Layer as an input value; and a step (330) of inputting the output value of the second Fully Connected Layer to the Sigmoid activation function execution unit as an input value.

Fig. 10 is a flowchart showing the process of acquiring the spatial attention map.

The step (130) of extracting spatial information of the important frame may include: a step (400) of inputting the frame multiplied by the temporal attention score to the first Convolution Layer to output a Feature Map; a step (410) of performing the max pooling (MAX-POOLING) operation and the average pooling (AVERAGE-POOLING) operation on the Feature Map; a step (420) of acquiring a channel attention map (Channel Attention Map) using the values obtained by performing the MAX-POOLING operation and the AVERAGE-POOLING operation on the Feature Map; and a step (430) of acquiring a spatial attention map (Spatial Attention Map) from the channel attention map (Channel Attention Map).

The step of extracting spatial information of a non-important frame may proceed in the same manner as the step of extracting spatial information of the important frame.

The control method of the sports activity classification learning device (1) described above may be implemented as an application, or in the form of program instructions executable by various computer components, and recorded on a computer-readable recording medium. The computer-readable recording medium may contain program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the field of computer software.

Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. A hardware device may be configured to operate as one or more software modules in order to perform the processing according to the present invention, and vice versa.

Although embodiments of the present invention have been described above, the spirit of the present invention is not limited to the embodiments presented in this specification. Those skilled in the art who understand the spirit of the present invention may readily propose other embodiments within the scope of the same spirit by adding, changing, deleting, or supplementing components, and such embodiments also fall within the scope of the spirit of the present invention.

1: Sports activity classification learning device
10: Database module
20: Frame extraction module
30: Important frame extraction module
31: Pixel extraction unit
32: Pixel summing unit
33: Pixel pooling operation unit
34: Weight setting unit
35: Temporal attention score acquisition unit
40: Spatial information extraction module
50: Feature vector acquisition module
60: Sports activity classification module

Claims

A sports activity classification learning device that extracts at least one frame from sports activity video information and classifies a sports activity, the device comprising:
a database module that matches the sports activity video information with sports activity classification information and stores them as learning data;
a frame extraction module that extracts at least one frame from the video information stored in the database module;
an important frame extraction module that extracts an important frame from the extracted frames;
a spatial information extraction module that extracts spatial information of the important frame;
a feature vector acquisition module that inputs the spatial information of the important frame to an LSTM (Long Short Term Memory) model as an input value and acquires a feature vector of the important frame; and
a sports activity classification module that obtains a probability value from the feature vector of the important frame and, only when the probability value is equal to or greater than a preset probability value, classifies the frame as the sports activity matched with the video information constituting the important frame,
wherein the important frame extraction module comprises:
a pixel extraction unit that extracts R-channel pixels, G-channel pixels, and B-channel pixels from the frame;
a pixel summing unit that sums the extracted pixels;
a pixel pooling operation unit that performs MAX-POOLING and AVERAGE-POOLING operations on the summed pixel value;
a weight setting unit that sets a weight for each of the MAX-POOLING operation value and the AVERAGE-POOLING operation value such that the sum of the weight set for MAX-POOLING and the weight set for AVERAGE-POOLING is 1; and
a temporal attention score acquisition unit that acquires a temporal attention score using the sum of the MAX-POOLING operation value and the AVERAGE-POOLING operation value.

(Deleted)

The sports activity classification learning device according to claim 1,
wherein the important frame extraction module
extracts, as the important frame, a frame whose value after multiplication by the temporal attention score acquired by the temporal attention score acquisition unit is relatively large.

The sports activity classification learning device according to claim 3,
wherein the temporal attention score acquisition unit comprises:
a first Fully Connected Layer to which the sum of the MAX-POOLING operation value and the AVERAGE-POOLING operation value is input as an input value;
a ReLU activation function execution unit to which the output value of the first Fully Connected Layer is input as an input value;
a second Fully Connected Layer to which the output value of the ReLU activation function execution unit is input as an input value; and
a Sigmoid activation function execution unit to which the output value of the second Fully Connected Layer is input as an input value.

The sports activity classification learning device according to claim 1,
wherein the spatial information extraction module comprises:
a first Convolution Layer that receives the important frame and outputs a Feature Map;
a Feature Map pooling operation unit that performs a MAX-POOLING operation and an AVERAGE-POOLING operation on the Feature Map;
a channel attention map acquisition unit that acquires a channel attention map from the output value of the Feature Map pooling operation unit; and
a spatial attention map acquisition unit that acquires a spatial attention map from the channel attention map.

A control method of a sports activity classification learning device that classifies a sports activity by extracting at least one frame from sports activity video information, the method comprising:
matching the sports activity video information with sports activity classification information and storing them as learning data;
extracting at least one frame from the sports activity video information stored as the learning data;
extracting an important frame from the extracted frames;
extracting spatial information of the important frame;
inputting the spatial information of the important frame to an LSTM (Long Short Term Memory) model as an input value and acquiring a feature vector of the important frame; and
obtaining a probability value from the feature vector of the important frame and, only when the probability value is equal to or greater than a preset probability value, classifying the frame as the sports activity matched with the video information constituting the important frame,
wherein extracting the important frame from the extracted frames comprises:
extracting R-channel pixels, G-channel pixels, and B-channel pixels from the frame and summing the extracted pixels;
performing MAX-POOLING and AVERAGE-POOLING operations on the summed pixel value;
setting a weight for each of the MAX-POOLING operation value and the AVERAGE-POOLING operation value such that the sum of the weight set for MAX-POOLING and the weight set for AVERAGE-POOLING is 1; and
acquiring a temporal attention score using the sum of the MAX-POOLING operation value and the AVERAGE-POOLING operation value.

(Deleted)

The control method according to claim 6,
wherein acquiring the temporal attention score using the sum of the MAX-POOLING operation value and the AVERAGE-POOLING operation value comprises:
inputting the sum of the MAX-POOLING operation value and the AVERAGE-POOLING operation value to a first Fully Connected Layer as an input value;
inputting the output value of the first Fully Connected Layer to a ReLU activation function execution unit as an input value;
inputting the output value of the ReLU activation function execution unit to a second Fully Connected Layer as an input value; and
inputting the output value of the second Fully Connected Layer to a Sigmoid activation function execution unit as an input value.

The control method according to claim 6,
wherein extracting the spatial information of the important frame comprises:
inputting the frame multiplied by the temporal attention score to a first Convolution Layer to output a Feature Map;
performing a MAX-POOLING operation and an AVERAGE-POOLING operation on the Feature Map;
acquiring a channel attention map using the values obtained by performing the MAX-POOLING operation and the AVERAGE-POOLING operation on the Feature Map; and
acquiring a spatial attention map from the channel attention map.

A computer-readable storage medium on which is recorded a computer program for performing the control method of the sports activity classification learning device according to claim 6.