KR20230022010A

KR20230022010A - Method of controlling sports activity classification learning apparatus, computer readable medium and apparatus for performing the method

Info

Publication number: KR20230022010A
Application number: KR1020210104123A
Authority: KR
Inventors: 이수원; 류광현
Original assignee: 숭실대학교산학협력단
Priority date: 2021-08-06
Filing date: 2021-08-06
Publication date: 2023-02-14
Also published as: KR102702069B1; WO2023013809A1

Abstract

Provided is a sports activity classification learning device, which includes: a database module which matches sports activity image information with sports activity classification information and stores the matched data as learning data; a frame extraction module which extracts at least one frame from the stored image information; an important frame extraction module which extracts important frames from the extracted frames; a spatial information extraction module which extracts spatial information of the important frames; a feature vector acquisition module which acquires feature vectors of the important frames; and a sports activity classification module which classifies the important frames into sports activities.

Description

Control method of sports activity classification learning device, recording medium and device for performing the same

The present invention relates to a control method of a sports activity classification learning device, and to a recording medium and a device for performing the same. More particularly, it relates to a control method of a sports activity classification learning device that classifies sports activities by receiving at least one frame from sports activity image information, and to a recording medium and a device for performing the same.

Activity recognition technology recognizes what kind of activity is occurring and classifies the recognized activity.

Activity recognition may be performed in a sensor-based manner. Sensor-based activity recognition uses sensors such as accelerometers and gyro sensors. In sporting events, however, athletes are prohibited from wearing any electronic equipment, including sensors such as accelerometers and gyro sensors. Sensor-based activity recognition is therefore difficult to use in the field of sports.

As a result, there is a need to apply various deep learning techniques, such as Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), Deep Belief Networks (DBN), and Recurrent Neural Networks (RNN), to the image information of sports games in order to recognize the sports activities of players and to classify the recognized sports activities.

The technical problem to be solved by the present invention is to provide a control method of a sports activity classification learning device that extracts frames from the image information of a sports game, obtains a temporal attention score, obtains a spatial attention map, and classifies sports activities, and to provide a recording medium and a device for performing the same.

One aspect of the present invention is a sports activity classification learning device that extracts at least one frame from sports activity image information and classifies a sports activity. The device may include: a database module that matches the sports activity image information with sports activity classification information and stores them as learning data; a frame extraction module that extracts at least one frame from the image information stored in the database module; an important frame extraction module that extracts important frames from the extracted frames; a spatial information extraction module that extracts spatial information of the important frames; a feature vector acquisition module in which the spatial information of the important frames is input to a Long Short Term Memory (LSTM) model and which acquires feature vectors of the important frames; and a sports activity classification module that obtains a probability value from the feature vectors of the important frames and, only when the probability value is greater than or equal to a preset probability value, classifies the important frames into the sports activity matched with the image information constituting the important frames.

In addition, the important frame extraction module may include: a pixel extraction unit that extracts R-channel pixels, G-channel pixels, and B-channel pixels from the frame; a pixel summing unit that sums the extracted pixels; a pixel pooling operation unit that performs max pooling (MAX-POOLING) and average pooling (AVERAGE-POOLING) operations on the summed pixel value; a weight setting unit that sets a weight for each of the max pooling result and the average pooling result, such that the weight set for max pooling and the weight set for average pooling sum to 1; and a temporal attention score acquisition unit that obtains a temporal attention score using the sum of the max pooling result and the average pooling result.

In addition, the important frame extraction module may extract, as an important frame, a frame whose value after multiplication by the temporal attention score obtained from the temporal attention score acquisition unit is relatively large.

In addition, the temporal attention score acquisition unit may include: a first Fully Connected Layer that receives the sum of the max pooling result and the average pooling result; a ReLU activation function execution unit that receives the output of the first Fully Connected Layer; a second Fully Connected Layer that receives the output of the ReLU activation function execution unit; and a Sigmoid activation function execution unit that receives the output of the second Fully Connected Layer.

In addition, the spatial information extraction module may include: a first Convolution Layer that receives the important frame and outputs a Feature Map; a Feature Map pooling operation unit that performs max pooling and average pooling operations on the Feature Map; a channel attention map acquisition unit that obtains a Channel Attention Map from the output of the Feature Map pooling operation unit; and a spatial attention map acquisition unit that obtains a Spatial Attention Map from the Channel Attention Map.

Another aspect of the present invention is a control method of a sports activity classification learning device that extracts at least one frame from sports activity image information and classifies a sports activity. The method may include: matching the sports activity image information with sports activity classification information and storing them as learning data; extracting at least one frame from the sports activity image information stored as the learning data; extracting important frames from the extracted frames; extracting spatial information of the important frames; inputting the spatial information of the important frames to a Long Short Term Memory (LSTM) model and acquiring feature vectors of the important frames; and obtaining a probability value from the feature vectors of the frames and, only when the probability value is greater than or equal to a preset probability value, classifying the frames into the sports activity matched with the image information constituting the important frames.

In addition, extracting the important frames from the extracted frames may include: extracting R-channel pixels, G-channel pixels, and B-channel pixels from the frame; summing the extracted pixels; performing max pooling (MAX-POOLING) and average pooling (AVERAGE-POOLING) operations on the summed pixel value; setting a weight for each of the max pooling result and the average pooling result, such that the weight set for max pooling and the weight set for average pooling sum to 1; and obtaining a temporal attention score using the sum of the max pooling result and the average pooling result.

In addition, obtaining the temporal attention score using the sum of the max pooling result and the average pooling result may include: inputting that sum to a first Fully Connected Layer; inputting the output of the first Fully Connected Layer to a ReLU activation function execution unit; inputting the output of the ReLU activation function execution unit to a second Fully Connected Layer; and inputting the output of the second Fully Connected Layer to a Sigmoid activation function execution unit.

In addition, extracting the spatial information of the important frames may include: inputting the frame multiplied by the temporal attention score to a first Convolution Layer to output a Feature Map; performing max pooling and average pooling operations on the Feature Map; obtaining a Channel Attention Map using the results of the max pooling and average pooling operations on the Feature Map; and obtaining a Spatial Attention Map from the Channel Attention Map.

In yet another aspect of the present invention, a computer program for performing the control method of the sports activity classification learning device may be recorded on a computer-readable storage medium.

According to the aspect of the present invention described above, by providing a control method of a sports activity classification learning device for classifying sports activities, and a recording medium and a device for performing the same, sports activities can be classified in consideration of both the temporal attention score and the spatial attention map.

FIG. 1 is a conceptual diagram illustrating a sports activity classification learning device according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram illustrating the important frame extraction module.
FIG. 3 is a conceptual diagram showing a plurality of frames multiplied by temporal attention scores through the TAM calculation process.
FIG. 4 is a conceptual diagram illustrating a process of obtaining a spatial attention map through a 2D CNN model.
FIG. 5 is a conceptual diagram illustrating a process of classifying sports activities after passing through the 2D CNN model.
FIG. 6 is a diagram showing a table in which the sports activity classification learning device of the present invention is evaluated by mAP.
FIG. 7 is a flowchart illustrating a control method of a sports activity classification learning device according to an embodiment of the present invention.
FIGS. 8 and 9 are flowcharts illustrating a process of obtaining the temporal attention score.
FIG. 10 is a flowchart illustrating a process of obtaining the spatial attention map.

The detailed description of the present invention that follows refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the present invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the present invention. It should be understood that the various embodiments of the present invention differ from one another but need not be mutually exclusive. For example, specific shapes, structures, and characteristics described herein in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the invention. It should also be understood that the location or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. Accordingly, the detailed description below is not to be taken in a limiting sense, and the scope of the present invention, if properly described, is limited only by the appended claims, along with the full range of equivalents to which those claims are entitled. Like reference numerals in the drawings refer to the same or similar functions throughout the several aspects.

Hereinafter, preferred embodiments of the present invention will be described in more detail with reference to the drawings.

FIG. 1 is a conceptual diagram illustrating a sports activity classification learning device according to an embodiment of the present invention.

The sports activity classification learning device 1 may include a database module 10, a frame extraction module 20, an important frame extraction module 30, a spatial information extraction module 40, a feature vector acquisition module 50, and a sports activity classification module 60.

The sports activity classification learning device 1 may be a device that extracts at least one frame from sports activity image information and learns to classify which sports activity is occurring in the extracted frames. Here, a sports activity may be, in a baseball game, a batter swinging or a batter hitting a home run or a strike, and, in a soccer game, a player scoring a goal, shooting, taking a corner kick or a free kick, committing an infraction, committing a foul, or taking a penalty kick. These are only examples, and sports activities are not limited thereto.

The database module 10 may match sports activity image information with sports activity classification information and store them as learning data. For example, if the sports activity image information shows a batter swinging in a baseball game, the sports activity classification information may be matched and stored as 'swing', and if the sports activity image information shows a batter hitting a home run in a baseball game, the sports activity classification information may be matched and stored as 'home run'.

The frame extraction module 20 may extract at least one frame from the image information stored in the database module 10. For example, from image information of a batter hitting a home run in a baseball game, it may extract at least one consecutive frame or at least one non-consecutive frame at preset time intervals.
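Extraction at a preset time interval can be sketched as follows. This is a minimal illustration, not the device's actual implementation: the function name `sample_frame_indices` and the parameters `fps` and `interval_s` are hypothetical, and the video is assumed to have a constant frame rate.

```python
def sample_frame_indices(num_frames, fps, interval_s):
    """Return indices of frames taken every `interval_s` seconds.

    Hypothetical helper: assumes a constant frame rate `fps`.
    """
    if num_frames <= 0 or fps <= 0 or interval_s <= 0:
        raise ValueError("num_frames, fps and interval_s must be positive")
    # Number of frames between two samples; at least 1 so the index advances.
    step = max(1, round(fps * interval_s))
    return list(range(0, num_frames, step))


# A 10-second clip at 30 fps, sampled every half second (non-consecutive frames).
sparse = sample_frame_indices(num_frames=300, fps=30, interval_s=0.5)

# An interval of one frame period yields consecutive frames.
dense = sample_frame_indices(num_frames=5, fps=30, interval_s=1 / 30)
```

With an interval equal to one frame period the extracted frames are consecutive; any longer interval yields non-consecutive frames.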

The important frame extraction module 30 may calculate a temporal attention score from the pixel values of a frame and multiply the frame by the temporal attention score. The temporal attention score may be calculated through a Temporal Attention Module (TAM). Here, the TAM is a method that computes importance among a plurality of frames so that important frames and non-important frames can be extracted.

For example, if the value of a second frame multiplied by its temporal attention score is greater than the value of a first frame multiplied by its temporal attention score, the second frame may be extracted as an important frame and the first frame as a non-important frame.
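The comparison in this example can be sketched as follows. The text does not state how the "value" of a score-multiplied frame is measured, so this sketch assumes, purely for illustration, that it is the sum of the scaled pixel values; the function name `rank_frames` is hypothetical.

```python
def rank_frames(frames, scores):
    """Scale each frame by its temporal attention score and rank the frames.

    Assumption for illustration: a frame's "value" is the sum of its
    scaled pixel values. Frames are given as flat lists of pixel values.
    """
    if len(frames) != len(scores):
        raise ValueError("one score per frame is required")
    weighted = [[s * p for p in frame] for frame, s in zip(frames, scores)]
    totals = [sum(frame) for frame in weighted]
    # Indices of frames, most important first.
    order = sorted(range(len(frames)), key=lambda i: totals[i], reverse=True)
    return weighted, order


frames = [[10, 20, 30], [10, 20, 30]]  # first and second frame, identical pixels
scores = [0.2, 0.9]                    # the second frame has the higher score
weighted, order = rank_frames(frames, scores)
# order[0] is the index of the frame treated as the important frame.
```

Here the second frame (index 1) ranks first and would be treated as the important frame, while the first frame would be treated as non-important.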

The Temporal Attention Module (TAM) is described in detail below with reference to FIGS. 2 and 3.

The spatial information extraction module 40 may obtain a Spatial Attention Map for each of the important frames and non-important frames that have been multiplied by the temporal attention score.

In the present invention, the Spatial Attention Map is obtained through a 2D CNN model, where the 2D CNN model refers to a ResNet50v2 model to which a Convolutional Block Attention Module (CBAM) has been added.

Through the CBAM calculation, a Spatial Attention Map can be obtained for each important frame and each non-important frame.

The Spatial Attention Map is a value computed through the Convolutional Block Attention Module (CBAM), a technique that improves the performance of a network model by focusing its computation on specific vectors. More specifically, the attention technique imitates the human recognition process and is used to improve the classification performance of a convolutional neural network by adding weight to the important data among data having a hierarchical structure. By distinguishing and learning data that contains relatively important information from data that contains unimportant information, the attention technique facilitates the flow of information within the convolutional neural network.

For example, a frame in which a batter is swinging may contain both spectator information and batter information. In this case, obtaining the map may mean obtaining a map that maximizes the pixel values in the region where the batter is located and minimizes the pixel values in the region where the spectators are located.
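The effect of a spatial attention map can be illustrated with a small sketch. In CBAM, the spatial attention map is computed by pooling the feature map along the channel axis (average and max), passing the two pooled maps through a convolution, and applying a sigmoid. The sketch below is a simplification under stated assumptions: the convolution is replaced by a fixed two-weight mix (`w_avg`, `w_max`, both hypothetical), so it shows the idea rather than the model used in the invention.

```python
import math


def spatial_attention(feature_map, w_avg=1.0, w_max=1.0):
    """feature_map[c][y][x] -> attention[y][x] with values in (0, 1).

    Simplification: CBAM's convolution over the pooled maps is replaced
    by a fixed weighted mix of the channel-wise average and maximum.
    """
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    attention = []
    for y in range(height):
        row = []
        for x in range(width):
            values = [feature_map[c][y][x] for c in range(channels)]
            mixed = w_avg * sum(values) / channels + w_max * max(values)
            row.append(1.0 / (1.0 + math.exp(-mixed)))  # sigmoid
        attention.append(row)
    return attention


def apply_attention(feature_map, attention):
    """Scale every channel of the feature map by the attention map."""
    return [[[v * a for v, a in zip(frow, arow)]
             for frow, arow in zip(plane, attention)]
            for plane in feature_map]


# Two channels, 1x2 image: the left location (e.g. the batter) is strongly
# activated, the right location (e.g. the spectators) is not.
fmap = [[[4.0, -4.0]], [[2.0, -2.0]]]
att = spatial_attention(fmap)
out = apply_attention(fmap, att)
```

The attention value is close to 1 at the strongly activated location and close to 0 at the other, so multiplying the feature map by the attention map keeps the former and suppresses the latter.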

The Convolutional Block Attention Module (CBAM) is described in detail below with reference to FIG. 4.

The feature vector acquisition module 50 may acquire a feature vector for each of the important frames and non-important frames. Here, the Spatial Attention Map of each important frame and non-important frame is input to a Long Short Term Memory (LSTM) model, and the outputs of the LSTM model may be the feature vector of the important frame and the feature vector of the non-important frame. The LSTM is a type of recurrent neural network (RNN) whose units consist of a cell, an input gate, an output gate, and a forget gate. A recurrent neural network composed of LSTM units is called an LSTM model.
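The roles of the cell and the three gates can be sketched with the standard LSTM update, reduced here to scalars for readability. This is a generic textbook formulation, not the specific model configuration of the invention; the parameter dictionary and its values are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell.

    p maps a gate name to (input weight, recurrent weight, bias).
    """
    def gate(name, activation):
        w_x, w_h, bias = p[name]
        return activation(w_x * x + w_h * h_prev + bias)

    i = gate("input", sigmoid)         # how much new information to write
    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # candidate content for the cell
    c = f * c_prev + i * g             # updated cell state
    h = o * math.tanh(c)               # updated hidden state (the output)
    return h, c


# Toy parameters (hypothetical, not learned), shared by all gates for brevity.
params = {name: (0.5, 0.1, 0.0)
          for name in ("input", "forget", "output", "candidate")}

h, c = 0.0, 0.0
for x in [0.2, 0.9, 0.4]:  # one scalar input per frame, in temporal order
    h, c = lstm_step(x, h, c, params)
```

After the last frame, `h` plays the role of the feature summarizing the sequence; in a real model `x`, `h`, and `c` are vectors and the weights are matrices.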

The sports activity classification module 60 may obtain a probability value from the feature vectors of the important frames and the non-important frames and, only when the probability value is greater than or equal to a preset probability value, classify the frames into the sports activity matched with the image information constituting the frames. When the probability value is less than the preset probability value, the Long Short Term Memory (LSTM) model may be run repeatedly.
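The thresholding rule can be sketched as follows. The text does not state how the probability value is derived from the feature vector, so this sketch assumes a softmax over per-class scores; the labels, scores, and threshold are hypothetical.

```python
import math


def softmax(logits):
    """Convert per-class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels, threshold):
    """Return (label, probability); label is None below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] >= threshold:
        return labels[best], probs[best]
    return None, probs[best]


labels = ["swing", "home run", "strike"]
confident = classify([0.1, 3.0, 0.2], labels, threshold=0.8)
uncertain = classify([1.0, 1.1, 0.9], labels, threshold=0.8)
```

In the first call one class clearly dominates and its label is returned; in the second call no class reaches the threshold, so no label is assigned and, per the description, the LSTM model would be run again.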

FIG. 2 is a conceptual diagram illustrating the important frame extraction module, and FIG. 3 is a conceptual diagram showing a plurality of frames multiplied by temporal attention scores through the TAM calculation process.

The important frame extraction module 30 may include a pixel extraction unit 31, a pixel summing unit 32, a pixel pooling operation unit 33, a weight setting unit 34, and a temporal attention score acquisition unit 35.

The pixel extraction unit 31 may extract R-channel pixels, G-channel pixels, and B-channel pixels from a frame, and the pixel summing unit 32 may sum the extracted pixels.

The pixel pooling operation unit 33 may perform max pooling (MAX-POOLING) and average pooling (AVERAGE-POOLING) operations on the summed pixel value. Performing the max pooling and average pooling operations reduces the size of the pixel data.
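The channel summation and the two pooling operations can be sketched as follows. The text does not specify the pooling window, so this sketch assumes global pooling, which reduces each frame to a single value per operation; all function names are hypothetical.

```python
def sum_channels(r, g, b):
    """Element-wise sum of three equally sized 2D channel planes."""
    return [[rv + gv + bv for rv, gv, bv in zip(r_row, g_row, b_row)]
            for r_row, g_row, b_row in zip(r, g, b)]


def global_max_pool(plane):
    """Largest value in the plane (assumed global pooling)."""
    return max(v for row in plane for v in row)


def global_avg_pool(plane):
    """Mean of all values in the plane (assumed global pooling)."""
    values = [v for row in plane for v in row]
    return sum(values) / len(values)


# A toy 2x2 frame split into its R, G and B channels.
r = [[1, 2], [3, 4]]
g = [[0, 1], [0, 1]]
b = [[5, 5], [5, 5]]

summed = sum_channels(r, g, b)    # [[6, 8], [8, 10]]
f_max = global_max_pool(summed)   # 10
f_avg = global_avg_pool(summed)   # 8.0
```

Each frame of four summed pixels is reduced to one maximum and one mean, which is the size reduction the pooling step provides.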

The weight setting unit 34 sets a weight for each of the max pooling result and the average pooling result, such that the weight set for max pooling and the weight set for average pooling sum to 1. The user may assign the relatively larger weight to either the max pooling result or the average pooling result.
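One way to guarantee that the two weights sum to 1 is to let the user choose only one of them and derive the other, as in the following sketch. The function and parameter names are hypothetical.

```python
def combine_pooled(f_avg, f_max, a_avg):
    """Weighted sum of the pooled values; the two weights always sum to 1.

    a_avg is the user-chosen weight for the average pooling result;
    the weight for the max pooling result is derived as 1 - a_avg.
    """
    if not 0.0 <= a_avg <= 1.0:
        raise ValueError("a_avg must lie in [0, 1]")
    a_max = 1.0 - a_avg
    return a_avg * f_avg + a_max * f_max


# Emphasize the max pooling result (weight 0.75) over the average (0.25).
combined = combine_pooled(f_avg=8.0, f_max=10.0, a_avg=0.25)
```

Choosing `a_avg` below 0.5 gives the larger weight to the max pooling result, and choosing it above 0.5 gives the larger weight to the average pooling result.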

The temporal attention score acquisition unit 35 may obtain the temporal attention score using the sum of the max pooling result and the average pooling result.

The temporal attention score acquisition unit 35 may comprise a first Fully Connected Layer, a ReLU activation function execution unit, a second Fully Connected Layer, and a Sigmoid activation function execution unit.

The first Fully Connected Layer may receive the sum of the max pooling result and the average pooling result produced by the pixel pooling operation unit 33.

The ReLU activation function execution unit may receive the output of the first Fully Connected Layer.

The second Fully Connected Layer may receive the output of the ReLU activation function execution unit.

The Sigmoid activation function execution unit may receive the output of the second Fully Connected Layer.
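The chain of layers described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the input vector is assumed to hold one combined pooling value per frame, the hidden layer is smaller than the input, bias terms are set to zero, and all weights are hypothetical toy values rather than learned parameters.

```python
import math


def linear(weights, bias, x):
    """Fully connected layer: y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    """Pass positive values through, replace negative values with 0."""
    return [max(0.0, v) for v in x]


def sigmoid(x):
    """Squash each value into the open interval (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


def temporal_attention_scores(pooled, w1, b1, w2, b2):
    """First FC layer -> ReLU -> second FC layer -> Sigmoid."""
    hidden = relu(linear(w1, b1, pooled))
    return sigmoid(linear(w2, b2, hidden))


# Toy setup (hypothetical): 4 frames, hidden size 2.
pooled = [0.2, 0.9, 0.4, 0.1]                # one combined pooled value per frame
w1 = [[0.5, -0.2, 0.1, 0.3],
      [-0.4, 0.8, 0.2, -0.1]]               # first layer: 4 inputs -> 2 hidden
b1 = [0.0, 0.0]
w2 = [[1.0, -1.0], [0.5, 2.0],
      [-0.3, 0.7], [0.2, 0.2]]              # second layer: 2 hidden -> 4 outputs
b2 = [0.0, 0.0, 0.0, 0.0]

scores = temporal_attention_scores(pooled, w1, b1, w2, b2)
```

Because the last step is a sigmoid, every score lies strictly between 0 and 1, so each score can be used directly as a multiplicative weight on its frame.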

The process of calculating the temporal attention score according to an embodiment of the present invention may be expressed by Equation 1 below, which represents the Temporal Attention Module (TAM) calculation process.

(Equation 1)

$$I_{sum} = I_R + I_G + I_B \tag{1}$$

$$F_{avg} = \mathrm{AvgPool}(I_{sum}) \tag{2}$$

$$F_{max} = \mathrm{MaxPool}(I_{sum}) \tag{3}$$

$$F = a_1 F_{avg} + a_2 F_{max} \tag{4}$$

$$s = \sigma\left(W_2\,\mathrm{ReLU}(W_1 F)\right) \tag{5}$$

In Equation (1), $I$ denotes a specific frame extracted by the frame extraction module 20, and $I_{sum}$ may denote the sum of the R-channel, G-channel, and B-channel pixel values extracted by the pixel extraction unit 31. $I_R$ is the pixel value of the R channel, $I_G$ is the pixel value of the G channel, and $I_B$ is the pixel value of the B channel.

In Equation (2), $F_{avg}$ may denote the value obtained when the pixel pooling operation unit 33 performs an average pooling (AVERAGE-POOLING) operation.

In Equation (3), $F_{max}$ may denote the value obtained when the pixel pooling operation unit 33 performs a max pooling (MAX-POOLING) operation.

In Equation (4), $F$ may denote the weighted sum of $F_{avg}$ and $F_{max}$. $a_1$ and $a_2$ are optimal pooling parameters and may denote weights. When the user sets a larger weight on $F_{avg}$, $a_1 > a_2$; when the user sets a larger weight on $F_{max}$, $a_2 > a_1$. In either case, the sum of $a_1$ and $a_2$ may be set to 1.

Equation (5) may represent the process by which the attention score acquisition unit 35 obtains the temporal attention score from the sum of the max pooling (MAX-POOLING) value and the average pooling (AVERAGE-POOLING) value.

$W_1 F$ means that the sum of the max pooling value and the average pooling value is input to the first Fully Connected Layer, where $W_1$ is a weight vector of the multi-layer neural network structure. Here, the Fully Connected Layer may be a fully connected layer that connects pixel values, and a plurality of such layers may be provided. The vector form of $W_1$ is expressed in terms of $C$ and $r$, where $C$ denotes the channel of the Fully Connected Layer and $r$ denotes the Reduction Ratio.

$\mathrm{ReLU}(W_1 F)$ may denote the value obtained by inputting $W_1 F$ to the ReLU activation function. The ReLU activation function outputs the input value unchanged when the input is positive and outputs 0 when the input is negative.

$W_2\,\mathrm{ReLU}(W_1 F)$ means that the output of $\mathrm{ReLU}(W_1 F)$ is input to the second Fully Connected Layer, where $W_2$ is a weight vector of the multi-layer neural network structure.

$\sigma\left(W_2\,\mathrm{ReLU}(W_1 F)\right)$ expresses that this value is input to the Sigmoid function. The Sigmoid function, denoted $\sigma$, outputs a value between 0 and 1, and this output value $s$ is called the temporal attention score.
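Steps (1) to (4) of the TAM calculation can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the pooling is taken globally over all pixels of a frame, a1 is applied to the average-pooled value and a2 to the max-pooled value, and the function names and the nested-list frame layout are hypothetical.

```python
def channel_sum(frame):
    """frame: H x W grid of (r, g, b) tuples -> H x W grid of r + g + b."""
    return [[r + g + b for (r, g, b) in row] for row in frame]

def pooled_descriptor(frame, a1=0.5, a2=0.5):
    """Weighted combination of average pooling and max pooling over one frame."""
    if abs(a1 + a2 - 1.0) > 1e-9:
        raise ValueError("a1 and a2 must sum to 1")
    values = [v for row in channel_sum(frame) for v in row]
    avg_pooled = sum(values) / len(values)   # assumed global average pooling
    max_pooled = max(values)                 # assumed global max pooling
    return a1 * avg_pooled + a2 * max_pooled

# Toy 2 x 2 frame; the channel sums are 6, 0, 12 and 3.
toy_frame = [[(1, 2, 3), (0, 0, 0)],
             [(4, 4, 4), (1, 1, 1)]]
equal_weights = pooled_descriptor(toy_frame)               # a1 = a2 = 0.5
max_heavy = pooled_descriptor(toy_frame, a1=0.25, a2=0.75)
```

Raising a2 pulls the descriptor toward the brightest pixel, while raising a1 pulls it toward the frame-wide mean.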

The above is described in detail in "S. Liu, X. Ma, H. Wu, Y. Li, An End to End Framework with Adaptive Spatio-Temporal Attention Module for Human Action Recognition, Digital Object Identifier, Vol.8 (2020) 47220-47231".

FIG. 3 shows the result of passing each frame through the TAM (Temporal Attention Module) calculation to obtain a temporal attention score and then multiplying the frame by that score.

For example, if the vector value obtained by multiplying the first frame by its temporal attention score is greater than the vector value obtained by multiplying the second frame by its temporal attention score, the first frame is set as an important frame and the second frame is set as a non-important frame. In FIG. 3, a frame with a solid border may denote a frame set as an important frame, and a frame with a dotted border may denote a frame set as a non-important frame.
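The comparison described above can be sketched as follows. The source does not state how two score-weighted frames are compared or how many frames are kept, so this sketch assumes that frames are compared by the sum of their weighted values and that a hypothetical `keep` parameter fixes the number of important frames.

```python
def weighted_magnitude(frame_values, score):
    """Sum of the frame values after multiplication by the attention score."""
    return sum(score * v for v in frame_values)

def split_important(frames, scores, keep):
    """Return (important, non_important) frame indices, each in temporal order."""
    magnitudes = [weighted_magnitude(f, s) for f, s in zip(frames, scores)]
    ranking = sorted(range(len(frames)), key=lambda i: magnitudes[i], reverse=True)
    important = sorted(ranking[:keep])
    non_important = sorted(ranking[keep:])
    return important, non_important

# Three toy frames (flattened pixel values) with hypothetical attention scores.
toy_frames = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
toy_scores = [0.9, 0.1, 0.5]
important, non_important = split_important(toy_frames, toy_scores, keep=2)
```

Here the second frame has the largest raw values but the lowest score, so it ends up as the non-important frame.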

FIG. 4 is a conceptual diagram illustrating the process of obtaining a spatial attention map through a 2D CNN.

The important frames and non-important frames multiplied by their temporal attention scores may be input to a 2D CNN model. The 2D CNN model may comprise a plurality of Convolution Layers and a CBAM (Convolutional Block Attention Module).

The spatial information extraction module 40 may extract spatial information of the important frames and the non-important frames. Here, the spatial information may mean a spatial attention map.

The spatial information extraction module 40 may include a Convolution Layer, a Feature Map pooling operation unit, a channel attention map acquisition unit, and a spatial attention map generation unit.

When a frame multiplied by its temporal attention score is input to the Convolution Layer, a Feature Map may be output. The Feature Map is input to the Feature Map pooling operation unit, where a max pooling (MAX-POOLING) operation and an average pooling (AVERAGE-POOLING) operation may be performed.

The channel attention map acquisition unit may obtain a channel attention map from the output value of the Feature Map pooling operation unit.

The spatial attention map acquisition unit may obtain a spatial attention map from the channel attention map.
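The channel attention step can be illustrated with a small pure-Python sketch that follows the CBAM formulation cited in this description: the Feature Map is average-pooled and max-pooled per channel, both descriptors pass through the same small MLP, and the summed outputs go through a Sigmoid. The function names, the nested-list C x H x W layout, and the bias-free weights w0 and w1 are assumptions for illustration.

```python
import math

def channel_descriptors(fmap):
    """fmap: C x H x W nested lists -> (per-channel average, per-channel max)."""
    avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    mx = [max(max(row) for row in ch) for ch in fmap]
    return avg, mx

def shared_mlp(vector, w0, w1):
    """Two-layer perceptron with ReLU in between; the same weights serve both descriptors."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, vector))) for row in w0]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w1]

def channel_attention(fmap, w0, w1):
    """One attention weight in (0, 1) per channel."""
    avg, mx = channel_descriptors(fmap)
    from_avg = shared_mlp(avg, w0, w1)
    from_max = shared_mlp(mx, w0, w1)
    return [1.0 / (1.0 + math.exp(-(a + m))) for a, m in zip(from_avg, from_max)]

def refine(fmap, channel_weights):
    """Scale every value of channel c by its attention weight."""
    return [[[channel_weights[c] * v for v in row] for row in ch]
            for c, ch in enumerate(fmap)]

# Toy Feature Map with C=2, H=1, W=2 and hypothetical weights (hidden size 1).
toy_fmap = [[[1.0, 3.0]], [[0.0, 0.0]]]
toy_w0 = [[1.0, 0.0]]
toy_w1 = [[1.0], [0.0]]
mc = channel_attention(toy_fmap, toy_w0, toy_w1)
refined = refine(toy_fmap, mc)
```

The refined Feature Map keeps the input shape, which is what allows the spatial attention step to be applied to it afterwards.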

The process of obtaining the channel attention map and the spatial attention map according to an embodiment of the present invention may be expressed by Equation 2 below, which corresponds to the CBAM (Convolutional Block Attention Module) calculation.

(Equation 2)

$$M_c(F) = \sigma\left(\mathrm{MLP}(F^c_{avg}) + \mathrm{MLP}(F^c_{max})\right) \tag{1}$$

$$F' = M_c(F) \otimes F \tag{2}$$

$$M_s(F') = \sigma\left(f^{7\times 7}\left([\mathrm{AvgPool}(F');\ \mathrm{MaxPool}(F')]\right)\right) \tag{3}$$

$$F'' = M_s(F') \otimes F' \tag{4}$$

In Equation (1), when a frame multiplied by its temporal attention score is input to the Convolution Layer, a Feature Map is obtained, which is denoted $F$. $F^c_{avg}$ and $F^c_{max}$ are the vectors obtained by performing a pooling operation on $F$ along the channel axis: $F^c_{avg}$ is obtained by the average pooling (AVERAGE-POOLING) operation and $F^c_{max}$ by the max pooling (MAX-POOLING) operation. MLP denotes the multi-layer perceptron that CBAM applies to both pooled vectors, $\sigma$ denotes the Sigmoid function, and $M_c$ denotes the channel attention map.

In Equation (2), $\otimes$ denotes element-wise multiplication, and $F'$ denotes the Feature Map obtained by refining $F$.

In Equation (3), $\mathrm{AvgPool}(F')$ is the value obtained by average pooling (AVERAGE-POOLING) of $F'$, and $\mathrm{MaxPool}(F')$ is the value obtained by max pooling (MAX-POOLING) of $F'$. $f$ denotes a convolution operation, $f^{7\times 7}$ denotes a 7x7 convolution operation, and $M_s$ denotes the spatial attention map.

Although FIG. 4 shows only two Convolution Layers, a plurality of Convolution Layers may be provided. $F''$ obtained from Equation (4) denotes the Feature Map obtained by refining $F'$. $F''$ may be input to a Convolution Layer, and the process of Equation (3) may be performed. That is, depending on the structure of the convolutional neural network, the process of inputting the Feature Map $F''$ to a new Convolution Layer may be repeated.
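The spatial attention step of Equation (3) can be sketched in pure Python as follows. The sketch assumes zero padding so that the output keeps the spatial size of the input, a single bias-free convolution over the two pooled maps, and a nested-list C x H x W layout; the function name and the kernel values are hypothetical.

```python
import math

def spatial_attention(fmap, kernel):
    """fmap: C x H x W nested lists; kernel: 2 x k x k weights with k odd.

    Returns an H x W map of attention weights in (0, 1).
    """
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    # Pool across channels at every pixel position.
    avg_map = [[sum(fmap[c][i][j] for c in range(channels)) / channels
                for j in range(width)] for i in range(height)]
    max_map = [[max(fmap[c][i][j] for c in range(channels))
                for j in range(width)] for i in range(height)]
    stacked = [avg_map, max_map]
    k = len(kernel[0])
    pad = k // 2
    out = []
    for i in range(height):
        row = []
        for j in range(width):
            acc = 0.0
            for ch in range(2):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < height and 0 <= x < width:  # zero padding
                            acc += kernel[ch][di][dj] * stacked[ch][y][x]
            row.append(1.0 / (1.0 + math.exp(-acc)))
        out.append(row)
    return out

toy_fmap = [[[1.0, 2.0]], [[3.0, 0.0]]]                      # C=2, H=1, W=2
ones_1x1 = [[[1.0]], [[1.0]]]                                # degenerate 1x1 kernel
zeros_7x7 = [[[0.0] * 7 for _ in range(7)] for _ in range(2)]
small = spatial_attention(toy_fmap, ones_1x1)
neutral = spatial_attention(toy_fmap, zeros_7x7)
```

With an all-zero 7x7 kernel every position receives the neutral weight 0.5, which makes a convenient sanity check before trained weights are plugged in.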

The channel attention map and spatial attention map calculations may be performed by a software module inserted into one or more of the data connection sections between the plurality of Convolution Layers of the convolutional neural network and the Pooling Layers used for pooling operations. The data connection section between a Convolution Layer and a Pooling Layer in which these calculations are performed is not limited to any specific section.

The above is described in detail in "S. Woo, J. Park, J. Lee, I. Kweon, CBAM: Convolutional Block Attention Module, Proceedings of the European Conference on Computer Vision (ECCV) (2018) 3-19".

FIG. 5 is a conceptual diagram illustrating the process of classifying sports activities after the 2D CNN model has been applied.

The spatial information obtained through the 2D CNN model may be input as an input value of an LSTM (Long Short Term Memory) model.
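For readers unfamiliar with the LSTM, one time step of a standard LSTM cell can be sketched with scalar state as follows. This is the textbook formulation rather than a detail taken from the source; the parameter layout and function names are hypothetical, and a real model would use vectors and learned weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step. params maps each gate ('i', 'f', 'o', 'g') to (w_x, w_h, b)."""
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre_activation("i"))    # input gate
    f = sigmoid(pre_activation("f"))    # forget gate
    o = sigmoid(pre_activation("o"))    # output gate
    g = math.tanh(pre_activation("g"))  # candidate cell value
    c = f * c_prev + i * g              # new cell state
    h = o * math.tanh(c)                # new hidden state
    return h, c

def run_sequence(inputs, params):
    """Feed per-frame inputs in temporal order and return the last hidden state."""
    h, c = 0.0, 0.0
    for x in inputs:
        h, c = lstm_step(x, h, c, params)
    return h

# All-zero parameters: every gate sits at 0.5 and the candidate value is 0.
zero_params = {gate: (0.0, 0.0, 0.0) for gate in ("i", "f", "o", "g")}
h1, c1 = lstm_step(1.0, 0.0, 1.0, zero_params)
```

The cell state carries information across frames, which is why the per-frame spatial information is fed to the model in temporal order.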

The output value of the LSTM (Long Short Term Memory) model may be input to an MLP (Multi-Layer Perceptron). An MLP is a feed-forward artificial neural network composed of at least three layers: an input layer, a hidden layer, and an output layer. It is trained with a supervised learning technique called back-propagation, and it can distinguish data that is not linearly separable. The output value of the MLP may be a feature vector.

When the output feature vectors of the important frames and the non-important frames are input to the Sigmoid activation function, a probability value may be output for each. Only when the output probability value is equal to or greater than a preset probability value may the input be classified as the sports activity that was matched with the image information and stored in the database module 10. If the output probability value is less than the preset probability value, the LSTM process may be repeated.
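The thresholding rule above can be sketched as follows. The sketch assumes one Sigmoid output per activity and returns every activity whose probability reaches the preset value; the function name, the example activity names drawn from the dataset labels, and the 0.5 threshold are illustrative assumptions.

```python
import math

def classify(feature_vector, activities, threshold=0.5):
    """Return (activity, probability) pairs whose probability is at least the threshold."""
    probabilities = [1.0 / (1.0 + math.exp(-z)) for z in feature_vector]
    return [(name, p) for name, p in zip(activities, probabilities) if p >= threshold]

accepted = classify([2.0, -1.0, 0.0], ["Swing", "Ball", "Strike"], threshold=0.5)
rejected = classify([-3.0], ["Bunt"], threshold=0.5)
```

An empty result corresponds to the case in which the probability stays below the preset value and no activity is assigned.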

FIG. 6 shows a table in which the sports activity classification learning device of the present invention is evaluated in terms of mAP.

To evaluate the activity classification performance of the sports activity classification learning device 1 of the present invention, an experiment was conducted using the MLB Youtube Dataset. The MLB Youtube Dataset was built from videos of 20 games of the 2017 MLB postseason. It contains a total of 5,846 videos, each labeled with some of eight sports activities according to its content: Ball, Strike, Swing, Bunt, Foul, Hit, Hit By Pitch, and In Play.

In the present invention, to search for suitable hyperparameters, 463 videos were extracted from the dataset by stratified sampling and used as a validation set; the remaining 4,200 videos were used for training, and 1,183 videos were used for evaluation.
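The three subsets reported above (463, 4,200, and 1,183 videos) add up to the 5,846 videos in the dataset. The stratified sampling step can be sketched as follows; the sketch assumes a single label per video for simplicity, whereas the dataset itself assigns several activities to a video, and the function name, the sampling fraction, and the fixed seed are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(labels, fraction, seed=0):
    """Pick roughly `fraction` of the indices from each label group."""
    rng = random.Random(seed)            # fixed seed keeps the split reproducible
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    picked = []
    for label in sorted(by_label):       # deterministic group order
        group = by_label[label]
        count = round(len(group) * fraction)
        picked.extend(rng.sample(group, count))
    return sorted(picked)

# Toy label list: 10 videos of one activity and 90 of another.
toy_labels = ["Ball"] * 10 + ["Strike"] * 90
validation = stratified_sample(toy_labels, fraction=0.1)
```

Sampling within each label group keeps the rare activity represented in the validation set in the same proportion as in the full data.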

mAP (mean Average Precision) was used as the evaluation metric. mAP is frequently used in fields such as object detection: the precision is computed for each activity, the values are summed, and the sum is divided by the total number of classes. It therefore indicates how accurately, on average, all activities were classified.
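A minimal sketch of an mAP computation is given below. The description above speaks of averaging a per-activity precision; this sketch substitutes the common ranking-based average precision for each class and then averages over classes, which is an assumption about the exact variant used. The function names and the toy scores are hypothetical.

```python
def average_precision(scores, labels):
    """Mean of precision@k taken at the rank of every positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_class):
    """per_class: list of (scores, labels) pairs, one pair per activity."""
    values = [average_precision(scores, labels) for scores, labels in per_class]
    return sum(values) / len(values)

toy_results = [
    ([0.9, 0.8, 0.7], [1, 0, 1]),   # positives at ranks 1 and 3
    ([0.1, 0.9, 0.5], [0, 1, 0]),   # single positive ranked first
]
toy_map = mean_average_precision(toy_results)
```

Because each class contributes equally to the mean, rare activities weigh as much as frequent ones in the final figure.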

In FIG. 6, RGB means that only RGB frames were used as input, Flow means that only optical flow data was used as input, and Two-stream means that both RGB frames and optical flow were used as input.

Although the Proposed Method of the present invention uses only RGB frames as input, it achieves a higher score than the existing models.

The control method of the sports activity classification learning device according to an embodiment of the present invention is carried out on substantially the same configuration as the sports activity classification learning device 1 shown in FIG. 1. Accordingly, the same reference numerals are given to the same components as in the sports activity classification learning device 1 of FIG. 1, and repeated descriptions are omitted.

FIG. 7 is a flowchart illustrating a control method of a sports activity classification learning device according to an embodiment of the present invention.

A control method of a sports activity classification learning device that classifies sports activities by extracting at least one frame from sports activity image information may include: matching sports activity image information with sports activity classification information and storing them as learning data (100); extracting at least one frame from the sports activity image information stored as the learning data (110); extracting an important frame from the extracted frames (120); extracting spatial information of the important frame (130); inputting the spatial information of the important frame as an input value of an LSTM (Long Short Term Memory) model (140); acquiring a feature vector of the important frame (150); obtaining a probability value from the feature vector of the important frame (160); and, only when the probability value is equal to or greater than a preset probability value, classifying the important frame as the sports activity matched with the image information constituting the important frame (170).

FIGS. 8 and 9 are flowcharts illustrating the process of obtaining a temporal attention score.

Referring to FIG. 8, the step (120) of extracting an important frame from the extracted frames may include: extracting R-channel pixels, G-channel pixels, and B-channel pixels from the frame (200); summing the extracted pixels (210); performing max pooling (MAX-POOLING) and average pooling (AVERAGE-POOLING) operations on the summed pixel value (220); setting a weight for each of the max pooling value and the average pooling value, with the sum of the weight set for max pooling and the weight set for average pooling set to 1 (230); and obtaining a temporal attention score using the sum of the max pooling value and the average pooling value (240).

Referring to FIG. 9, the step (240) of obtaining a temporal attention score using the sum of the max pooling (MAX-POOLING) value and the average pooling (AVERAGE-POOLING) value may include: inputting the sum of the max pooling value and the average pooling value to the first Fully Connected Layer (300); inputting the output value of the first Fully Connected Layer to the ReLU activation function unit (310); inputting the output value of the ReLU activation function unit to the second Fully Connected Layer (320); and inputting the output value of the second Fully Connected Layer to the Sigmoid activation function unit (330).

FIG. 10 is a flowchart illustrating the process of obtaining a spatial attention map.

The step (130) of extracting spatial information of the important frame may include: inputting the frame multiplied by the temporal attention score to a first Convolution Layer and outputting a Feature Map (400); performing max pooling (MAX-POOLING) and average pooling (AVERAGE-POOLING) operations on the Feature Map (410); obtaining a channel attention map using the values obtained by the max pooling and average pooling operations on the Feature Map (420); and obtaining a spatial attention map from the channel attention map (430).

The step of extracting spatial information of a non-important frame may proceed in the same manner as the step of extracting spatial information of the important frame.

The control method of the sports activity classification learning device 1 described above may be implemented as an application, or in the form of program instructions executable by various computer components, and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software.

Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. A hardware device may be configured to operate as one or more software modules to perform the processing according to the present invention, and vice versa.

Although embodiments of the present invention have been described above, the spirit of the present invention is not limited to the embodiments presented in this specification. Those skilled in the art who understand the spirit of the present invention may readily propose other embodiments by adding, changing, or deleting components within the scope of the same spirit, and such embodiments also fall within the scope of the present invention.

1: Sports activity classification learning device
10: database module
20: frame extraction module
30: important frame extraction module
31: pixel extraction unit
32: pixel summation unit
33: pixel pooling operation unit
34: weight setting unit
35: temporal attention score acquisition unit
40: spatial information extraction module
50: feature vector acquisition module
60: sports activity classification module

Claims

A sports activity classification learning device for classifying sports activities by extracting at least one frame from sports activity image information, comprising:
a database module that matches the sports activity image information with sports activity classification information and stores the matched data as learning data;
a frame extraction module that extracts at least one frame from the image information stored in the database module;
an important frame extraction module that extracts an important frame from the extracted frames;
a spatial information extraction module that extracts spatial information of the important frame;
a feature vector acquisition module that receives the spatial information of the important frame as an input value of a Long Short Term Memory (LSTM) model and acquires a feature vector of the important frame; and
a sports activity classification module that obtains a probability value from the feature vector of the important frame and, only when the probability value is equal to or greater than a preset probability value, classifies the important frame as the sports activity matched with the image information constituting the important frame.

The sports activity classification learning device of claim 1, wherein the important frame extraction module comprises:
a pixel extraction unit that extracts R-channel pixels, G-channel pixels, and B-channel pixels from the frame;
a pixel summing unit that sums the extracted pixels;
a pixel pooling operation unit that performs MAX-POOLING and AVERAGE-POOLING operations on the summed pixel value;
a weight setting unit that sets a weight for each of the MAX-POOLING value and the AVERAGE-POOLING value, the sum of the weight set for MAX-POOLING and the weight set for AVERAGE-POOLING being set to 1; and
a temporal attention score acquisition unit that obtains a temporal attention score using the sum of the MAX-POOLING value and the AVERAGE-POOLING value.

The sports activity classification learning device of claim 2, wherein the important frame extraction module extracts, as the important frame, a frame whose value after multiplication by the temporal attention score obtained by the temporal attention score acquisition unit is relatively large.

The sports activity classification learning device of claim 3, wherein the temporal attention score acquisition unit comprises:
a first Fully Connected Layer to which the sum of the MAX-POOLING value and the AVERAGE-POOLING value is input as an input value;
a ReLU activation function unit to which the output value of the first Fully Connected Layer is input as an input value;
a second Fully Connected Layer to which the output value of the ReLU activation function unit is input as an input value; and
a Sigmoid activation function unit to which the output value of the second Fully Connected Layer is input as an input value.

The sports activity classification learning device of claim 1, wherein the spatial information extraction module comprises:
a first Convolution Layer that receives the important frame and outputs a Feature Map;
a Feature Map pooling operation unit that performs a MAX-POOLING operation and an AVERAGE-POOLING operation on the Feature Map;
a channel attention map acquisition unit that obtains a channel attention map from the output value of the Feature Map pooling operation unit; and
a spatial attention map acquisition unit that obtains a spatial attention map from the channel attention map.

A control method of a sports activity classification learning device for classifying sports activities by extracting at least one frame from sports activity image information, comprising:
matching the sports activity image information with sports activity classification information and storing them as learning data;
extracting at least one frame from the sports activity image information stored as the learning data;
extracting an important frame from the extracted frames;
extracting spatial information of the important frame;
inputting the spatial information of the important frame as an input value of a Long Short Term Memory (LSTM) model and acquiring a feature vector of the important frame; and
obtaining a probability value from the feature vector of the important frame and, only when the probability value is equal to or greater than a preset probability value, classifying the important frame as the sports activity matched with the image information constituting the important frame.

The control method of claim 6, wherein extracting an important frame from the extracted frames comprises:
extracting R-channel pixels, G-channel pixels, and B-channel pixels from the frame and summing the extracted pixels;
performing MAX-POOLING and AVERAGE-POOLING operations on the summed pixel value;
setting a weight for each of the MAX-POOLING value and the AVERAGE-POOLING value, the sum of the weight set for MAX-POOLING and the weight set for AVERAGE-POOLING being set to 1; and
obtaining a temporal attention score using the sum of the MAX-POOLING value and the AVERAGE-POOLING value.

The control method of claim 7, wherein obtaining a temporal attention score using the sum of the MAX-POOLING value and the AVERAGE-POOLING value comprises:
inputting the sum of the MAX-POOLING value and the AVERAGE-POOLING value to a first Fully Connected Layer as an input value;
inputting the output value of the first Fully Connected Layer to a ReLU activation function unit as an input value;
inputting the output value of the ReLU activation function unit to a second Fully Connected Layer as an input value; and
inputting the output value of the second Fully Connected Layer to a Sigmoid activation function unit as an input value.

The control method of claim 6, wherein extracting the spatial information of the important frame comprises:
inputting a frame multiplied by the temporal attention score to a first Convolution Layer and outputting a Feature Map;
performing MAX-POOLING and AVERAGE-POOLING operations on the Feature Map;
obtaining a channel attention map using the values obtained by the MAX-POOLING and AVERAGE-POOLING operations on the Feature Map; and
obtaining a spatial attention map from the channel attention map.

A computer-readable storage medium on which a computer program for performing the control method of the sports activity classification learning device according to any one of claims 6 to 9 is recorded.