CN115022617A - Video quality evaluation method based on electroencephalogram signal and space-time multi-scale combined network - Google Patents
Video quality evaluation method based on electroencephalogram signal and space-time multi-scale combined network
- Publication number
- CN115022617A (application number CN202210601991.3A)
- Authority
- CN
- China
- Prior art keywords
- space
- distortion
- network
- scale
- electroencephalogram
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N17/00—Diagnosis, testing or measuring for television systems or their details
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention provides a video quality evaluation method based on electroencephalogram signals and a spatio-temporal multi-scale joint network, implemented in the following steps: acquiring distorted videos with different distortion levels; collecting electroencephalogram signals; intercepting electroencephalogram signal segments; generating a training sample set from the electroencephalogram signal segments; constructing a spatio-temporal multi-scale joint network; training the spatio-temporal multi-scale joint network; and evaluating video quality with the trained network. The spatio-temporal multi-scale joint network constructed by the invention can effectively learn electroencephalogram signal features, and it addresses the failure of the prior art to consider the brain mechanism by which video quality perception is reflected in electroencephalogram signals and the periodic temporal perception characteristic of human vision; the method therefore achieves high accuracy, automatic batch processing of electroencephalogram data, and high efficiency.
Description
Technical Field
The invention belongs to the technical field of image processing, and more specifically to a video quality evaluation method based on electroencephalogram signals and a spatio-temporal multi-scale joint network within the field of image and video quality evaluation. The method can be used to analyze electroencephalogram signals collected while a subject observes a video, yielding an evaluation of the video's quality.
Background
Compression and imperfect transmission inevitably introduce distortion into video, degrading its appearance, so evaluating video quality has become an important and general problem. Video quality evaluation can be divided into objective quality evaluation and subjective quality evaluation. Objective quality evaluation obtains a quality score by building a mathematical model that simulates how the human eye perceives video. Such methods can be implemented entirely in software and offer batch processing, reproducible results, and low cost. However, it remains uncertain whether the quality score computed by an objective model represents the quality actually perceived by a human viewer. Subjective quality assessment usually requires subjects to report whether they can detect distortion or to grade its intensity; this, however, is time-consuming and labor-intensive, depends on subjective judgment, and is susceptible to the subjects' strategies and biases. Electroencephalography, as a non-invasive electrophysiological technique, can directly acquire electroencephalogram signals reflecting neural electrical activity through scalp electrodes, and can therefore be used for video quality evaluation in a simple, safe, and reliable way. It overcomes the deficiencies that objective methods cannot fully reflect subjective perceptual quality and that subjective methods are time-consuming and costly, and it has important theoretical significance and practical value for obtaining the true perceived quality of video.
The patent document "Video quality evaluation method based on electroencephalogram signals and space-time distortion" (application number CN202010341014.5, publication number CN111510710A), filed by Xidian University, discloses a video quality evaluation method based on electroencephalogram signals and spatio-temporal distortion. The method first selects a spatio-temporally distorted video of a fluctuating water surface as the visual stimulus; it then acquires continuous electroencephalogram signals and subjective evaluations and computes the subjective detection rate; finally, it segments the electroencephalogram signals, classifies the segments, and computes the classification accuracy so as to evaluate video quality. Although the method produces video quality evaluation results that agree better with human subjective evaluation and are more accurate, it has the drawback that electroencephalogram features are obtained by dimensionality reduction, considering only single-scale temporal features of the signal and ignoring the spatio-temporal features that characterize the visual-perception components of the electroencephalogram. The extracted features are therefore not sufficiently representative, which affects the classification result and makes the quality evaluation inaccurate.
The published paper "An EEG-Based Study on Perception of Video Distortion Under Different Content Motion Conditions" from Tsinghua University discloses a method for studying video distortion perception under different content motion conditions based on electroencephalogram signals. The method first records electroencephalogram signals while subjects watch distorted videos and, from a feature analysis of those signals, selects the P300 component evoked by perceived changes in video quality as an index of perceived distortion. Classification based on linear discriminant analysis shows that the separability of the P300 component is positively correlated with the perceptibility of distortion, and regression analysis shows an S-shaped quantitative relation between the two; based on this relation, the electroencephalogram signals are used to calibrate distortion perception thresholds for content moving at different speeds. The drawback of this method is that the linear discriminant classifier is a traditional machine learning algorithm: electroencephalogram features must be extracted manually before classification, which is time-consuming and labor-intensive and prevents newly acquired electroencephalogram signals from being processed efficiently.
Disclosure of Invention
The purpose of the invention is to provide a video quality evaluation method based on electroencephalogram signals and a spatio-temporal multi-scale joint network that overcomes the deficiencies of the prior art: it addresses the problem that effective features cannot be extracted because the prior art ignores the brain mechanism by which electroencephalogram signals reflect perceived video quality and the periodic temporal perception characteristic of human vision, which degrades the classification result and makes quality evaluation inaccurate; and it addresses the problem that traditional machine learning requires manual feature extraction and is therefore inefficient.
The specific idea for realizing this purpose is as follows. To address the inaccuracy of existing electroencephalogram-based video quality evaluation, distorted videos with different distortion levels are generated; electroencephalogram signals of subjects watching these videos are collected; electroencephalogram signal segments corresponding to each viewing are intercepted and labelled to generate a training set; the training set is used to train the constructed spatio-temporal multi-scale joint network; and the trained network is then used to evaluate video quality, yielding a more accurate evaluation result.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
(1) acquiring distortion videos with different distortion levels;
(2) acquiring electroencephalogram signals of a subject when watching distorted videos with different distortion levels;
(3) intercepting electroencephalogram signal segments of a subject when watching distorted videos with different distortion levels;
(4) generating a training sample set and a testing sample set by utilizing the electroencephalogram signal segments;
(5) constructing a space-time multi-scale combined network;
(6) training a space-time multi-scale joint network;
(7) and evaluating the video quality by adopting a trained spatio-temporal multi-scale joint network.
Compared with the prior art, the invention has the following advantages:
First, in view of the fact that electroencephalogram signals carry both temporal and spatial information, the invention constructs a spatio-temporal multi-scale joint network that extracts multi-scale features from the temporal and spatial information of the signal. This overcomes the deficiency of prior electroencephalogram-based video quality evaluation models that consider only temporal features, allows the network to learn electroencephalogram features that represent the visual-perception components, and improves classification accuracy; the method therefore has the advantage of high accuracy.
Second, the multi-scale deep neural network provided by the invention is an end-to-end model: the video quality evaluation result is obtained directly from the preprocessed electroencephalogram signals. This avoids the manual feature extraction required by traditional methods before classification, reduces operational complexity, and allows electroencephalogram data to be processed automatically in batches; the method therefore has the advantage of high efficiency.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
FIG. 2 is a schematic structural diagram of a multi-scale spatiotemporal feature extraction network constructed by the present invention.
Detailed Description
The invention is further described below with reference to the figures and examples.
The implementation steps of the present invention are further described with reference to fig. 1 and the embodiment.
Step 1, obtaining distortion videos with different distortion levels;
step 1.1, in this embodiment, the "Frozen Worlds" episode of the documentary Our Planet (2019) is used; a clip of 150 frames with a duration of 5 s, starting at 31 minutes 29 seconds of the episode, is taken as the undistorted stimulus video, with a resolution of 1920 × 1080;
step 1.2, in this embodiment, the Quality parameter of the MATLAB VideoWriter tool is set to each value in {7, 14, 20, 30, 100}; these five values correspond to five distortion parameters representing different distortion levels, with a value of 100 representing no distortion;
step 1.3, the 150-frame undistorted stimulus video is input to MATLAB, frames 60 to 89 of the 150 frames are image-compressed with the VideoWriter tool, and the resulting distorted video stimuli are output, giving 5 spatio-temporally distorted videos corresponding to the distortion parameters as the distorted-stimulus video set;
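The distortion generation of step 1 can be reproduced outside MATLAB. The sketch below is a minimal Python approximation, assuming per-frame JPEG compression stands in for the VideoWriter Quality setting; the quality values, frame indices and helper name are illustrative, not the patent's implementation.

```python
# Hedged sketch: approximate the MATLAB VideoWriter "Quality" distortion by
# JPEG-compressing frames 60-89 of the 150-frame stimulus clip.
import cv2

QUALITIES = [7, 14, 20, 30, 100]          # 100 behaves as effectively undistorted

def make_distorted_video(frames, quality):
    """frames: list of 150 HxWx3 uint8 arrays; returns a new frame list."""
    out = []
    for idx, frame in enumerate(frames):
        if 59 <= idx <= 88:               # frames 60..89 (1-based) get distorted
            ok, buf = cv2.imencode(".jpg", frame,
                                   [cv2.IMWRITE_JPEG_QUALITY, quality])
            frame = cv2.imdecode(buf, cv2.IMREAD_COLOR)
        out.append(frame)
    return out
```

Calling make_distorted_video(frames, q) for each q in QUALITIES and re-encoding the returned frame lists yields one 5 s clip per distortion level.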
step 2, acquiring electroencephalogram signals of a subject when watching distorted videos with different distortion levels:
Continuous electroencephalogram signals generated by each subject while watching the distorted-stimulus video set are acquired with a 64-channel Neuroscan electroencephalograph, yielding an electroencephalogram signal set containing one sample for each viewing of each video by each subject. In this embodiment, 8 subjects are recruited and each subject watches each video 40 times, with a sampling frequency of 1000 Hz and 64 sampling channels, so each subject contributes 200 electroencephalogram signal samples, of which 40 correspond to the undistorted video and 160 to distorted videos. In total there are 1600 electroencephalogram signal samples, of which 320 correspond to the undistorted video and 1280 to distorted videos;
step 3, intercepting electroencephalogram signal segments of a subject when the subject watches distorted videos with different distortion levels:
The acquired electroencephalogram signals are passed through a band-pass filter with lower and upper cut-off frequencies of 0.2 Hz and 30 Hz, respectively, to obtain band-pass-filtered single-trial electroencephalogram signals, and a single-trial segment is intercepted from each filtered signal spanning 200 ms before to 1000 ms after the onset of the distorted stimulus;
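A minimal sketch of this preprocessing is given below, assuming a SciPy Butterworth band-pass in place of the Curry7 filtering and a known distortion-onset sample index; the filter order and function names are assumptions.

```python
# Hedged sketch of step 3: 0.2-30 Hz band-pass followed by epoching from
# 200 ms before to 1000 ms after distortion onset, at fs = 1000 Hz.
import numpy as np
from scipy.signal import butter, filtfilt

FS = 1000                                   # sampling rate in Hz

def bandpass(eeg, low=0.2, high=30.0, order=4):
    """eeg: (n_channels, n_samples) array."""
    b, a = butter(order, [low, high], btype="bandpass", fs=FS)
    return filtfilt(b, a, eeg, axis=-1)

def epoch(eeg, onset_sample, pre_ms=200, post_ms=1000):
    """Cut a single-trial segment around the distortion onset."""
    start = onset_sample - pre_ms * FS // 1000
    stop = onset_sample + post_ms * FS // 1000
    return eeg[:, start:stop]               # shape (64, 1200) at 1000 Hz
```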
step 4, generating a training set and a testing set by utilizing the electroencephalogram signal segments:
step 4.1, 75% of the samples in each subject's set of single-trial electroencephalogram signal segments are taken as the training sample set, and the remaining segments are taken as the test sample set;
4.2, the single-trial electroencephalogram signal segments in the training and test sample sets are labelled: a segment recorded while watching the undistorted video is labelled 1, and a segment recorded while watching a distorted video is labelled 0;
4.3, the single-trial electroencephalogram signal segments in the training sample set are combined with their labels to generate the training set; in this embodiment, each training sample is used as an input for training the spatio-temporal multi-scale joint network model. The single-trial segments in the test sample set are combined with their labels to generate the test set; in this embodiment, each test sample is used, after model training is finished, as an input to the quality evaluation model for distortion detection and prediction;
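The labelling and 75/25 split of step 4 can be sketched as follows; the array shapes and helper name are assumptions, with label 1 for undistorted and 0 for distorted viewings as in step 4.2.

```python
# Hedged sketch of step 4: per-subject 75% / 25% split of labelled segments.
import numpy as np

def split_subject(segments, labels, train_frac=0.75, seed=0):
    """segments: (n_trials, 64, 1200); labels: (n_trials,) in {0, 1}."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(segments))
    n_train = int(train_frac * len(segments))
    tr, te = idx[:n_train], idx[n_train:]
    return (segments[tr], labels[tr]), (segments[te], labels[te])
```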
step 5, constructing a space-time multi-scale combined network:
the embodiment of the invention builds a space-time multi-scale combined network formed by connecting a space-time multi-scale feature extraction sub-network and a space-time multi-scale feature fusion sub-network in series, wherein the space-time multi-scale feature extraction sub-network is formed by connecting a time domain feature extraction module and a space domain feature extraction module in parallel.
The structure of the spatio-temporal multi-scale joint network constructed by the invention is further described with reference to FIG. 2.
Step 5.1, the temporal feature extraction module is built from two sub-modules connected in series; the first sub-module comprises three parallel bidirectional LSTM layers, and the second sub-module is formed by connecting a feature splicing (concatenation) layer, a BN layer, a fully connected layer and a ReLU activation layer in series;
The parameters of each layer of the first sub-module of the temporal feature extraction module are set as follows:
the number of stacked layers of each of the first to third bidirectional LSTM layers is set to 1, and the number of hidden-layer nodes is set to 128;
The parameters of each layer of the second sub-module of the temporal feature extraction module are set as follows:
the feature splicing layer is implemented with the concat function;
the output dimension of the BN layer is set to 6;
the number of neurons of the fully connected layer is set to 2;
the ReLU activation layer is implemented with the ReLU function, and its inplace parameter is set to True;
Step 5.2, the spatial feature extraction module is built from two sub-modules connected in series; the first sub-module consists of three convolutional layers connected in parallel, and the second sub-module is formed by connecting, in order, a feature splicing layer, a BN layer, a fully connected layer and a ReLU activation layer in series;
The parameters of each layer of the first sub-module of the spatial feature extraction module are set as follows:
the convolution kernel sizes of the first, second and third convolutional layers are set to 64 × 1, 32 × 1 and 8 × 1 respectively, the number of convolution kernels in each layer is set to 6, and the strides are set to 1, 32 and 8 respectively;
The parameters of each layer of the second sub-module of the spatial feature extraction module are set as follows:
the output dimension of the BN layer is set to 6;
the number of neurons of the fully connected layer is set to 2;
the ReLU activation layer is implemented with the ReLU function, and its inplace parameter is set to True;
the Dropout layer is implemented with Dropout, and its drop_rate parameter is set to 0.05;
Step 5.3, the spatio-temporal multi-scale feature fusion sub-network is built with the following structure, in order: a feature splicing layer, a fully connected layer, a ReLU activation layer, a Dropout layer and a fully connected layer;
The parameters of each layer of the spatio-temporal multi-scale feature fusion sub-network are set as follows:
the number of neurons of the first fully connected layer is set to 2;
the ReLU activation layer is implemented with ReLU, and its inplace parameter is set to True;
the Dropout layer is implemented with Dropout, and its drop_rate parameter is set to 0.05;
the number of neurons of the second fully connected layer is set to 2. A code-level sketch of the structure set out in steps 5.1 to 5.3 follows below;
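The layer types and the sizes fixed in steps 5.1 to 5.3 (bidirectional LSTMs with 128 hidden nodes, convolution kernels of 64×1, 32×1 and 8×1 with 6 kernels each and strides 1, 32 and 8, two-neuron fully connected layers, ReLU with inplace=True, dropout 0.05) can be assembled as in the following PyTorch sketch. The patent does not specify how the multi-scale temporal inputs are formed or how the convolution outputs are pooled before the fully connected layer, so the down-sampling factors, the adaptive pooling and the BatchNorm dimensions below are assumptions rather than the claimed implementation.

```python
# Hedged PyTorch sketch of the spatio-temporal multi-scale joint network (step 5).
import torch
import torch.nn as nn


class TemporalBranch(nn.Module):
    """Three parallel bidirectional LSTMs over the EEG at three time scales."""

    def __init__(self, n_channels=64, hidden=128, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.lstms = nn.ModuleList(
            nn.LSTM(n_channels, hidden, num_layers=1,
                    batch_first=True, bidirectional=True)
            for _ in scales)
        self.head = nn.Sequential(                     # splice -> BN -> FC -> ReLU
            nn.BatchNorm1d(2 * hidden * len(scales)),
            nn.Linear(2 * hidden * len(scales), 2),
            nn.ReLU(inplace=True))

    def forward(self, x):                              # x: (B, 64, T)
        feats = []
        for s, lstm in zip(self.scales, self.lstms):
            xs = x[:, :, ::s].transpose(1, 2)          # (B, T/s, 64), down-sampled
            out, _ = lstm(xs)
            feats.append(out[:, -1, :])                # last hidden state per scale
        return self.head(torch.cat(feats, dim=1))


class SpatialBranch(nn.Module):
    """Three parallel convolutions whose kernels span 64 / 32 / 8 electrodes."""

    def __init__(self):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(1, 6, kernel_size=(64, 1), stride=(1, 1)),
            nn.Conv2d(1, 6, kernel_size=(32, 1), stride=(32, 1)),
            nn.Conv2d(1, 6, kernel_size=(8, 1), stride=(8, 1))])
        self.pool = nn.AdaptiveAvgPool2d((1, 1))       # assumption: pool over space/time
        self.head = nn.Sequential(                     # splice -> BN -> FC -> ReLU -> Dropout
            nn.BatchNorm1d(18),
            nn.Linear(18, 2),
            nn.ReLU(inplace=True),
            nn.Dropout(0.05))

    def forward(self, x):                              # x: (B, 64, T)
        x = x.unsqueeze(1)                             # (B, 1, 64, T)
        feats = [self.pool(conv(x)).flatten(1) for conv in self.convs]
        return self.head(torch.cat(feats, dim=1))


class SpatioTemporalNet(nn.Module):
    """Parallel temporal/spatial branches followed by the fusion sub-network."""

    def __init__(self):
        super().__init__()
        self.temporal = TemporalBranch()
        self.spatial = SpatialBranch()
        self.fusion = nn.Sequential(                   # splice -> FC -> ReLU -> Dropout -> FC
            nn.Linear(4, 2), nn.ReLU(inplace=True),
            nn.Dropout(0.05), nn.Linear(2, 2))

    def forward(self, x):
        return self.fusion(torch.cat([self.temporal(x), self.spatial(x)], dim=1))
```

Calling SpatioTemporalNet() on a (batch, 64, 1200) tensor returns two-class logits, matching the two-neuron output of the fusion sub-network.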
step 6, training a space-time multi-scale combined network:
step 6.1, the spatio-temporal multi-scale joint network parameters are initialized: the regularization coefficient is set to 1 × 10⁻⁵, the initial learning rate is set to 2 × 10⁻⁴ and is decayed to 1/10 of its value every 50 iterations, the number of iterations is set to 150, the batch size is set to 64, and the weight decay is set to 2 × 10⁻³;
Step 6.2, the training set is input into the spatio-temporal multi-scale joint network, and the network parameters are updated iteratively by back-propagation until the loss function converges, yielding the trained spatio-temporal multi-scale joint network;
the loss function is as follows:
wherein,representing a cross entropy loss function, n representing the total number of classification results after all samples in the training set are input into the multi-scale joint neural network, Σ representing summation operation, i representing the serial number of the multi-scale joint neural network classification result corresponding to the ith sample in the training set, y (i) representing the real value of the multi-scale joint neural network ith classification result corresponding to the ith sample in the training set, and log representing logarithm operation with 10 as a base,the predicted value of the ith classification result of the multi-scale joint neural network corresponding to the ith sample in the training set is represented, | | represents the operation of taking the absolute value,denotes the introduced L1 regularization term, k denotes the L1 regularization coefficient,ω i The weight value of the ith classification result of the multi-scale joint neural network corresponding to the ith sample in the training set is represented;
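Step 6 can be sketched as the following training loop, assuming the mini-batch stochastic gradient descent mentioned in the simulation section, the hyperparameters of step 6.1 (with the 150 iterations interpreted as training epochs), and an explicit L1 penalty added to the cross-entropy loss; the data-loading details are illustrative.

```python
# Hedged sketch of step 6: cross-entropy plus L1 penalty (coefficient 1e-5),
# lr 2e-4 decayed by 10x every 50 epochs, 150 epochs, batch 64, weight decay 2e-3.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train(model, segments, labels, l1_coef=1e-5, epochs=150):
    ds = TensorDataset(torch.as_tensor(segments, dtype=torch.float32),
                       torch.as_tensor(labels, dtype=torch.long))
    loader = DataLoader(ds, batch_size=64, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=2e-4, weight_decay=2e-3)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=50, gamma=0.1)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            loss = ce(model(x), y)
            loss = loss + l1_coef * sum(p.abs().sum() for p in model.parameters())
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
    return model
```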
Step 7, evaluating the video quality with the trained spatio-temporal multi-scale joint network:
step 7.1, the test set of electroencephalogram signal segments with distortion labels is input into the trained spatio-temporal multi-scale joint network, and each classification output of the network is judged: if the output is 1, the video corresponding to the electroencephalogram signal segment is evaluated as an undistorted stimulus video; if the output is 0, it is evaluated as a distorted stimulus video;
step 7.2, the classification results of the spatio-temporal multi-scale joint network over all electroencephalogram signal segments are tallied according to whether each judgment agrees with the actual condition. Using the label of each segment, the classification result for the distorted and undistorted stimulus videos corresponding to the electroencephalogram signals is obtained and taken as the video quality evaluation result.
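A sketch of the evaluation in step 7 follows, assuming NumPy label arrays and the two-class network sketched above; the helper name and return values are illustrative.

```python
# Hedged sketch of step 7: map output 1 -> undistorted, 0 -> distorted,
# and tally agreement with the labels as the quality-evaluation result.
import torch

@torch.no_grad()
def evaluate(model, segments, labels):
    model.eval()
    x = torch.as_tensor(segments, dtype=torch.float32)
    pred = model(x).argmax(dim=1).numpy()
    correct = (pred == labels)
    per_class = {c: correct[labels == c].mean() for c in (0, 1)}
    return pred, per_class, correct.mean()   # predictions, per-class acc, overall acc
```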
The implementation process of the present invention is further described below with reference to simulation experiments:
1, simulation conditions:
the hardware test platform of the simulation experiment is as follows: the CPU is Intel (R) core (TM) i7-8700, the dominant frequency is 3.2GHz, the memory is 16GB, and the GPU is NVIDIA GeForce GT 710.
The software testing platform of the simulation experiment is as follows: the system comprises a Widows7 operating system, professional electroencephalogram acquisition and analysis software Curry7, a psychological experiment operating platform E-Prime 2.0 and mathematical software MATLAB R2019 a.
2, simulation content and result analysis:
the simulation experiment of the invention is to adopt the method of the invention and two prior arts to classify the electroencephalogram signals respectively and calculate the average accuracy of classification.
The two prior-art methods are:
Prior art 1 refers to the electroencephalogram classification method proposed by Vernon J. Lawhern et al. in the paper "EEGNet: a compact convolutional neural network for EEG-based brain-computer interfaces".
Prior art 2 refers to the electroencephalogram classification method proposed by Robin Tibor Schirrmeister et al. in the paper "Deep learning with convolutional neural networks for EEG decoding and visualization".
The simulation experiment collects electroencephalogram signals from 8 subjects repeatedly watching the distorted and undistorted stimulus videos as follows. The distortion levels of the stimulus videos are 7, 14, 20, 30 and 100; after the undistorted stimulus video has played to frame 59, frames 60 to 89 of the distorted stimulus video generated at one of the five distortion levels are played, and the subject's electroencephalogram signals are collected. The acquisition procedure consists of four computer screen interfaces. The first is an introduction interface, which explains the requirements of the simulation experiment. The second is a black-screen interface with a white "+" sign in the middle of the black background, separating successive experimental videos. The third is a video presentation interface, which presents a stimulus video that changes from undistorted to distorted; after it is presented, the procedure returns to the second interface in preparation for the next undistorted-to-distorted presentation. The stimulus video corresponding to each distortion level is played 40 times, in random order. The fourth is an end interface, which is entered after all distorted stimulus videos have been presented.
The collected electroencephalogram signals are classified in the simulation experiment as follows. First, the acquired signals undergo re-referencing, baseline correction, filtering and segmentation, and single-trial electroencephalogram signal segments are extracted. Second, the segmented signals are input into the temporal bidirectional long short-term memory network and the spatial convolutional network, respectively, to obtain the temporal and spatial features of each single-trial segment. The temporal and spatial features are then input into the feature fusion module to obtain spatio-temporal features; two fully connected operations are applied to the spatio-temporal features, the classification result is output with a SoftMax function, the error gradient is computed from the cross-entropy loss function, and the parameters of the spatio-temporal multi-scale joint network are optimized with mini-batch stochastic gradient descent, thereby training the network. Finally, the trained spatio-temporal multi-scale joint network is used to classify the electroencephalogram signals, and the accuracy is computed.
The classification results of the three methods are evaluated with two indices: the per-class classification accuracy and the average accuracy AA. The average accuracy AA is computed with the following formula, and all results are listed in Table 1:
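By its usual definition, the average accuracy AA is the mean of the per-class classification accuracies; with the two classes used here this is

$$\mathrm{AA} = \frac{1}{C}\sum_{c=1}^{C}\frac{N_c^{\mathrm{correct}}}{N_c}, \qquad C = 2,$$

where $N_c$ is the number of test segments of class $c$ and $N_c^{\mathrm{correct}}$ is the number of those classified correctly.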
TABLE 1
As shown in Table 1, the average classification accuracy AA of the proposed method is 87.5%, higher than that of the two prior-art methods, indicating that the proposed method classifies the electroencephalogram signals more accurately.
The above simulation experiments show that the proposed method can extract multi-scale spatio-temporal features of electroencephalogram signals with the constructed spatio-temporal multi-scale joint network. It solves the problem that the prior art cannot extract effective features because it ignores the brain mechanism by which electroencephalogram signals reflect perceived video quality and the staged temporal perception characteristic of human vision, which degrades the classification result and makes quality evaluation inaccurate; it is therefore a highly practical video quality evaluation method.
Claims (4)
1. A video quality evaluation method based on electroencephalogram signals and a spatio-temporal multi-scale joint network, characterized in that a training set and a test set are generated from electroencephalogram signal segments corresponding to distorted stimulus videos, and a multi-scale spatio-temporal feature extraction network is constructed and trained; the evaluation method comprises the following steps:
step 1, generating distortion videos with different distortion levels;
selecting a natural video with a duration of at least 4 s and a frame rate of 30 frames per second as the stimulus video, and applying distortion processing to the stimulus video at 5 distortion levels representing different degrees of distortion to obtain distorted videos with different distortion levels;
step 2, collecting single electroencephalogram signals of a subject when the subject watches distorted videos with different distortion levels:
acquiring, with an electroencephalogram signal collector, single-trial electroencephalogram signals from at least 8 subjects, each repeatedly watching each distorted video at least 30 times, with a sampling frequency of 1000 Hz and 64 sampling channels;
step 3, intercepting electroencephalogram signal segments of a subject when the subject watches distorted videos with different distortion levels:
band-pass filtering the acquired single-trial electroencephalogram signals with a band-pass filter whose lower cut-off frequency lies between 0.2 and 0.5 Hz and whose upper cut-off frequency lies between 5 and 30 Hz to obtain band-pass-filtered single-trial electroencephalogram signals, and intercepting single-trial electroencephalogram signal segments with a duration between 800 and 1200 ms;
step 4, generating a training set and a testing set by utilizing the single electroencephalogram signal segment:
selecting at least 75% of the electroencephalogram signal segments from the segment set as the training set and the remaining segments as the test set, and labelling the segments in both sets, wherein the training set is used to train the spatio-temporal multi-scale joint network and the test set is used as the input of the trained network for distortion prediction;
step 5, constructing a space-time multi-scale combined network:
constructing a space-time multi-scale combined network formed by connecting a space-time multi-scale feature extraction sub-network and a space-time multi-scale feature fusion sub-network in series, wherein the space-time multi-scale feature extraction sub-network is formed by connecting a time domain feature extraction module and a space domain feature extraction module in parallel;
step 5.1, a time domain feature extraction module is set up, the time domain feature extraction module is formed by connecting two sub-modules in series, the first sub-module comprises three parallel bidirectional LSTM layers, and the second sub-module is formed by connecting a feature splicing layer, a BN layer, a full connection layer and a ReLU activation layer in series in sequence;
step 5.2, a space domain feature extraction module is built, the space domain feature extraction module is formed by connecting two sub-modules in series, the first sub-module is formed by connecting three convolution layers in parallel, and the second sub-module sequentially has the following structure: the characteristic splicing layer, the BN layer, the full connecting layer and the ReLU activation layer are connected in series;
step 5.3, building a space-time multi-scale feature fusion sub-network, wherein the structure of the sub-network sequentially comprises the following steps: the device comprises a characteristic splicing layer, a full connection layer, a ReLU activation function layer, a Dropout layer and a full connection layer;
step 6, training a space-time multi-scale combined network:
initializing parameters of the space-time multi-scale joint network, inputting a training sample set into the space-time multi-scale joint network, and iteratively updating the parameters of the space-time multi-scale joint network by using a back propagation method until a cross entropy loss function is converged to obtain the trained space-time multi-scale joint network;
and 7, evaluating the video quality by adopting a trained space-time multi-scale combined network:
inputting the electroencephalogram signal segments with distortion labels in the test set into the trained spatio-temporal multi-scale joint network, judging each classification output of the network, and tallying the classification results of the network over all electroencephalogram signal segments according to whether each judgment agrees with the actual condition; obtaining, from the label of each electroencephalogram signal segment, the classification result of the distorted and undistorted stimulus videos corresponding to the electroencephalogram signals, and taking that classification result as the video quality evaluation result.
2. The video quality evaluation method based on electroencephalogram signals and a spatio-temporal multi-scale joint network according to claim 1, characterized in that the distortion levels in step 1 refer to: K image distortion levels $q = \{q_1, q_2, \ldots, q_k, \ldots, q_K\}$ set in the vicinity of the human-eye distortion perception threshold, the spacing between adjacent distortion levels being a just-noticeable distortion change; the human-eye distortion perception threshold refers to the degree of image distortion at which the human eye can just observe the distortion, and a just-noticeable distortion change refers to gradually increasing or decreasing the degree of distortion from the current distortion until the human eye can just perceive the change, where $4 \le K \le 6$.
3. The video quality evaluation method based on electroencephalogram signals and a spatio-temporal multi-scale joint network according to claim 1, characterized in that the distortion processing in step 1 refers to: taking 30 consecutive frames from the stimulus video, compressing each of these frames to a degree determined by each distortion level, and recombining them with the remaining unprocessed frames of the stimulus video to obtain the distorted videos corresponding to the different distortion levels.
4. The video quality evaluation method based on electroencephalogram signals and a spatio-temporal multi-scale joint network according to claim 1, characterized in that the cross-entropy loss function in step 6 is as follows:

$$L = -\frac{1}{n}\sum_{i=1}^{n} y^{(i)}\log\hat{y}^{(i)} + k\sum_{i}\left|\omega_{i}\right|$$

where $L$ denotes the cross-entropy loss function, $n$ denotes the total number of classification results after all samples in the training set are input into the multi-scale joint neural network, $\Sigma$ denotes summation, $i$ denotes the index of the classification result corresponding to the $i$-th training sample, $y^{(i)}$ denotes the true value of the $i$-th classification result, $\log$ denotes the base-10 logarithm, $\hat{y}^{(i)}$ denotes the predicted value of the $i$-th classification result, $|\cdot|$ denotes the absolute-value operation, $k\sum_{i}|\omega_{i}|$ denotes the introduced L1 regularization term, $k$ denotes the L1 regularization coefficient, and $\omega_{i}$ denotes the weight associated with the $i$-th classification result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210601991.3A CN115022617B (en) | 2022-05-30 | 2022-05-30 | Video quality evaluation method based on electroencephalogram signal and space-time multi-scale combined network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115022617A true CN115022617A (en) | 2022-09-06 |
CN115022617B CN115022617B (en) | 2024-04-19 |
Family
ID=83070314
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210601991.3A Active CN115022617B (en) | 2022-05-30 | 2022-05-30 | Video quality evaluation method based on electroencephalogram signal and space-time multi-scale combined network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115022617B (en) |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170034024A1 (en) * | 2015-07-28 | 2017-02-02 | Futurewei Technologies, Inc. | Parametric model for video scoring |
CN106408037A (en) * | 2015-07-30 | 2017-02-15 | 阿里巴巴集团控股有限公司 | Image recognition method and apparatus |
WO2018200734A1 (en) * | 2017-04-28 | 2018-11-01 | Pcms Holdings, Inc. | Field-of-view prediction method based on non-invasive eeg data for vr video streaming services |
CN107220599A (en) * | 2017-05-16 | 2017-09-29 | 北京信息科技大学 | Image quality evaluating method based on EEG signal |
CN107609492A (en) * | 2017-08-25 | 2018-01-19 | 西安电子科技大学 | Distorted image quality based on EEG signals perceives evaluation method |
CN108090902A (en) * | 2017-12-30 | 2018-05-29 | 中国传媒大学 | A kind of non-reference picture assessment method for encoding quality based on multiple dimensioned generation confrontation network |
CN111510710A (en) * | 2020-04-27 | 2020-08-07 | 西安电子科技大学 | Video quality evaluation method based on electroencephalogram signals and space-time distortion |
US20220095988A1 (en) * | 2020-09-30 | 2022-03-31 | Tsinghua University | Method and apparatus for determining quality grade of video data |
CN112288657A (en) * | 2020-11-16 | 2021-01-29 | 北京小米松果电子有限公司 | Image processing method, image processing apparatus, and storage medium |
CN113255789A (en) * | 2021-05-31 | 2021-08-13 | 西安电子科技大学 | Video quality evaluation method based on confrontation network and multi-tested electroencephalogram signals |
CN113662565A (en) * | 2021-08-09 | 2021-11-19 | 清华大学 | Video playing quality evaluation method and device based on electroencephalogram characteristics |
Non-Patent Citations (2)
Title |
---|
- ELENI KROUPI et al.: "EEG correlates during video quality perception", 2014 22nd European Signal Processing Conference, 13 November 2014 (2014-11-13) *
- 武天妍: "Visual perception characteristics and image quality evaluation based on electroencephalogram signals" (基于脑电信号的视觉感知特性与影像质量评价), China Master's Theses Full-text Database, 15 May 2021 (2021-05-15) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116636815A (en) * | 2023-06-06 | 2023-08-25 | 中国人民解放军海军特色医学中心 | Electroencephalogram signal-based sleeping quality assessment method and system for underwater operators |
CN116636815B (en) * | 2023-06-06 | 2024-03-01 | 中国人民解放军海军特色医学中心 | Electroencephalogram signal-based sleeping quality assessment method and system for underwater operators |
Also Published As
Publication number | Publication date |
---|---|
CN115022617B (en) | 2024-04-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||