
CN117975090A - Character interaction detection method based on intelligent perception - Google Patents


Info

Publication number
CN117975090A
CN117975090A
Authority
CN
China
Prior art keywords
interaction
information
network
suggestion
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311650341.9A
Other languages
Chinese (zh)
Inventor
谭守标
王志方
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202311650341.9A priority Critical patent/CN117975090A/en
Publication of CN117975090A publication Critical patent/CN117975090A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a character interaction detection method based on intelligent perception, which comprises the following steps: an interaction suggestion branch is set up, higher-quality interaction suggestions are obtained through an interaction suggestion network, and interaction semantic information and interaction internal position information are obtained through an interaction structure component; in the network flow, the backbone network extracts a visual feature sequence carrying global context information and sends the feature sequence into the interaction suggestion branch and the interaction prediction branch; the interaction suggestion branch obtains high-quality interaction suggestions, which serve as queries for the interaction prediction branch, and the final HOI triplet is predicted in combination with the interaction action category branch. The invention acquires interaction information from two aspects, interaction structure and interaction category; it enhances interaction understanding by modifying the decoder to use the interaction information efficiently, and introduces CLIP into the interaction category branch for auxiliary discrimination. Extensive experiments on the HICO-Det and V-COCO benchmarks demonstrate the effectiveness of the design.

Description

Character interaction detection method based on intelligent perception
Technical Field
The invention relates to the technical field of intelligent perception, in particular to a character interaction detection method based on intelligent perception.
Background
In the field of intelligent perception, recognizing the interactive actions between a person and an object is of great significance. In industrial production, action detection can flag incorrect operation of instruments and meters, such as dangerous operation steps or operations performed at unreasonable times. High-specification laboratories and workshops can identify non-standard behaviors such as smoking, making a phone call or eating through abnormal-behavior detection. In robotics research, action detection helps a robot understand human behavior and strengthens its ability to imitate actions.
Such detection is based on the interactive actions between people and objects: the human-object interaction (HOI) detection task locates the positions of people and objects in an image, detects whether an interaction exists and determines its category, and thereby infers the HOI triplet <subject, predicate, object> in the scene.
Typically, conventional HOI detectors use convolutional networks to handle the HOI prediction task indirectly, converting it into surrogate regression and classification problems over people, objects and interactions. Researchers have designed detectors along different lines. Some methods use points: PPDM proposes a parallel point detection and matching method for HOI, representing the human and the object by the center points of their bounding boxes instead of the boxes themselves. GGNet, a further improvement on PPDM, proposes a glance-and-gaze network that adaptively infers a set of action-aware points to represent the interaction area. ATL builds an interaction feature library by extracting interaction features of objects and combining them with other object detection databases; while adding data, it adds the interaction features to the object representation when reasoning about interactions. However, such indirect methods suffer from the suboptimal-solution problem.
The most recent HOI detectors take the same direction to overcome the suboptimal-solution problem: following the Transformer-style DETR detector, they treat HOI detection as a set prediction problem and adopt the end-to-end concept. For example, the HOTR method is the first to use the DETR structure for the HOI problem; it employs parallel instance and interaction decoders and proposes human-object pointers to combine interaction pairs when matching triplets. The single-decoder pipeline QPIC adopts a network framework similar to HOITrans and replaces location-granularity attention with query-granularity attention to improve the model's capability in difficult scenes.
In view of the above, the invention provides a character interaction detection method based on intelligent perception.
Disclosure of Invention
The invention aims to provide a character interaction detection method based on intelligent perception, which solves the existing problems.
In order to achieve the above purpose, the present invention provides the following technical solutions:
A character interaction detection method based on intelligent perception comprises the following steps:
setting up an interaction suggestion branch, obtaining higher-quality interaction suggestions through an interaction suggestion network, and obtaining interaction semantic information and interaction internal position information through an interaction structure component;
in the network flow, the backbone network extracts a visual feature sequence carrying global context information and sends the feature sequence into the interaction suggestion branch and the interaction prediction branch;
the interaction suggestion branch obtains high-quality interaction suggestions, which serve as queries for the interaction prediction branch, and the final HOI triplet is predicted in combination with the interaction action category branch.
Preferably, the backbone network comprises:
Extracting image features using a convolutional neural network and a Transformer encoder as the backbone network, wherein the convolutional neural network uses ResNet and the encoder adopts the multi-layer form of the Transformer, each layer comprising a multi-head self-attention module and a two-layer feed-forward network;
for a given image, a visual feature map f_c ∈ R^(C×H×W) is first extracted using the convolutional neural network; the channel dimension of the feature map is then reduced from C to d by a 1×1 convolution block, at which point the dimension of the feature map becomes R^(d×H×W); the feature information input to the encoder must be in sequence form, so the spatial dimensions of the feature map are flattened before being fed to the encoder, and the encoder combines the global image information to obtain the feature map f_input ∈ R^(HW×d).
Preferably, the interaction location suggestion branch has the following three parts: human-object instance detection, interaction suggestion network, interaction structure;
The human-object instance detection part sends the feature map f_input output by the backbone network into a decoder and obtains human and object instances through an object detection head; to obtain high-quality human-object interaction suggestions, the human and object instances are connected pairwise so that all possible human-object pairs are constructed; the interaction suggestion network then predicts, through an MLP layer, the likelihood that each individual human-object pair interacts, yielding its interaction suggestion score, and after the interaction suggestion network only the top-N scoring human-object pairs serve as the interaction suggestions for the interaction prediction branch;
The interaction structure component utilizes the information output by the interaction suggestion network to construct the semantic relation category between interactions and the structure information in the interactions.
Preferably, the human-object instance detection comprises:
Connecting each detected person and object instance and sending all generated person-object pairs into the interaction suggestion network; for the detected instances, the spatial information of the person and object instances is represented by the quadruples b_h = (c_h, h_h, w_h) and b_o = (c_o, h_o, w_o),
wherein c_h and c_o are the center-point coordinates of the person and the object, and h and w are the height and width of the bounding boxes.
Preferably, the interaction suggestion network comprises:
After all person-object pairs are generated, their spatial information is sent into the interaction suggestion network, which learns to predict the probability that each person-object pair has an interactive action; the interaction suggestion prediction task is defined as a binary classification problem.
Preferably, the interactive structure comprises:
the interaction structure can be divided into the semantic categories between interactions and the interaction internal structure; N high-quality interaction suggestions and their position information are obtained through the interaction suggestion network, and the semantic information and position information are combined to obtain richer interaction structure information as prior knowledge for the subsequent interaction category prediction.
Preferably, the interactive prediction branch comprises:
A decoder with stronger perception of interaction information is designed so that the N interaction suggestions generated by the interaction suggestion branch and the interaction structure information can be fused; by modifying the self-attention and cross-attention layers of DETR, the decoder can query the N high-quality suggestions under the guidance of the interaction structure information, and after the decoder processes them, HOI representations rich in interaction information are obtained.
Attention:
The attention used in the Transformer is defined as follows: it consists of a query sequence Q = (q_1, q_2, ..., q_m), a key sequence K = (k_1, k_2, ..., k_n) and a value sequence V = (v_1, v_2, ..., v_n), and the output of the attention operation is O = (o_1, o_2, ..., o_m), obtained through the softmax function as

o_i = Σ_j softmax_j( (q_i W_q)(k_j W_k)^T / √d_k ) (v_j W_v)

where W_q, W_k, W_v are the learnable weight matrices of Q, K, V respectively, and d_k is the dimension of K; the dimension of the output o_i is the same as the dimension of the query Q.
Self-attention layer:
In the conventional Transformer decoder, the output of the encoder is used as input for the self-attention operation; here, in order to use the inter-interaction structure information to perceive the interaction region, the self-attention layer is modified as follows: the N interaction suggestions serve as query and Value, and each suggestion combined with the inter-interaction structure information serves as Key:

rep_i = Σ_j softmax_j( (q_i W_q)(MLP(q_j, M_ij) W_k)^T / √d_k ) (q_j W_v)

in the self-attention computation, q_i and q_j are different interaction suggestions, M_ij is the inter-interaction semantic relation matrix and d_k is the dimension of q_j; before the inter-interaction semantic information is used, it passes through the MLP layer together with the interaction suggestions. After the self-attention layer each interaction suggestion has a weight map, and the other human-object pairs related to q_i are associated according to the inter-interaction semantic relations for subsequent interaction detection and classification; the output finally obtained by this layer is rep = (rep_1, rep_2, ..., rep_N).
Cross-attention layer:
After the self-attention layer, the interaction suggestions yield the HOI features rep = (rep_1, rep_2, ..., rep_N); rep then performs the cross-attention operation with the image features f_input obtained from the backbone network: the N suggestions rep serve as the query, and the image features f_input = (f_1, f_2, ..., f_n) output by the encoder serve as Value and Key, where the Key fuses the image feature, the position encoding and the interaction internal-structure information as follows:

k_j = f_j + pos_j + M_in

wherein rep_i is the query representation of the i-th interaction suggestion, M_in is the interaction internal-structure information matrix and pos is the position encoding; finally, the interaction representation is sent to the interaction classifier, i.e. the MLP layer, and the predicted HOI category is obtained with the assistance of the interaction category branch.
Preferably, training and reasoning includes:
In the training process, loss function computation and back-propagation are performed on the interaction suggestion network and the interaction prediction branch; the interaction action category branch and the human-object instance detection part do not participate in training. Both losses are computed with the focal loss function L_FL, and the loss function L_p of the interaction suggestion network is designed as

L_p = (1/N) Σ_{i=1..N} L_FL(p̂_i, p_i)

where p̂_i is the prediction result of the i-th interaction suggestion and p_i is the ground truth of the i-th interaction suggestion; because of the severe long-tail distribution problem in HOI datasets, an α parameter is added to the focal loss function and set to 0.25, alleviating sample imbalance by down-weighting the positive or negative samples. The interaction prediction branch loss function L_a is designed as

L_a = (1/N) Σ_{i=1..N} Σ_a L_FL(ĥ_ia, h_ia)

where ĥ_ia and h_ia are, respectively, the prediction probability after the CLIP similarity calculation and the ground truth of the a-th action for the i-th query. The final overall loss function L_CISC is

L_CISC = L_p + λL_a (8)

where λ is a hyperparameter set to 0.5.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a brand-new HOI detector, CISC, which improves interaction understanding capability from the two directions of interaction structure and interaction category, thereby enhancing action detection precision in the intelligent perception field. For the interaction structure, an interaction-aware decoder is designed that combines the interaction structure information more efficiently and uses high-quality suggestions provided in advance to strengthen understanding of the interaction structure; for the interaction category, CLIP is used to assist and promote understanding of interaction categories. CISC reaches 32.72 mAP on HICO-Det and 65.3/70.1 mAP on V-COCO, reflecting its effectiveness, and experimental analysis shows that high-quality suggestions and enhanced interaction understanding are key to HOI detection.
The invention provides a new HOI detector, CISC, and proves that it improves HOI detection and understanding capability; it offers a new paradigm for understanding and detecting interaction structure and interaction actions; a large number of experiments on the HICO-Det and V-COCO datasets verify the efficiency of the proposed method; and it enhances action detection capability in the intelligent perception field.
Drawings
FIG. 1 is a schematic diagram of human interaction examples of the present invention (human-holds-mouse / human-watches-display);
Fig. 2 is a schematic diagram of a CISC network structure according to the present invention.
Detailed Description
1. The detection method comprises the following steps:
A character interaction detection method based on intelligent perception comprises the following steps:
setting up an interaction suggestion branch, obtaining higher-quality interaction suggestions through an interaction suggestion network, and obtaining interaction semantic information and interaction internal position information through an interaction structure component;
in the network flow, the backbone network extracts a visual feature sequence carrying global context information and sends the feature sequence into the interaction suggestion branch and the interaction prediction branch;
the interaction suggestion branch obtains high-quality interaction suggestions, which serve as queries for the interaction prediction branch, and the final HOI triplet is predicted in combination with the interaction action category branch.
(1) Backbone network:
Image features are extracted using a convolutional neural network and a Transformer encoder as the backbone network; the convolutional neural network uses ResNet, and the encoder adopts the multi-layer form of the Transformer, each layer comprising a multi-head self-attention module and a two-layer feed-forward network (FFN). For a given image, a visual feature map f_c ∈ R^(C×H×W) is first extracted using the convolutional neural network; the channel dimension of the feature map is then reduced from C to d by a 1×1 convolution block, at which point the dimension becomes R^(d×H×W). The feature information input to the encoder must be in sequence form, so the spatial dimensions of the feature map are flattened before being fed to the encoder, and the encoder combines the global image information to obtain the feature map f_input ∈ R^(HW×d).
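As an illustration of this flow, below is a minimal PyTorch sketch of the backbone (ResNet feature extraction, 1×1 channel reduction from C to d, spatial flattening, Transformer encoder). The module structure, d = 256, and the omission of positional encoding are assumptions of the sketch, not specifics taken from the patent.

```python
import torch
import torch.nn as nn
import torchvision

class Backbone(nn.Module):
    """CNN + Transformer-encoder backbone: ResNet features -> 1x1 conv
    channel reduction (C -> d) -> spatial flattening -> encoder layers,
    each with multi-head self-attention and a two-layer FFN."""
    def __init__(self, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])  # f_c: (B, C=2048, H, W)
        self.reduce = nn.Conv2d(2048, d_model, kernel_size=1)    # 1x1 conv, C -> d
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=2048, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, images):                 # images: (B, 3, H0, W0)
        f_c = self.cnn(images)                 # (B, C, H, W)
        f = self.reduce(f_c)                   # (B, d, H, W)
        seq = f.flatten(2).transpose(1, 2)     # fold spatial dims: (B, H*W, d)
        return self.encoder(seq)               # f_input: (B, H*W, d)

f_input = Backbone()(torch.randn(1, 3, 224, 224))
print(f_input.shape)  # torch.Size([1, 49, 256])
```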
(2) Interaction suggestion branch
In the interaction suggestion branch, there are three parts: human-object instance detection, the interaction suggestion network, and the interaction structure.
The human-object instance detection part sends the feature map f_input output by the backbone network into the decoder and obtains human and object instances through the object detection head. To obtain high-quality human-object interaction suggestions, the human and object instances are connected pairwise so that all possible human-object pairs are constructed; the interaction suggestion network then predicts, through an MLP layer, the likelihood that each individual human-object pair interacts, yielding its interaction suggestion score, and after the interaction suggestion network only the top-N scoring human-object pairs serve as the interaction suggestions for the interaction prediction branch.
The interaction structure component utilizes the information output by the interaction suggestion network to construct the semantic relation category between interactions and the structure information in the interactions.
(121) Human-object instance detection:
Each detected person and object instance is connected with every other, and all generated person-object pairs are sent into the interaction suggestion network. For the detected instances, the invention represents the spatial information of the person and object instances by the quadruples b_h = (c_h, h_h, w_h) and b_o = (c_o, h_o, w_o), where c_h and c_o are the center-point coordinates of the person and the object, and h and w are the height and width of the bounding boxes.
(122) Interaction suggestion network:
After all person-object pairs are generated, their spatial information is sent into the interaction suggestion network, which learns to predict the probability that each person-object pair has an interactive action. The interaction suggestion prediction task is defined as a binary classification problem handled by an MLP head coupled with a Sigmoid activation function. During training and prediction, the invention samples up to N person-object pairs for each input image, including positive and negative samples. A sampled pair is set as a positive sample only if both the human and the object bounding boxes of the predicted pair have an intersection-over-union with the ground truth greater than 0.4. N is set larger than the number of interaction actions in most images, and when there are not enough positive samples, person-object pairs are randomly selected as negative samples.
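A minimal sketch of this step: detected instances are encoded as (center, height, width) quadruples, all human-object pairs are formed, and an MLP with a Sigmoid head scores each pair, keeping the top N. The hidden width and module names are hypothetical; N = 16 follows the experimental details reported later.

```python
import torch
import torch.nn as nn

def to_quadruple(box):
    """(x1, y1, x2, y2) -> (cx, cy, h, w): center point plus box height/width."""
    x1, y1, x2, y2 = box.unbind(-1)
    return torch.stack([(x1 + x2) / 2, (y1 + y2) / 2, y2 - y1, x2 - x1], dim=-1)

class InteractionSuggestionNet(nn.Module):
    """Scores every human-object pair with an MLP + Sigmoid head (binary
    'is there an interaction?' classification) and keeps the top-N pairs."""
    def __init__(self, hidden=256, top_n=16):
        super().__init__()
        self.top_n = top_n
        self.mlp = nn.Sequential(nn.Linear(8, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, human_boxes, object_boxes):
        bh = to_quadruple(human_boxes)            # (H, 4)
        bo = to_quadruple(object_boxes)           # (O, 4)
        # Cartesian product: every human paired with every object.
        pairs = torch.cat([bh.unsqueeze(1).expand(-1, bo.size(0), -1),
                           bo.unsqueeze(0).expand(bh.size(0), -1, -1)], dim=-1)
        scores = torch.sigmoid(self.mlp(pairs.reshape(-1, 8))).squeeze(-1)
        k = min(self.top_n, scores.numel())
        top = scores.topk(k)                      # keep the N highest-scoring pairs
        return pairs.reshape(-1, 8)[top.indices], top.values

net = InteractionSuggestionNet()
suggestions, conf = net(torch.rand(3, 4), torch.rand(5, 4))  # 3 humans, 5 objects
print(suggestions.shape, conf.shape)  # torch.Size([15, 8]) torch.Size([15])
```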
(123) Interaction structure:
The interaction structure can be divided into the semantic categories between interactions and the interaction internal structure; N high-quality interaction suggestions and their position information are obtained through the interaction suggestion network, and the semantic information and position information are combined to obtain richer interaction structure information as prior knowledge for the subsequent interaction category prediction.
Interaction internal structure: in the detection of the human-object example, the invention obtains sufficient human-object pair position information, and the conventional HOI detection does not use interactive internal layout information in most cases, so the invention divides an image into five areas through human and object bounding boxes: background, people, thing frame union, remaining people frames, remaining joint regions, each pair as a suggested person-thing pair, each region is labeled according to this rule.
The invention observes that groups of actions are semantically related in different ways. In the examples given in Fig. 1, when one sees "a person holds a mouse", the same person may also be "sitting on" a chair; likewise, when a "person" drives, another "person" may be "sitting in" the car. In the former example the two HOI instances share the same "human"; in the latter the two HOI instances share the same "object"; thus semantic relations exist between HOI instances. According to common sense, the invention defines the following five semantic relation categories between interactions: sharing the same "human"; sharing the same "object"; both the "human" and the "object" being the same; human-object pair 1 leading to or resulting from human-object pair 2, as shown in Fig. 1; and human-object pair 1 being unrelated to human-object pair 2. The relation is passed to the interaction-aware decoder in the form of a triplet (HOI_1, HOI_2, R), where HOI_i is the position information of the i-th human-object pair and R is the semantic relation category between the pairs.
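The following sketch shows one plausible way to build the inter-interaction semantic relation matrix M from the suggestions' instance indices, covering the five categories above. The cause/consequence class cannot be decided from indices alone, so treating it as annotation- or model-provided is an assumption of the sketch.

```python
import torch

# Semantic relation categories between two interaction suggestions, following
# the five classes in the text. Class 3 (cause/consequence) is not decidable
# from geometry or indices alone, so this sketch only assigns the other classes.
SAME_HUMAN, SAME_OBJECT, SAME_BOTH, CAUSAL, UNRELATED = 0, 1, 2, 3, 4

def relation_matrix(human_ids, object_ids):
    """human_ids[i], object_ids[i]: instance indices of the i-th suggestion.
    Returns M with M[i, j] = semantic relation category between pairs i and j."""
    n = len(human_ids)
    M = torch.full((n, n), UNRELATED, dtype=torch.long)
    for i in range(n):
        for j in range(n):
            same_h = human_ids[i] == human_ids[j]
            same_o = object_ids[i] == object_ids[j]
            if same_h and same_o:
                M[i, j] = SAME_BOTH
            elif same_h:
                M[i, j] = SAME_HUMAN
            elif same_o:
                M[i, j] = SAME_OBJECT
    return M

M = relation_matrix([0, 0, 1], [2, 3, 2])
print(M)  # pairs 0 and 1 share a human; pairs 0 and 2 share an object
```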
(13) Interaction prediction branch
The invention designs a decoder with stronger perception of interaction information so that the N interaction suggestions generated by the interaction suggestion branch and the interaction structure information can be fused; by modifying the self-attention and cross-attention layers of DETR, the decoder can query the N high-quality suggestions under the guidance of the interaction structure information, and after the decoder processes them, HOI representations rich in interaction information are obtained.
Attention:
The attention used in the Transformer is defined as follows: it consists of a query sequence Q = (q_1, q_2, ..., q_m), a key sequence K = (k_1, k_2, ..., k_n) and a value sequence V = (v_1, v_2, ..., v_n), and the output of the attention operation is O = (o_1, o_2, ..., o_m), obtained through the softmax function as

o_i = Σ_j softmax_j( (q_i W_q)(k_j W_k)^T / √d_k ) (v_j W_v)

where W_q, W_k, W_v are the learnable weight matrices of Q, K, V respectively, d_k is the dimension of K, and the dimension of the output o_i is the same as the dimension of the query Q.
Self-attention layer:
In the conventional Transformer decoder, the output of the encoder is used as input for the self-attention operation; here, in order to use the inter-interaction structure information to perceive the interaction region, the self-attention layer is modified as follows: the N interaction suggestions serve as query and Value, and each suggestion combined with the inter-interaction structure information serves as Key:

rep_i = Σ_j softmax_j( (q_i W_q)(MLP(q_j, M_ij) W_k)^T / √d_k ) (q_j W_v)

In the self-attention computation, q_i and q_j are different interaction suggestions, M_ij is the inter-interaction semantic relation matrix and d_k is the dimension of q_j; before the inter-interaction semantic information is used, it passes through the MLP layer together with the interaction suggestions. After the self-attention layer each interaction suggestion has a weight map, and the other human-object pairs related to q_i are associated according to the inter-interaction semantic relations for subsequent interaction detection and classification; the output finally obtained by this layer is rep = (rep_1, rep_2, ..., rep_N).
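A sketch of this modified self-attention, assuming the relation categories are embedded and fused with the suggestions through an MLP to form the keys; the embedding scheme and the single-head formulation are simplifications, not the patent's exact design.

```python
import torch
import torch.nn as nn

class InteractionSelfAttention(nn.Module):
    """Self-attention over the N interaction suggestions. Queries and values
    are the suggestions themselves; the key for attending from i to j fuses
    suggestion j with the relation category M_ij via an MLP."""
    def __init__(self, d_model=256, n_rel=5):
        super().__init__()
        self.rel_embed = nn.Embedding(n_rel, d_model)   # embed relation categories
        self.key_mlp = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU(),
                                     nn.Linear(d_model, d_model))
        self.w_q = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, q, M):                      # q: (N, d), M: (N, N) long
        rel = self.rel_embed(M)                   # (N, N, d)
        kj = self.key_mlp(torch.cat([q.unsqueeze(0).expand(M.size(0), -1, -1),
                                     rel], dim=-1))          # (N, N, d)
        attn = torch.einsum('id,ijd->ij', self.w_q(q), kj) * self.scale
        attn = attn.softmax(dim=-1)               # one weight map per suggestion
        return attn @ self.w_v(q)                 # rep: (N, d)

layer = InteractionSelfAttention()
rep = layer(torch.randn(16, 256), torch.randint(0, 5, (16, 16)))
print(rep.shape)  # torch.Size([16, 256])
```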
Cross-attention layer:
After the self-attention layer, the interaction suggestions yield the HOI features rep = (rep_1, rep_2, ..., rep_N); rep then performs the cross-attention operation with the image features f_input obtained from the backbone network: the N suggestions rep serve as the query, and the image features f_input = (f_1, f_2, ..., f_n) output by the encoder serve as Value and Key, where the Key fuses the image feature, the position encoding and the interaction internal-structure information as follows:

k_j = f_j + pos_j + M_in

wherein rep_i is the query representation of the i-th interaction suggestion, M_in is the interaction internal-structure information matrix and pos is the position encoding; finally, the interaction representation is sent to the interaction classifier, i.e. the MLP layer, and the predicted HOI category is obtained with the assistance of the interaction category branch.
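A companion sketch of the cross-attention layer, assuming additive fusion of image feature, positional encoding and a per-token region label standing in for the interaction internal-structure information M_in; the additive fusion and the region embedding are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class InteractionCrossAttention(nn.Module):
    """Cross-attention from the N suggestion representations to the encoder
    features. Keys sum image feature, positional encoding and an embedding
    of the 5-region internal-structure label of each token."""
    def __init__(self, d_model=256, n_regions=5):
        super().__init__()
        self.region_embed = nn.Embedding(n_regions, d_model)  # M_in per token
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, rep, f_input, pos, region_labels):
        # rep: (N, d) queries; f_input, pos: (HW, d); region_labels: (HW,)
        k = self.w_k(f_input + pos + self.region_embed(region_labels))
        attn = (self.w_q(rep) @ k.T * self.scale).softmax(dim=-1)
        return attn @ self.w_v(f_input)           # HOI representations: (N, d)

layer = InteractionCrossAttention()
hoi = layer(torch.randn(16, 256), torch.randn(49, 256),
            torch.randn(49, 256), torch.randint(0, 5, (49,)))
print(hoi.shape)  # torch.Size([16, 256])
```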
(14) Interaction category branch
Research and experiments show that, with the interaction suggestion branch in place, acquiring HOI instances and interaction position information is not the bottleneck of HOI detection, and the interaction-aware decoder detects locations with high interaction likelihood. Even so, judging the interaction action category remains unsatisfactory, so a module is added to assist this judgment: interaction category knowledge is migrated from the large-scale visual-language pre-trained model CLIP [18] to assist the prediction of interaction action categories.
Text information embedding:
Following the CLIP format, the invention creates the text description "a photo of a person act object" for an HOI, describing the person-interaction triplet <person, act, object>, and the description "a photo of person and object" for samples without interaction. During processing, the description sentences are sent to the pre-trained CLIP text encoder to obtain text embedding vectors T ∈ R^(cls_a×m), where m is the internal dimension of the CLIP encoder and cls_a is the number of interaction action categories; similarity is then computed between the obtained text embedding vectors and the encoder representation vectors after the classifier:

ĥ_ia = σ( θ · rep_i t_a^T )

where σ is the Sigmoid function, θ is the scale coefficient in CLIP, t_a is the text embedding of the a-th action and rep_i is the query representation of the i-th interaction suggestion; after this calculation, the interaction category representation is adjusted by the text embedding vectors to obtain the final interaction category prediction probability ĥ_ia.
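A sketch of the CLIP-assisted scoring using the official OpenAI CLIP package (github.com/openai/CLIP): prompts are built per action category, embedded once by the frozen text encoder, and each query representation is scored against them with CLIP's scale coefficient. The linear projection of rep_i into CLIP's embedding space and the example action list are assumptions of this sketch.

```python
import torch
import clip  # official OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # frozen pre-trained CLIP

actions = ["riding a bicycle", "holding a mouse", "sitting on a chair"]  # illustrative
prompts = [f"a photo of a person {a}" for a in actions] + \
          ["a photo of person and object"]      # prompt for the no-interaction case
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(prompts).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)  # T: (cls_a + 1, m)

proj = torch.nn.Linear(256, text_emb.size(-1)).to(device)  # map rep_i into CLIP space
rep = torch.randn(16, 256, device=device)                  # N interaction queries
theta = model.logit_scale.exp()                            # CLIP scale coefficient
h = (theta * proj(rep).float() @ text_emb.float().T).sigmoid()  # (N, cls_a + 1)
print(h.shape)
```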
(15) Training and reasoning
In the training process, the invention performs loss function computation and back-propagation on the interaction suggestion network and the interaction prediction branch; the interaction action category branch and the human-object instance detection part do not participate in training. Both losses are computed with the focal loss function L_FL, and the loss function L_p of the interaction suggestion network is designed as

L_p = (1/N) Σ_{i=1..N} L_FL(p̂_i, p_i)

where p̂_i is the prediction result of the i-th interaction suggestion and p_i is the ground truth of the i-th interaction suggestion; because of the severe long-tail distribution problem in HOI datasets, an α parameter is added to the focal loss function and set to 0.25, alleviating sample imbalance by down-weighting the positive or negative samples. The interaction prediction branch loss function L_a is designed as

L_a = (1/N) Σ_{i=1..N} Σ_a L_FL(ĥ_ia, h_ia)

where ĥ_ia and h_ia are, respectively, the prediction probability after the CLIP similarity calculation and the ground truth of the a-th action for the i-th query. The final overall loss function L_CISC is

L_CISC = L_p + λL_a (8)

where λ is a hyperparameter set to 0.5.
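A sketch of the loss computation, assuming the standard focal loss form of [19] with γ = 2; only α = 0.25 and λ = 0.5 are taken from the text.

```python
import torch

def focal_loss(pred, target, alpha=0.25, gamma=2.0):
    """Binary focal loss L_FL; alpha down-weights one class to counter the
    long-tail imbalance. gamma = 2 is the common default from [19], assumed here."""
    p_t = pred * target + (1 - pred) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))).mean()

def cisc_loss(p_hat, p, h_hat, h, lam=0.5):
    """L_CISC = L_p + lambda * L_a (equation (8)): suggestion loss plus
    weighted interaction-prediction loss, both computed with focal loss."""
    return focal_loss(p_hat, p) + lam * focal_loss(h_hat, h)

p_hat = torch.rand(16)                          # interaction probability per suggestion
p = torch.randint(0, 2, (16,)).float()          # ground truth (binary)
h_hat = torch.rand(16, 117)                     # per-query action probabilities
h = torch.randint(0, 2, (16, 117)).float()      # 117 action categories as in HICO-Det
print(cisc_loss(p_hat, p, h_hat, h))
```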
2. The experimental method comprises the following steps:
This section demonstrates the effectiveness of CISC. First, the two datasets commonly used in the human-object interaction field, HICO-Det and V-COCO, and the experimental settings are introduced; then the detection performance of CISC is shown. Finally, each component is analyzed through ablation experiments.
(21) Experimental setup
Data set:
V-COCO [1] is a subset of the COCO dataset, containing 5,400 training images and 4,946 test images, with 80 object categories and 29 action categories. For V-COCO, following the setting of [17], the invention reports only the role average precision (AP_role) under scenario 1 and scenario 2 for the more than 25 interactions present in both scenarios. Many interaction instances in V-COCO are defined without object label information: in scenario 1, the object bounding boxes and interactions of such samples must still be predicted correctly, whereas in scenario 2 the bounding boxes of such samples need not be predicted.
HICO-Det is a subset of the HICO dataset, containing 37,536 training images and 9,515 test images, with 600 HOI instance categories, 80 object categories and 117 action categories. The invention evaluates on the entire test set of HICO-Det and reports performance under the Default setting. Following previous settings, mAP is reported over three category sets: 1) full: all 600 HOI instance categories; 2) rare: the 138 HOI instance categories with fewer than 10 training samples; 3) non-rare: the remaining 462 HOI instance categories, i.e., those with no fewer than 10 training samples.
Experimental details:
In the backbone network, the CNN selects ResNet-50, and the encoder and decoder of the interaction detection branch use the pre-trained DETR [10] model and do not participate in training. The invention uses AdamW to optimize the network with a weight decay of 10^-4 and a batch size of 4; N is set to 16; the initial learning rate is 10^-4 and is halved after 60 epochs, with training lasting 90 epochs in total. The model was trained on two Nvidia 2080Ti GPUs.
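The reported optimization schedule maps directly onto standard PyTorch utilities, as sketched below; the placeholder model stands in for the trainable CISC branches (the pre-trained DETR encoder/decoder stays frozen, per the text).

```python
import torch

# AdamW, weight decay 1e-4, initial learning rate 1e-4 halved after 60 epochs,
# 90 epochs total, matching the reported schedule.
model = torch.nn.Linear(256, 117)  # placeholder for the trainable parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60], gamma=0.5)

for epoch in range(90):
    # ... one training pass over the data (batch size 4) ...
    scheduler.step()
```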
(22) Performance analysis
The invention uses the official evaluation code of HICO-Det and V-COCO to obtain mAP; Tables 1 and 2 compare the present invention with recent convolution-based and query-based HOI detectors.
Performance comparison on V-COCO: as shown in Table 1, under scenario 1 and scenario 2 with the same ResNet backbone, the performance of the invention is superior to the listed methods, including two-stage convolution-based methods (e.g., DRG and IDN) and Transformer-style query-based methods (HOTR, AS-Net, HOITrans and QPIC). Query-based methods detect instances and interaction categories better by combining context information, which improves performance; the invention goes further by providing high-quality suggestions in advance to concentrate the decoder's attention on regions more likely to contain interactions, designing an interaction-aware decoder to better exploit the structure information between and within interactions, and using the pre-trained CLIP to assist interaction category judgment, thereby achieving better performance.
Table 1. V-COCO experimental data
Performance comparison on HICO-Det: as shown in Table 2, HICO-Det has three HOI instance category sets, and the invention reports the three sets of mAPs under the Default setting. As with the V-COCO dataset, the design of the invention performs better than the other methods in the table, which is divided into three areas: the upper area lists two-stage CNN detectors, the middle area one-stage CNN detectors, and the lower area query-based detectors.
Notably, the improvement of CISC on the rare category set is more pronounced than on the other two category sets, probably because the improved focal loss function makes the training network focus more on HOI instances with small sample sizes that are difficult to classify, as discussed in the subsequent ablation experiments. The performance on both datasets demonstrates the effectiveness of the invention.
Table 2. HICO-Det experimental data
(23) Ablation experiments
To verify the validity of the components designed in CISC, the invention designs the ablation experiment strategy shown in Table 3, where IP denotes the interaction suggestion branch, ID the interaction-aware decoder, and CLIP the interaction action category branch; a mark indicates that the component is enabled.
Experiments are conducted on the V-COCO dataset. In experiment 1, the encoder-decoder uses the QPIC detector arrangement with no branch added; experiment 2 adds the interaction suggestion branch so that the network has high-quality suggestions; experiment 3 also adds the interaction-aware decoder; experiment 4 is the full CISC of the invention.
Table 3. Impact of network components on performance
In the design of the loss function, the invention modifies the focal loss function [19] so that the network pays more attention to HOI instances with few samples. To verify its effect, the comparative experiment in Table 4 is set up with three groups: no α, α = 0.5, and α = 0.25.
Table 4. Influence of the focal loss coefficient on performance
The experiments show that setting the coefficient affects performance on the rare category set, while according to the experimental data the influence on the non-rare category set is not obvious, which accords with the assumption made when the coefficient was set.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. The character interaction detection method based on intelligent perception is characterized by comprising the following steps of:
setting up an interaction suggestion branch, obtaining higher-quality interaction suggestions through an interaction suggestion network, and obtaining interaction semantic information and interaction internal position information through an interaction structure component;
in the network flow, the backbone network extracts a visual feature sequence carrying global context information and sends the feature sequence into the interaction suggestion branch and the interaction prediction branch;
the interaction suggestion branch obtains high-quality interaction suggestions, which serve as queries for the interaction prediction branch, and the final HOI triplet is predicted in combination with the interaction action category branch.
2. The intelligent perception-based character interaction detection method according to claim 1, wherein the backbone network comprises:
Extracting image features using a convolutional neural network and a Transformer encoder as the backbone network, wherein the convolutional neural network uses ResNet and the encoder adopts the multi-layer form of the Transformer, each layer comprising a multi-head self-attention module and a two-layer feed-forward network;
for a given image, a visual feature map f_c ∈ R^(C×H×W) is first extracted using the convolutional neural network; the channel dimension of the feature map is then reduced from C to d by a 1×1 convolution block, at which point the dimension becomes R^(d×H×W); the feature information input to the encoder must be in sequence form, so the spatial dimensions of the feature map are flattened before being fed to the encoder, and the encoder combines the global image information to obtain the feature map f_input ∈ R^(HW×d).
3. The intelligent perception-based character interaction detection method according to claim 1, wherein the interaction location suggestion branches comprise the following three parts: human-object instance detection, interaction suggestion network, interaction structure;
The human-object instance detection part sends the feature map f_input output by the backbone network into a decoder and obtains human and object instances through an object detection head; to obtain high-quality human-object interaction suggestions, the human and object instances are connected pairwise so that all possible human-object pairs are constructed; the interaction suggestion network then predicts, through an MLP layer, the likelihood that each individual human-object pair interacts, yielding its interaction suggestion score, and after the interaction suggestion network only the top-N scoring human-object pairs serve as the interaction suggestions for the interaction prediction branch;
The interaction structure component utilizes the information output by the interaction suggestion network to construct the semantic relation category between interactions and the structure information in the interactions.
4. The intelligent perception-based person interaction detection method as claimed in claim 1, wherein the person-object instance detection comprises:
Connecting each detected person and object instance and sending all generated person-object pairs into the interaction suggestion network; for the detected instances, the spatial information of the person and object instances is represented by the quadruples b_h = (c_h, h_h, w_h) and b_o = (c_o, h_o, w_o),
wherein c_h and c_o are the center-point coordinates of the person and the object, and h and w are the height and width of the bounding boxes.
5. The intelligent perception based human interaction detection method of claim 1, wherein the interaction suggestion network comprises:
After all person-object pairs are generated, their spatial information is sent into the interaction suggestion network, which learns to predict the probability that each person-object pair has an interactive action; the interaction suggestion prediction task is defined as a binary classification problem.
6. The intelligent perception based character interaction detection method according to claim 1, wherein the interaction structure comprises:
the interaction structure can be divided into the semantic categories between interactions and the interaction internal structure; N high-quality interaction suggestions and their position information are obtained through the interaction suggestion network, and the semantic information and position information are combined to obtain richer interaction structure information as prior knowledge for the subsequent interaction category prediction.
7. The intelligent perception-based person interaction detection method according to claim 1, wherein the interaction prediction branch comprises:
A decoder with stronger perception of interaction information is designed so that the N interaction suggestions generated by the interaction suggestion branch and the interaction structure information can be fused; by modifying the self-attention and cross-attention layers of DETR, the decoder can query the N high-quality suggestions under the guidance of the interaction structure information, and after the decoder processes them, HOI representations rich in interaction information are obtained.
Attention:
The attention used in the Transformer is defined as follows: it consists of a query sequence Q = (q_1, q_2, ..., q_m), a key sequence K = (k_1, k_2, ..., k_n) and a value sequence V = (v_1, v_2, ..., v_n), and the output of the attention operation is O = (o_1, o_2, ..., o_m), obtained through the softmax function as

o_i = Σ_j softmax_j( (q_i W_q)(k_j W_k)^T / √d_k ) (v_j W_v)

where W_q, W_k, W_v are the learnable weight matrices of Q, K, V respectively, and d_k is the dimension of K; the dimension of the output o_i is the same as the dimension of the query Q.
Self-attention layer:
In the conventional Transformer decoder, the output of the encoder is used as input for the self-attention operation; here, in order to use the inter-interaction structure information to perceive the interaction region, the self-attention layer is modified as follows: the N interaction suggestions serve as query and Value, and each suggestion combined with the inter-interaction structure information serves as Key:

rep_i = Σ_j softmax_j( (q_i W_q)(MLP(q_j, M_ij) W_k)^T / √d_k ) (q_j W_v)

in the self-attention computation, q_i and q_j are different interaction suggestions, M_ij is the inter-interaction semantic relation matrix and d_k is the dimension of q_j; before the inter-interaction semantic information is used, it passes through the MLP layer together with the interaction suggestions. After the self-attention layer each interaction suggestion has a weight map, and the other human-object pairs related to q_i are associated according to the inter-interaction semantic relations for subsequent interaction detection and classification; the output finally obtained by this layer is rep = (rep_1, rep_2, ..., rep_N).
Cross-attention layer:
After the self-attention layer, the interaction suggestions yield the HOI features rep = (rep_1, rep_2, ..., rep_N); rep then performs the cross-attention operation with the image features f_input obtained from the backbone network: the N suggestions rep serve as the query, and the image features f_input = (f_1, f_2, ..., f_n) output by the encoder serve as Value and Key, where the Key fuses the image feature, the position encoding and the interaction internal-structure information as follows:

k_j = f_j + pos_j + M_in

wherein rep_i is the query representation of the i-th interaction suggestion, M_in is the interaction internal-structure information matrix and pos is the position encoding; finally, the interaction representation is sent to the interaction classifier, i.e. the MLP layer, and the predicted HOI category is obtained with the assistance of the interaction category branch.
8. The intelligent perception-based character interaction detection method according to claim 1, wherein training and reasoning comprises:
In the training process, loss function computation and back-propagation are performed on the interaction suggestion network and the interaction prediction branch; the interaction action category branch and the human-object instance detection part do not participate in training. Both losses are computed with the focal loss function L_FL, and the loss function L_p of the interaction suggestion network is designed as

L_p = (1/N) Σ_{i=1..N} L_FL(p̂_i, p_i)

where p̂_i is the prediction result of the i-th interaction suggestion and p_i is the ground truth of the i-th interaction suggestion; because of the severe long-tail distribution problem in HOI datasets, an α parameter is added to the focal loss function and set to 0.25, alleviating sample imbalance by down-weighting the positive or negative samples. The interaction prediction branch loss function L_a is designed as

L_a = (1/N) Σ_{i=1..N} Σ_a L_FL(ĥ_ia, h_ia)

where ĥ_ia and h_ia are, respectively, the prediction probability after the CLIP similarity calculation and the ground truth of the a-th action for the i-th query. The final overall loss function L_CISC is

L_CISC = L_p + λL_a (8)

where λ is a hyperparameter set to 0.5.
CN202311650341.9A 2023-12-04 2023-12-04 Character interaction detection method based on intelligent perception Pending CN117975090A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311650341.9A CN117975090A (en) 2023-12-04 2023-12-04 Character interaction detection method based on intelligent perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311650341.9A CN117975090A (en) 2023-12-04 2023-12-04 Character interaction detection method based on intelligent perception

Publications (1)

Publication Number Publication Date
CN117975090A true CN117975090A (en) 2024-05-03

Family

ID=90856925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311650341.9A Pending CN117975090A (en) 2023-12-04 2023-12-04 Character interaction detection method based on intelligent perception

Country Status (1)

Country Link
CN (1) CN117975090A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953590A (en) * 2024-03-27 2024-04-30 武汉工程大学 Ternary interaction detection method, system, equipment and medium


Similar Documents

Publication Publication Date Title
WO2021233112A1 (en) Multimodal machine learning-based translation method, device, equipment, and storage medium
CN110263822B (en) Image emotion analysis method based on multi-task learning mode
CN111582136B (en) Expression recognition method and device, electronic equipment and storage medium
CN110826609B (en) Double-current feature fusion image identification method based on reinforcement learning
CN114360005B (en) Micro-expression classification method based on AU region and multi-level transducer fusion module
CN113204675B (en) Cross-modal video time retrieval method based on cross-modal object inference network
CN114463759A (en) Lightweight character detection method and device based on anchor-frame-free algorithm
CN113554653B (en) Semantic segmentation method based on mutual information calibration point cloud data long tail distribution
CN114913546B (en) Character interaction relation detection method and system
CN112883931A (en) Real-time true and false motion judgment method based on long and short term memory network
CN115240052A (en) Method and device for constructing target detection model
CN116452895A (en) Small sample image classification method, device and medium based on multi-mode symmetrical enhancement
CN115953569A (en) One-stage visual positioning model construction method based on multi-step reasoning
CN117975090A (en) Character interaction detection method based on intelligent perception
CN116452688A (en) Image description generation method based on common attention mechanism
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
CN115909455A (en) Expression recognition method integrating multi-scale feature extraction and attention mechanism
CN113743521B (en) Target detection method based on multi-scale context awareness
Yang et al. GID-Net: Detecting human-object interaction with global and instance dependency
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
Li A deep learning-based text detection and recognition approach for natural scenes
Huo et al. Traffic sign recognition based on resnet-20 and deep mutual learning
CN112765955B (en) Cross-modal instance segmentation method under Chinese finger representation
CN116797821A (en) Generalized zero sample image classification method based on fusion visual information
CN116226322A (en) Mongolian emotion analysis method based on fusion of countermeasure learning and support vector machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination