CN118429897B - Group detection method, group detection device, storage medium and electronic equipment - Google Patents
- Publication number
- CN118429897B (application CN202410881274.XA)
- Authority
- CN
- China
- Prior art keywords
- interaction
- node
- individual
- nodes
- targets
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N5/04—Inference or reasoning models
- G06V10/761—Proximity, similarity or dissimilarity measures
- G06V10/764—Recognition using classification, e.g. of video objects
- G06V10/806—Fusion of extracted features
- G06V10/82—Recognition using neural networks
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06N3/042—Knowledge-based neural networks; Logical representations of neural networks
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
Abstract
The application discloses a group detection method, a group detection device, a storage medium, and an electronic device. The group detection method comprises: determining individual features of each target in an image to be detected; extracting features from externally input interaction prompt words using a first language model to obtain interaction prompt word features; encoding the individual features, guided by the interaction prompt word features, with a pre-trained interaction feature encoding model to obtain interaction features indicating the interaction relationship between any two of the targets; and performing group division and recognition based on the interaction features. The method and device can effectively improve the accuracy of group detection.
Description
Technical Field
The present application relates to neural network technology, and in particular to a group detection method and device, a storage medium, and an electronic device.
Background
In security surveillance, the recognition of abnormal events often depends on the output of group detection: a target group with a common purpose or engaged in the same event (for example, a crowd) is first detected as a group, and a behavior recognition method is then applied for anomaly detection. Current common group detection methods are limited to clustering over handcrafted features such as distance; they require complex parameter tuning and have low accuracy. Some recent techniques focus on the interactive relationships between targets, but the learning of these relationships is not correctly guided, and ambiguous relationships can introduce many unrelated targets, causing group detection errors.
Disclosure of Invention
The application provides a group detection method, a group detection device, a storage medium and electronic equipment, which can effectively improve the accuracy of group detection.
In order to achieve the above purpose, the application adopts the following technical scheme:
a population detection method comprising:
Determining individual characteristics of each target in the image to be detected;
extracting features of externally input interaction prompt words by using a first language model to obtain interaction prompt word features;
Coding the individual features based on the interaction prompt word features by utilizing a pre-trained interaction feature coding model to obtain interaction features for indicating interaction relations between any two targets in the targets;
and carrying out group division and identification based on the interaction characteristics.
Preferably, the interaction feature encoding model is a Transformer model;
The step of encoding the individual features based on the interactive prompt word features comprises the following steps:
Performing position coding based on the position relation among the targets, fusing a position coding result, the interaction prompt word characteristics and the individual characteristics, taking the fused characteristics as input of the interaction characteristic coding model, and performing coding processing by using the interaction characteristic coding model; or alternatively
Performing position coding based on the position relation among the targets, fusing a position coding result and the individual features, and processing the fused features as the input of the interactive feature coding model; before self-attention processing is performed in the interactive feature coding model, the individual features are grouped based on the interactive prompt word features, and the self-attention processing in the interactive feature coding model is performed based on the grouping result.
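As an illustration of the first alternative, the fusion step can be sketched as follows. This is a minimal NumPy sketch with assumed shapes: the function names, the use of element-wise addition for fusion, and the scalar per-target positions are illustrative assumptions, not the patent's exact formulation.

```python
import numpy as np

def sinusoidal_encoding(pos, dim):
    # Standard sinusoidal encoding of a scalar position into `dim` channels.
    i = np.arange(dim // 2)
    freq = 1.0 / (10000 ** (2 * i / dim))
    return np.concatenate([np.sin(pos * freq), np.cos(pos * freq)])

def fuse_features(individual_feats, positions, prompt_feat):
    # individual_feats: (N, D); positions: (N,) scalar position per target;
    # prompt_feat: (D,) interaction prompt word feature, broadcast to all targets.
    n, d = individual_feats.shape
    pos_enc = np.stack([sinusoidal_encoding(p, d) for p in positions])
    return individual_feats + pos_enc + prompt_feat  # element-wise fusion

feats = np.random.randn(4, 8)
fused = fuse_features(feats, np.arange(4.0), np.zeros(8))
print(fused.shape)  # (4, 8)
```

In a full implementation, the fused features would then be fed to the Transformer-based interaction feature encoding model, and the positions would typically be two-dimensional image coordinates rather than scalars.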
Preferably, the grouping and identifying based on the interaction features includes:
inputting the interaction characteristics into a pre-trained behavior inference model, classifying the interaction characteristics by using the behavior inference model, predicting to obtain a group classification result, and predicting the probability of interaction relationship between any two targets based on the interaction characteristics;
taking any two targets with the probability larger than or equal to the interaction threshold value as a pair of targets with interaction relationship;
and carrying out group division based on all the determined interaction relations, dividing targets with interaction relations into the same group, and taking the group classification result as a classification recognition result of the divided group.
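The division step described above amounts to taking connected components over the thresholded interaction graph: any two targets whose interaction probability meets the threshold are linked, and linked targets fall into the same group. A minimal sketch (the function name and the union-find implementation are illustrative, not from the patent):

```python
def group_by_interactions(probs, threshold=0.5):
    # probs[i][j]: predicted probability of an interaction between targets i, j.
    # Returns a partition of the N targets into groups (connected components).
    n = len(probs)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if probs[i][j] >= threshold:      # a pair with an interaction
                parent[find(i)] = find(j)     # union their components
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

probs = [[0, .9, .1, 0], [.9, 0, .2, 0], [.1, .2, 0, .8], [0, 0, .8, 0]]
print(group_by_interactions(probs))  # [[0, 1], [2, 3]]
```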
Preferably, after obtaining the interaction feature and before performing group division and identification based on the interaction feature, the method further comprises:
Constructing a graph network model; the nodes of the graph network model comprise the targets, interaction relations among any two targets and group relations, wherein the group relations are used for representing consistency relations among different interaction relations;
Determining initial features of each node in the graph network model based on the individual features and the interaction features, iteratively updating the features of each node in the graph network model, and determining the output node feature of each node, so that the nodes learn from each other;
encoding the output node characteristics based on the guidance of externally input group prompt words, and determining updated output node characteristics;
the group division and identification based on the interaction characteristics comprises the following steps: and carrying out group division and identification based on the updated interaction characteristics included in the updated output node characteristics.
Preferably, the method further comprises: receiving an externally input individual prompt word, and extracting characteristics of the individual prompt word by utilizing a second language model to obtain characteristics of the individual prompt word;
After acquiring the individual features, the method further comprises: based on the guidance of the individual prompt word characteristics, the trained individual characteristic encoder is utilized to encode the individual characteristics, and updated individual characteristics are obtained;
When the interactive feature encoder is used for encoding the individual features, the individual features are processed based on the updated individual features;
and when the initial characteristics of each node in the graph network model are determined, performing based on the updated individual characteristics.
Preferably, the method further comprises:
And when the individual feature encoder is trained, comparing the updated individual features of the encoded output with the individual prompt word features in similarity for determining a loss function of the individual feature encoder.
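A plausible form of such a similarity-comparison loss is one minus the cosine similarity between the encoder output and the prompt word feature. This is an assumption for illustration; the patent does not fix the exact function:

```python
import numpy as np

def cosine_alignment_loss(encoded, prompt_feat):
    # 1 - cosine similarity: zero when the encoder output is perfectly
    # aligned with the individual prompt word feature.
    num = float(np.dot(encoded, prompt_feat))
    den = float(np.linalg.norm(encoded) * np.linalg.norm(prompt_feat))
    return 1.0 - num / den

print(cosine_alignment_loss(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 0.0
```

Minimizing this loss pulls the encoded individual features toward the semantic representation of the individual prompt words.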
Preferably, the graph network model is a factor graph model;
the nodes of the factor graph comprise conventional nodes and factor nodes;
the conventional nodes comprise target nodes for representing targets and interaction nodes for representing interaction relations between any two targets;
The factor nodes comprise first-class factor nodes and second-class factor nodes. Each factor node corresponds to a triplet of three conventional nodes: for a first-class factor node, the triplet comprises any two targets and the interaction relationship between them; for a second-class factor node, the triplet comprises the three pairwise interaction relationships among any three targets.
Edges are set between each factor node and the conventional nodes in its triplet, and between each interaction node and its two designated target nodes; the designated target nodes are the target nodes of the two targets involved in the interaction relationship represented by that interaction node.
Preferably, the determining the initial feature of each node in the graph network model based on the individual feature and the interaction feature includes:
The initial characteristics of the target node are the individual characteristics of the target corresponding to the node;
the initial characteristics of the interactive nodes are interactive characteristics of the corresponding interactive relations of the nodes;
The initial characteristic of the factor node is the average initial characteristic of three nodes included in the triplet of the factor node.
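The factor-node construction and initialization above can be sketched as follows. The data layout (a dict keyed by ordered target pairs, tagged tuples for factors) and the function name are illustrative assumptions:

```python
import itertools
import numpy as np

def build_factor_graph(individual_feats, interaction_feats):
    # individual_feats: list of N target features.
    # interaction_feats: dict mapping each target pair (i, j), i < j, to its
    # interaction feature. Returns factor nodes with their initial features,
    # each being the mean of its triplet's features.
    n = len(individual_feats)
    factors = []
    # First-class factors: two targets and the interaction between them.
    for i, j in itertools.combinations(range(n), 2):
        f = (individual_feats[i] + individual_feats[j] + interaction_feats[(i, j)]) / 3
        factors.append(("pair", (i, j), f))
    # Second-class factors: the three pairwise interactions among three targets.
    for i, j, k in itertools.combinations(range(n), 3):
        f = (interaction_feats[(i, j)] + interaction_feats[(j, k)] + interaction_feats[(i, k)]) / 3
        factors.append(("triple", (i, j, k), f))
    return factors

feats = [np.full(4, float(t)) for t in range(3)]
ifeats = {p: np.zeros(4) for p in itertools.combinations(range(3), 2)}
factors = build_factor_graph(feats, ifeats)
print(len(factors))  # 3 pair factors + 1 triple factor = 4
```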
Preferably, when the features of each node in the graph network model are iteratively updated:
the feature of each factor node after the current iteration is determined based on that factor node's feature from the previous iteration and the previous-iteration features of the conventional nodes that share an edge with it;
the feature of each conventional node after the current iteration is determined based on that conventional node's feature from the previous iteration and the previous-iteration features of the factor nodes that share an edge with it.
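The iterative update can be sketched with a simple mix-with-neighbor-mean rule. The concrete aggregation function here is an assumption for illustration; in the patent it would be realized by learned parameters:

```python
import numpy as np

def update_step(node_feats, neighbors, alpha=0.5):
    # One iteration: each node mixes its previous feature with the mean of
    # its neighbors' previous features (all reads use last-iteration values).
    new_feats = {}
    for v, f in node_feats.items():
        nbr = neighbors.get(v, [])
        if nbr:
            agg = np.mean([node_feats[u] for u in nbr], axis=0)
            new_feats[v] = alpha * f + (1 - alpha) * agg
        else:
            new_feats[v] = f
    return new_feats

# Two target nodes linked to one interaction node.
feats = {"t0": np.array([1.0]), "t1": np.array([3.0]), "e01": np.array([0.0])}
nbrs = {"e01": ["t0", "t1"], "t0": ["e01"], "t1": ["e01"]}
for _ in range(2):
    feats = update_step(feats, nbrs)
print(float(feats["e01"][0]))  # 1.0
```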
Preferably, the updated interaction characteristic is an updated output node characteristic of the interaction node;
The method further comprises the steps of:
and determining the liveness of each target based on the updated output node characteristics of the target nodes, and determining key targets based on the liveness.
Preferably, the method further comprises:
and determining parameters for updating the characteristics of the graph network model, parameters for encoding the characteristics of the output nodes and parameters for group division and identification through a training process in advance.
Preferably, the loss function of the training process is determined based on a similarity loss of the interaction adjacency matrix and a classification loss of whether each object is a key point, wherein an element in the interaction adjacency matrix indicates whether interaction exists between any two objects.
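One plausible realization of this joint loss uses binary cross-entropy for both terms, a similarity term on the predicted interaction adjacency matrix plus a classification term on whether each target is a key target. The concrete loss functions and the weighting `w` are assumptions; the patent does not name them:

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    # Binary cross-entropy, averaged over all elements.
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def total_loss(pred_adj, gt_adj, pred_key, gt_key, w=1.0):
    # Adjacency-similarity term + key-target classification term.
    return bce(pred_adj, gt_adj) + w * bce(pred_key, gt_key)

pred_adj = np.array([[0.9, 0.1], [0.1, 0.9]])
gt_adj = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = total_loss(pred_adj, gt_adj, np.array([0.8]), np.array([1.0]))
print(loss > 0.0)  # True
```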
A group detection apparatus, comprising: a feature determining unit, a first language model processing unit, an interaction feature encoder, and a group dividing unit;
The characteristic determining unit is used for determining individual characteristics of each target in the image to be detected;
the first language model processing unit is used for extracting characteristics of externally input interaction prompt words by utilizing the first language model to obtain interaction prompt word characteristics;
the interaction feature encoder is used for encoding the individual features based on the interaction prompt word features by utilizing a pre-trained interaction feature encoding model to obtain interaction features for indicating interaction relations between any two targets in the targets;
the group dividing unit is used for dividing and identifying groups based on the interaction characteristics;
wherein the interaction feature encoding model is a Transformer model;
in the interactive feature encoder, the encoding the individual feature based on the interactive prompt word feature includes:
Performing position coding based on the position relation among the targets, fusing a position coding result, the interaction prompt word characteristics and the individual characteristics, taking the fused characteristics as input of the interaction characteristic coding model, and performing coding processing by using the interaction characteristic coding model; or alternatively
Performing position coding based on the position relation among the targets, fusing a position coding result and the individual features, and processing the fused features as the input of the interactive feature coding model; before self-attention processing is performed in the interactive feature coding model, grouping the individual features based on the interactive prompt word features, wherein the self-attention processing in the interactive feature coding model is performed based on the grouping result;
and/or the number of the groups of groups,
The apparatus further comprises a graph network processing unit and a group encoding unit;
The graph network processing unit is used for constructing a graph network model, and for determining initial features of each node in the graph network model based on the individual features and the interaction features, iteratively updating the features of each node, and determining the output node feature of each node so that the nodes learn from each other; the nodes of the graph network model comprise the targets, the interaction relationships between any two targets, and group relationships, where a group relationship represents the consistency between different interaction relationships;
The group coding unit is used for coding the output node characteristics based on the guidance of externally input group prompt words and determining updated output node characteristics;
the group dividing unit is used for dividing and identifying groups based on updated interaction features included in the updated output node features.
Preferably, when the interaction feature encoding model is a Transformer model, in the group dividing unit, the group division and recognition based on the interaction features includes:
Inputting the interaction characteristics into a pre-trained behavior inference model, classifying the interaction characteristics by using the behavior inference model, predicting to obtain a group classification result, predicting the probability of the interaction relationship between any two targets based on the interaction characteristics, and taking any two targets with the probability greater than or equal to an interaction threshold as a pair of targets with the interaction relationship;
and carrying out group division based on all the determined interaction relations, dividing targets with interaction relations into the same group, and taking the group classification result as a classification recognition result of the divided group.
Preferably, the device further comprises a second language model processing unit, which is used for extracting characteristics of the individual prompt words by using a second language model to obtain characteristics of the individual prompt words;
The device further comprises an individual feature encoder, wherein the individual feature encoder is used for encoding the individual feature based on the guidance of the individual prompt word feature to obtain updated individual feature;
when the interactive coding unit utilizes the interactive feature coding model to code the individual features, the method is carried out based on the updated individual features;
And when the initial characteristics of each node in the graph network model are determined in the graph network processing unit, the processing is performed based on the updated individual characteristics.
Preferably, when the individual feature encoder is trained, the updated individual feature of the encoded output is compared with the similarity of the individual prompt word feature to determine a loss function of the individual feature encoder.
Preferably, the graph network model is a factor graph model;
the nodes of the factor graph comprise conventional nodes and factor nodes;
the conventional nodes comprise target nodes for representing targets and interaction nodes for representing interaction relations between any two targets;
The factor nodes comprise first-class factor nodes and second-class factor nodes. Each factor node corresponds to a triplet of three conventional nodes: for a first-class factor node, the triplet comprises any two targets and the interaction relationship between them; for a second-class factor node, the triplet comprises the three pairwise interaction relationships among any three targets.
Edges are set between each factor node and the conventional nodes in its triplet, and between each interaction node and its two designated target nodes; the designated target nodes are the target nodes of the two targets involved in the interaction relationship represented by that interaction node.
Preferably, in the graph network processing unit, the determining initial features of each node in the graph network model based on the individual features and the interaction features includes:
The initial characteristics of the target node are the individual characteristics of the target corresponding to the node;
the initial characteristics of the interactive nodes are interactive characteristics of the corresponding interactive relations of the nodes;
The initial characteristic of the factor node is the average initial characteristic of three nodes included in the triplet of the factor node.
Preferably, in the graph network processing unit, when the features of each node in the graph network model are iteratively updated:
the feature of each factor node after the current iteration is determined based on that factor node's feature from the previous iteration and the previous-iteration features of the conventional nodes that share an edge with it;
the feature of each conventional node after the current iteration is determined based on that conventional node's feature from the previous iteration and the previous-iteration features of the factor nodes that share an edge with it.
Preferably, when the apparatus further includes a graph network processing unit and a group coding unit, the group dividing unit is further configured to determine liveness of each target based on the updated output node characteristics of the target nodes, and determine a key target based on the liveness.
Preferably, the device further comprises a joint training unit, which is used for determining parameters for updating the characteristics of the graph network model, parameters for encoding the characteristics of the output nodes and parameters for grouping and dividing the groups in advance through a training process.
Preferably, the loss function of the training process is determined based on a similarity loss of the interaction adjacency matrix and a classification loss of whether each object is a key point, wherein an element in the interaction adjacency matrix indicates whether interaction exists between any two objects.
A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the group detection method of any one of the above.
An electronic device comprising at least a computer readable storage medium and a processor;
the processor is configured to read executable instructions from the computer readable storage medium and execute the instructions to implement the group detection method of any one of the above.
According to the technical scheme of the application, the individual features of each target in the image to be detected are determined, and a first language model extracts features from externally input interaction prompt words to obtain interaction prompt word features, so that the interaction between targets can be described at the semantic level. A pre-trained interaction feature encoding model then encodes the individual features under the guidance of the interaction prompt word features, yielding interaction features that indicate the interaction relationship between any two targets. Because the encoding consults the semantic information of the prompt words, the resulting interaction features reflect interaction relationships consistent with the interaction prompt words, improving both the accuracy of the interaction features and that of the relationships they represent. Finally, group division and recognition are performed on these more accurate interaction features, which effectively improves the accuracy of group detection.
Drawings
FIG. 1 is a schematic diagram of the basic flow of the group detection method provided by the present application;
FIG. 2 is a schematic flow chart of the group detection method according to the first embodiment of the present application;
FIG. 3 is a schematic diagram of the group detection method according to the first embodiment of the present application;
FIG. 4 is a schematic diagram of the group detection method according to the second embodiment of the present application;
FIG. 5 is a schematic diagram of the group detection method according to the second embodiment of the present application;
FIG. 6 is a schematic diagram of the edges of the graph network model in the second embodiment of the present application;
FIG. 7 is a schematic diagram of the nodes and their interrelationships in the graph network model of the second embodiment of the present application;
FIG. 8 is a schematic diagram of the basic structure of the group detection apparatus provided by the present application;
FIG. 9 is a schematic diagram of the basic structure of the electronic device provided by the present application.
Detailed Description
The present application will be described in further detail with reference to the accompanying drawings, in order to make the objects, technical means and advantages of the present application more apparent.
In the group detection method of the present application, interaction prompt words are introduced to represent the semantic features of interaction relationships between different targets in a group, and these prompt words guide the acquisition of interaction features. This keeps group detection from being limited to specific scenes, removes the complexity of manual parameter tuning, lets the group division result meet user requirements, and effectively improves the accuracy of group detection.
FIG. 1 is a schematic diagram of the basic flow of the group detection method provided by the present application. As shown in FIG. 1, the method includes:
Step 101, determining individual characteristics of each target in the image to be detected.
The individual features of a target include, but are not limited to, speed, acceleration, posture, heading, and the like.
The individual features may be acquired based on the current single image frame or on a current group of image frames. Determining individual features may take various forms and is typically done with a neural network model.
Step 102, extracting features of externally input interaction prompt words by using a first language model to obtain interaction prompt word features.
In order to acquire more accurate and user-demand-conforming features for group detection, interactive prompt words can be input externally to describe interesting interactive information. In order to introduce semantic information of the interaction prompt words into group detection, the step utilizes a first language model to conduct feature extraction on the interaction prompt words to obtain interaction prompt word features, so that semantic features of interaction relations among different targets can be obtained, individual features are encoded based on guidance of the interaction prompt word features in subsequent processing to obtain the interaction features, and the interaction features are enabled to accord with semantic description of the interaction prompt words.
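As an illustration of this step, the sketch below mimics a language-model text encoder with a toy stand-in: each token of the prompt is mapped to a deterministic pseudo-random embedding and the prompt feature is the mean over tokens. All names here are hypothetical; a real system would call a pretrained language model.

```python
import zlib
import numpy as np

def embed_prompt(prompt: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a language-model text encoder: each token gets a
    deterministic pseudo-random embedding (seeded by a CRC32 of the token),
    and the prompt feature is the mean over token embeddings."""
    tokens = prompt.lower().split()
    vecs = []
    for tok in tokens:
        rng = np.random.default_rng(zlib.crc32(tok.encode("utf-8")))
        vecs.append(rng.standard_normal(dim))
    return np.mean(np.stack(vecs), axis=0)

# Interaction prompt word -> interaction prompt word feature
feat = embed_prompt("two people hit each other")
```

The point of the sketch is only the interface: a free-text interaction prompt in, a fixed-dimension semantic feature out, which downstream encoders can then be conditioned on.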
Step 103, coding the individual features based on the interactive prompt word features by utilizing a pre-trained interactive feature coding model to obtain interactive features for indicating the interactive relationship between any two targets in each target.
In this step, the individual features are encoded by utilizing a pre-trained interactive feature coding model, so that interactive features describing the interaction relations among different targets are obtained. The interactive feature coding model may be an existing neural network model, such as a Transformer structure, a CNN structure, a GCN structure, and the like.
In particular, a model of the Transformer structure introduces a self-attention mechanism, through which individual features can be more effectively refined and aggregated; compared with models of RNN and CNN structures, the Transformer structure performs better at recognizing interaction relations.
Meanwhile, as described above, the interactive prompt word features are determined in step 102, and in the process of encoding individual features to determine the interactive features, the step is performed based on the interactive prompt word features, so that the interactive features can be ensured to reflect the interactive relationship consistent with the interactive prompt word, and the accuracy of the interactive features is improved.
In addition, in order to further improve the accuracy of the individual features to the target description and further improve the accuracy of the interaction features, optionally, an externally input individual prompt word can be introduced, the individual features are guided to be encoded by the individual prompt word before the individual features are input into the interaction feature encoding model, and then the encoding result is used as the updated individual features to be input into the interaction feature encoding model. Meanwhile, in order to conveniently encode the individual features by using the individual prompt words, optionally, the trained second language model can be used for extracting the features of the individual prompt words to obtain the features of the individual prompt words, and then the encoding of the individual features is guided based on the features of the individual prompt words.
Step 104, carrying out group division and identification based on the interaction features.
Through the processing, the interactive features meeting the requirements of the interactive prompt words are obtained, and finally, the group division is carried out according to the interactive features, and the group category is identified.
Most basically, the classification category of the group can be determined based on the interaction features, for example by inputting the interaction features into a classification network and determining the corresponding group classification. Meanwhile, since the interaction features reflect the interaction relation between any two targets, the probability of an interaction relation between any two targets can be determined from the interaction features, and any two targets whose probability is greater than or equal to the interaction threshold are taken as a pair of targets with an interaction relationship. Group division is then performed based on all determined interaction relations, targets with interaction relations are divided into the same group, and the divided groups are assigned the determined group categories.
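The thresholding-and-division logic described above can be sketched as follows. This is a minimal illustration assuming a symmetric K×K interaction-probability matrix for one sample; the union-find grouping is one possible implementation of "divide linked targets into the same group", not mandated by the text.

```python
import numpy as np

def divide_groups(p: np.ndarray, thresh: float = 0.5):
    """Group K targets from a KxK interaction-probability matrix: any pair
    with probability >= thresh is linked, and linked targets are merged into
    the same group (connected components via a small union-find)."""
    k = p.shape[0]
    parent = list(range(k))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(k):
        for j in range(i + 1, k):
            if p[i, j] >= thresh:
                parent[find(i)] = find(j)  # link the two components

    groups = {}
    for i in range(k):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

probs = np.array([[1.0, 0.9, 0.1],
                  [0.9, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
# targets 0 and 1 exceed the threshold pairwise; target 2 does not
```

Each returned list is one divided group; the group category predicted for the sample is then attached to these groups.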
Thus, the group detection method in the application ends. By the method shown in the flow of fig. 1, the relationship learning among the targets is guided by using the interaction prompt words, so that the interaction relationship meeting the user requirement can be effectively determined, and the accurate group division can be performed.
On the basis of the group detection method shown in fig. 1, after the interactive feature code model is determined, the following processing about the graph network model may be further included to obtain more accurate updated interactive features, so as to further improve the accuracy of group identification, and specifically, the following processing may be newly added between step 103 and step 104:
Step 103a, constructing a graph network model, wherein nodes of the graph network model comprise targets, and interaction relations and group relations between any two targets.
The group relationships are used for representing consistency relationships among different interaction relationships. In this way, the interaction relationship among the targets and the consistency relationship among different interaction relationships can be reflected through the graph network model structure, so that the interaction effect and interaction relationship among the targets in the image and the interaction effect and transmission among the interaction relationships are reflected.
Step 103b, determining initial features of each node in the graph network model based on the individual features and the interaction features, and iteratively updating the features of each node in the graph network model to determine the output node features of each node, so that the nodes learn from each other.
A graph neural network is a deep learning model for processing graph-structured data; through a message-passing mechanism, it aggregates the neighbor information of the model's nodes and updates the feature representation of each node. In the application, the learning of features and relations among targets is realized by updating the node features in the graph neural network model.
Specifically, initial characteristics of each node in the graph network model are firstly determined based on individual characteristics and interaction characteristics, and then the characteristics of each node in the graph network model are iteratively updated to determine output node characteristics of each node so as to enable the nodes to learn each other. The output node characteristics comprise interaction characteristics updated through graph network model processing.
Step 103c, encoding the output node characteristics based on the guidance of the externally input group prompt words, and determining updated output node characteristics.
In order to guide relation learning, the application receives externally input group prompt words, encodes the output node characteristics under the guidance of the group prompt words, and determines the updated output node characteristics of various nodes.
Specifically, after the output node features are determined, a group encoder is used to encode the different output node features under the guidance of the group prompt word, obtaining updated output node features. Owing to this guidance, the encoding extracts output node features that conform to the group prompt word, so that the updated output node features meet the requirements of the group prompt word.
In addition, for convenience in guiding, the language model can be utilized in advance to conduct feature extraction on the group prompt words, so that group prompt word features are obtained, and then the encoding of the output node features is conducted based on the group prompt word features.
Further, in step 104, group division and identification are performed based on the updated interaction features. Meanwhile, in addition to group division and identification using the interaction features, optionally, the liveness of each target can be determined based on the updated individual features, and key targets can be determined based on that liveness.
The following describes a specific implementation of the application by means of two specific embodiments.
Embodiment one:
Fig. 2 is a schematic flow chart of a population detection method in the first embodiment, and fig. 3 is a schematic frame diagram of the population detection method in the first embodiment. In this embodiment, individual prompt words and interactive prompt words are introduced to guide the encoding of individual features and interactive features, and feature extraction is performed on the individual prompt words and interactive prompt words by using a language model to facilitate subsequent processing, as shown in fig. 2 and 3, a group detection method of a specific embodiment includes:
Step 201, extracting features of externally input individual prompt words and interaction prompt words by utilizing a language model to obtain individual prompt word features and interaction prompt word features.
Among them, language models include, but are not limited to, GPT, BERT, LLaMA, etc. Individual prompt words are prompt words describing a single target, such as 'person waving hand', 'person kicking leg', etc.; an interactive prompt word is a prompt word describing the interaction between any two targets, such as 'two people hitting each other'. The same language model may be used to extract features from both the individual prompt words and the interactive prompt words; the application is not limited in this respect.
Step 202, determining individual characteristics of each target in the image to be detected.
The method for acquiring the individual features can adopt various existing methods, and the application is not limited to the acquisition method.
Step 203, coding the individual features by using the individual coding model under the guidance of the individual prompt word features.
The individual coding model extracts features from the individual features again to generate updated individual features; the individual features shown in fig. 3 are the updated individual features of this step. The specific model may be a conventional single-person behavior recognition model. Forms of the individual coding model include, but are not limited to, a Transformer structure, a CNN structure, etc.; after its feature extraction structure, the individual coding model is aligned with the individual prompt word features extracted by the language model.
When the individual coding model is trained, the output updated individual features can be compared with the individual prompt word features, and a loss function is calculated for updating parameters of the individual coding model.
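One plausible form of such a comparison loss, shown purely as an assumption since the text does not fix the exact function, is one minus the cosine similarity between each updated individual feature and its individual prompt word feature, averaged over the batch:

```python
import numpy as np

def alignment_loss(ind_feats: np.ndarray, prompt_feats: np.ndarray) -> float:
    """Hypothetical alignment loss: 1 - cosine similarity between each
    updated individual feature (row of ind_feats) and its prompt word
    feature (row of prompt_feats), averaged over rows."""
    a = ind_feats / np.linalg.norm(ind_feats, axis=1, keepdims=True)
    b = prompt_feats / np.linalg.norm(prompt_feats, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))
```

When the encoded individual features point in the same direction as the prompt word features, the loss is zero; gradients of such a loss would drive the individual coding model toward the prompt semantics.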
Step 204, based on the interactive prompt word features, coding the individual features by utilizing the pre-trained interactive feature coding model.
As mentioned before, the interactive feature coding model can be an existing neural network model, such as a Transformer structure, a CNN structure, a GCN structure, and the like. In this embodiment, a model of the Transformer structure is used.
A model of the Transformer structure includes two parts: an encoder and a decoder. The encoder receives an input sequence (source), which is processed by a stack of identical layers consisting of a multi-head self-attention mechanism and a fully connected feed-forward network. The decoder then generates an output sequence (target) based on the representation produced by the encoder. The self-attention mechanism is the key factor enabling the Transformer model to reason about relationships and interactions between group behavior participants. Thus, the self-attention mechanism itself, and how the Transformer model applies to the group activity recognition task, are described further below.
Attention A is a function representing a weighted sum of the "values V". The weights are computed by matching the "query Q" against a set of "keys K". The matching function may take different forms, the most popular being the scaled dot product. Formally, attention with a scaled dot-product matching function can be written as:

A(Q, K, V) = softmax(QKᵀ / √d) V

where d is the dimension of the "query Q" and "key K". In the self-attention module, all three representations (Q, K, V) are computed from the input sequence S by linear projection, so A(S) = A(Q(S), K(S), V(S)).
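The scaled dot-product attention just described can be written directly in code. This is a minimal numpy sketch of A(Q, K, V) = softmax(QKᵀ/√d)V, without the linear projections or multiple heads:

```python
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray,
                                 V: np.ndarray) -> np.ndarray:
    """A(Q, K, V) = softmax(Q K^T / sqrt(d)) V for 2-D Q, K, V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # rows sum to 1
    return w @ V
```

Each output row is a convex combination of the value rows, weighted by how well the corresponding query matches each key; this is the "weighted sum of values" the text refers to.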
Since attention is a weighted sum of all values, it overcomes the problem of forgetting over time. In sequence-to-sequence modeling, this mechanism places more emphasis on the most relevant words in the source sequence. This is also an ideal feature for group activity identification, as the information of each participant's features can be enhanced based on the other participants in the scene without any spatial constraints. Multi-head attention is an extension of attention in which several parallel attention functions use independent linear projections of (Q, K, V):

MultiHead(Q, K, V) = Concat(head₁, …, head_h) Wᴼ, where head_i = A(QW_iᵠ, KW_iᴷ, VW_iⱽ)
the encoder layer E of the Transform model includes a multi-head attention in combination with a feedforward neural network L:
the encoder E may comprise several such layers that process the input feature sequence S sequentially.
For the group behavior recognition task, the input feature sequence of the model is a set S = {s_i | i = 1, ..., N} composed of individual feature representations (multiple branches in parallel, one type of feature per encoder). Since the features s_i do not follow any particular order, the self-attention mechanism is more suitable than RNNs or CNNs for refining and aggregating these features. Compared with the graph-network-based interactive feature modeling of the second embodiment, the interactive feature coding model in this step relies on the self-attention mechanism and does not need to explicitly model the connections between nodes through appearance and position relations.
For spatial relationships between the behavior participants, the encoder may exploit them implicitly through position encoding of each individual feature s_i, for example by representing the bounding box b_i of each participant feature s_i with its center point (x_i, y_i), and encoding the center point with PE.
For PE encoding, assume an input sequence of length L and that the position of an object in the sequence is required. The position coding is given by sine and cosine functions of different frequencies as follows:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where d represents the dimension of the output embedding space; pos represents the position of an object in the input sequence, with 0 ≤ pos ≤ L/2; and i maps to a column index, with 0 ≤ i < d/2, each value of i mapping to one sine and one cosine function. In summary, as the expressions show, even embedding positions use a sine function and odd positions use a cosine function.
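The sine/cosine position encoding above can be sketched as follows, assuming an even embedding dimension d:

```python
import numpy as np

def positional_encoding(L: int, d: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d)); d assumed even."""
    pe = np.zeros((L, d))
    pos = np.arange(L)[:, None]                    # (L, 1)
    i = np.arange(0, d, 2)[None, :]                # (1, d/2), the 2i indices
    angle = pos / np.power(10000.0, i / d)         # (L, d/2)
    pe[:, 0::2] = np.sin(angle)                    # even columns: sine
    pe[:, 1::2] = np.cos(angle)                    # odd columns: cosine
    return pe
```

In this document's setting the encoded positions would come from the box center points (x_i, y_i) of the participants rather than word order, but the frequency structure is the same.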
Applying the above-described model of the Transformer structure to the interactive feature generation processing of this embodiment yields the processing of encoding the individual features with the interactive feature coding model in this step; this embodiment gives the following two implementations:
1. Performing position coding based on the position relation among the targets; fusing the position coding result, the interactive prompt word characteristics and the individual characteristics, taking the fused characteristics as the input of an interactive characteristic coding model, and carrying out coding processing by utilizing the interactive characteristic coding model;
In this implementation, on the one hand, spatial information between targets is obtained through position coding; on the other hand, the interactive semantic information between targets is expressed by the interactive prompt word features. The spatial information, semantic information and individual features are fused, and the fused features are processed by the Transformer encoder.
2. Position coding is carried out based on the position relation among the targets, a position coding result and individual characteristics are fused, and the fused characteristics are used as input of an interactive characteristic coding model to be processed; before self-attention processing is carried out in the interactive feature coding model, individual features are grouped based on the interactive prompt word features, and the self-attention processing in the interactive feature coding model is carried out based on the grouping result;
In this implementation, the interactive prompt word features are no longer fused with the individual features as encoder inputs, but rather the interactive prompt word features are grouped as a priori information, i.e., the targets are grouped, and multi-headed attention is encoded within each grouping. Thus, the characteristic information between the targets with interactive relations can be captured as much as possible, and the information interference between irrelevant targets is reduced.
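The second implementation can be illustrated as masked self-attention: the grouping derived from the interactive prompt word features restricts which targets may attend to each other. The mask construction below is an assumption for illustration; the text does not specify the exact masking mechanism.

```python
import numpy as np

def masked_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray,
                     groups: np.ndarray) -> np.ndarray:
    """Self-attention restricted within groups: target i only attends to
    targets j with groups[i] == groups[j] (groups: length-N id array)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    mask = groups[:, None] == groups[None, :]       # same-group pairs
    scores = np.where(mask, scores, -1e9)           # block cross-group attention
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

X = np.arange(12.0).reshape(3, 4)
out = masked_attention(X, X, X, np.array([0, 0, 1]))
# target 2 is alone in its group, so out[2] equals X[2]
```

This captures the idea stated above: feature information flows freely between targets grouped as interacting, while information interference from unrelated targets is suppressed.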
The interactive feature is obtained based on the interactive prompt word feature, so it can reflect an interaction relation that conforms to the semantic information given by the interactive prompt word; on the other hand, being based on a model of the Transformer structure, it can effectively model the interaction relations among different targets through the self-attention mechanism, generating interactive features that reflect those relations without explicitly modeling the connection relations among different targets.
Step 205, performing group division and identification based on the interaction characteristics determined in step 204.
The process of population partitioning and identification includes two branches: behavior classification and population classification. The embodiment is realized by a behavior reasoning module and a group detection result output module shown in fig. 3.
Specifically, the behavioral reasoning module includes two branches:
Branch one: a classification head commonly used in behavior recognition models, i.e. a linear head composed of several fully connected layers. This branch outputs an N×C matrix of class probabilities in which each element represents the probability that sample i belongs to group class j, where i is the sample index, j is the group class index, N is the number of samples, and C is the number of group classes. During training, the truth label is the actual group class of the sample. Here, the behavior reasoning module and the interactive feature coding model are jointly trained, and one sample is a group of individual features input to the interactive feature coding model at once.
Branch two: acquires the interaction relations among targets and outputs an N×K×K interaction probability map whose elements represent the probability of an interaction relation between target j and target k in sample i, where K is the number of individuals in a single sample (i.e. the number of targets). During training, the truth labels are target interaction maps (the corresponding element is 1 when an interaction relation exists between target j and target k in sample i, and 0 when it does not). Note that the value of K is predefined based on the training and test sets.
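Constructing one K×K interaction-map truth label from a list of interacting target pairs can be sketched as:

```python
import numpy as np

def interaction_map(pairs, k: int) -> np.ndarray:
    """Build a symmetric KxK truth label: entry (j, k) is 1 when the pair
    of targets interacts in this sample, 0 otherwise."""
    m = np.zeros((k, k), dtype=np.int64)
    for a, b in pairs:
        m[a, b] = 1
        m[b, a] = 1   # interaction is mutual
    return m
```

Stacking N of these per-sample maps gives the N×K×K truth tensor against which branch two's probability output is trained.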
The group category corresponding to the current sample is obtained through the processing of the behavior reasoning module, and the interaction probability among all targets in the current sample is obtained.
In the group detection result output module, group division is performed based on the interaction probability map output by the behavior reasoning module. Specifically, for each sample, any two targets whose interaction probability is greater than or equal to the interaction threshold are taken as a pair of targets with an interaction relationship; group division is performed based on all determined interaction relationships, and targets with interaction relationships are divided into the same group, finally obtaining the divided groups for each sample. The group detection result output module also determines the group class of each sample based on the class probabilities output by the behavior reasoning module.
As described above, the group detection result output module obtains, for each sample, both the group division result and the group class. The two are combined into the group identification result: for any sample A with determined group class B, B is output as the group class of the groups divided from sample A. That is, for a sample consisting of multiple individual features, the module outputs the divided groups and their group class.
Thus, the flow of the method in the first embodiment is ended.
Embodiment two:
Fig. 4 is a schematic flow chart of a population detection method in the second embodiment, and fig. 5 is a schematic frame diagram of the population detection method in the second embodiment. In this embodiment, individual prompt words, interaction prompt words and group prompt words are introduced to guide the encoding of individual features, interaction features and group features, and feature extraction is performed on the individual prompt words, interaction prompt words and group prompt words by using a language model so as to facilitate subsequent processing, as shown in fig. 4 and 5, a group detection method of a specific embodiment includes:
Step 401, extracting features of externally input individual prompt words, interaction prompt words and group prompt words by utilizing a language model to obtain individual prompt word features, interaction prompt word features and group prompt word features.
Among them, language models include, but are not limited to, GPT, BERT, LLaMA, etc. Individual prompt words are prompt words describing a single target, such as 'person waving hand', 'person kicking leg', etc.; an interactive prompt word is a prompt word describing the interaction between any two targets, such as 'two people hitting each other'; a group prompt word is a prompt word describing the panorama of the picture, such as 'in a square scene, two people are fighting in the lower right corner of the picture', etc.
Step 402, determining individual characteristics of each target in the image to be measured.
The method for acquiring the individual features can adopt various existing methods, and the application is not limited to the acquisition method.
Step 403, under the guidance of the individual prompt word characteristics, encoding the individual characteristics by using an individual encoder; and under the guidance of the interactive prompt word characteristics, the individual characteristics are encoded by using an interactive encoder.
The individual encoder extracts features from the individual features again to generate updated individual features. The interactive encoder encodes the individual features to generate interactive features representing the interrelationships among the targets. Forms of the individual encoder and the interactive encoder include, but are not limited to, a Transformer structure, a CNN structure, etc.; in this embodiment, after the feature extraction structure, the individual encoder and the interactive encoder are aligned with the individual prompt word features and interactive prompt word features extracted by the language model, respectively.
When the individual encoder is trained, the output updated individual characteristics can be compared with the individual prompt word characteristics, and a loss function is calculated for updating parameters of the individual encoder; when the interactive feature encoder is trained, the output interactive features can be compared with interactive prompt word features, and a loss function is calculated for updating parameters of the interactive encoder.
The interactive encoder may determine the interactive features using the interactive feature coding model of the Transformer structure described in the first embodiment, or may simply use a model of a CNN or RNN structure to fuse and encode the individual features. When the interactive feature coding model of the Transformer structure described in the first embodiment is used to determine the interactive features, the system need not include the behavior reasoning module and the group detection result output module of fig. 3 in practical application; those two modules are used for training the interactive feature coding model.
Step 404, constructing a graph network model.
In this embodiment, interaction graph modeling is performed, so that graph elements such as edges and nodes are used to represent individual and interaction features; the modeling modes include, but are not limited to, single-graph modeling of a single interaction relationship of the targets and multi-graph modeling of multiple interaction relationships.
In this embodiment, a factor graph is used as the example graph network model. The structure of the factor graph includes nodes and edges, and the nodes are divided into two types: conventional (regular) nodes and factor nodes. The conventional nodes comprise two sets: a set of target nodes, representing the individual targets, and a set of interaction nodes, representing the interaction information between targets; an interaction node may be represented by the interaction feature between the two targets concerned.
Factor nodes are divided into two types. The first type of factor node is strongly related to the interaction relation between two targets: it represents the association between individual features and their interaction relation, i.e. the interaction consistency between the two targets, and is used for pairwise interaction and individual feature learning.
The second type of factor node is based on higher-order triples and is used for learning the consistency of interaction relations within a triple (here the elements are pairs of targets), in more detail, the consistency of the pairwise interactions among three targets. In other words, it learns the transitivity of interactions (e.g. behavioral interactions): if target i interacts with target j, and target j interacts with target k, the interaction relation between i and k should be affected through the transitivity of j.
From the above, the composition of both types of factor nodes is a triplet: the triplet of a first-type factor node comprises two targets and the interaction relation between them, while the triplet of a second-type factor node comprises the pairwise interactions among three targets.
Finally, an edge is connected between each factor node and each element of its triplet; meanwhile, each interaction node is connected with the two targets involved in its interaction relationship, as shown in fig. 6.
More intuitively, taking behavioral interaction between individuals as an example and referring to FIG. 7, circles represent the two types of conventional nodes: green circles are action variables representing each person's action category, and orange circles are interaction variables representing whether an interaction exists between two persons. Squares represent the two types of factor nodes: blue squares represent the consistency between behaviors and interactions, and yellow squares represent the transitivity of interactions.
The foregoing is a graph network model in this embodiment, and of course, probability graph network models of other structures, such as bayesian graphs, markov graphs, and the like, may also be selected in practical applications.
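A purely structural sketch of the factor graph just described, enumerating target nodes, pairwise interaction nodes, the two factor-node types, and the connecting edges. The node naming scheme is hypothetical and only serves to make the topology concrete.

```python
from itertools import combinations

def build_factor_graph(num_targets: int):
    """Enumerate the elements of the factor graph described above:
    - target nodes, one per target
    - interaction nodes, one per unordered target pair
    - first-type factor nodes: (target i, target j, interaction ij)
    - second-type factor nodes: the three pairwise interactions of each
      target triple, encoding transitivity of interaction
    - edges from each factor node to each member of its triplet."""
    targets = list(range(num_targets))
    inter = {frozenset(p): f"r{min(p)}{max(p)}"
             for p in combinations(targets, 2)}
    f1 = [(i, j, inter[frozenset((i, j))])
          for i, j in combinations(targets, 2)]
    f2 = [tuple(inter[frozenset(p)] for p in combinations(t, 2))
          for t in combinations(targets, 3)]
    edges = []
    for fac in f1 + f2:
        edges += [(fac, member) for member in fac]
    return targets, inter, f1, f2, edges
```

For three targets this yields three interaction nodes, three first-type factors, one second-type factor, and twelve factor-to-node edges, matching the triplet structure and FIG. 6.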
Step 405, determining initial features of each node in the graph network model based on the individual features and the interaction features.
The initial characteristics of each node in the graph network model are the node features of the initial graph. For a target node, the initial feature is the updated individual feature f_i of the corresponding target determined in step 403; for an interaction node, the initial feature is the updated interaction feature f_ij of the corresponding interaction relation determined in step 403.
The initial feature of a first-type factor node is formed from the features of its triplet, i.e. the two target features and their interaction feature, for example the concatenation [f_i, f_j, f_ij].
The initial feature of a second-type factor node is likewise formed from its triplet, i.e. the three pairwise interaction features, for example [f_ij, f_jk, f_ik].
An edge feature is defined for each edge between a factor node and a regular node; the edge features are used in the subsequent updating of the node features.
Step 406, iteratively updating the features of each node in the graph network model and determining the output node features of each node.
Mutual learning among the nodes is realized through updating the node features in the graph network model. The node feature update may be performed by a feature extraction structure, which is used to obtain a better node feature representation. In this embodiment, an MLP (multi-layer perceptron) network is taken as the example feature extraction structure. Assume the factor graph neural network (FGNN) corresponding to the graph network model has L layers with the same feature extraction structure; node features are updated once in each layer, and the node features of the current layer serve as input to the next layer's feature extraction structure. The factor node features at layer l+1 are:

g_c^(l+1) = max_{i ∈ N(c)} M(h_i^(l); θ_M) ⊙ Q(e_{c,i}; θ_Q)

where M and Q are MLP networks, θ_M and θ_Q are the parameters of the networks, max denotes the max-pooling operation, and N(c) denotes factor node c together with all regular nodes connected to it; the factor node may be of the first or the second type.
The conventional node features at layer l+1 are:

h_i^(l+1) = max_{c ∈ N(i)} M(g_c^(l); φ_M) ⊙ Q(e_{i,c}; φ_Q)

where N(i) denotes regular node i together with all factor nodes connected to it; the regular node may be a target node or an interaction node.
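A toy version of one such node update, assuming single-layer ReLU MLPs for M and Q and elementwise modulation of node messages by edge messages before max pooling over neighbors. The exact combination is an assumption; only the "MLPs plus max pooling over connected nodes" structure comes from the text.

```python
import numpy as np

def update_factor_node(neighbor_feats, edge_feats,
                       W_m: np.ndarray, W_q: np.ndarray) -> np.ndarray:
    """One FGNN-style update sketch: each neighbor's feature passes through
    MLP M, each connecting edge's feature through MLP Q, the two messages
    are combined elementwise, and the result is max-pooled over neighbors."""
    msgs = []
    for h, e in zip(neighbor_feats, edge_feats):
        m = np.maximum(W_m @ h, 0)   # M: one ReLU layer on the node feature
        q = np.maximum(W_q @ e, 0)   # Q: one ReLU layer on the edge feature
        msgs.append(m * q)           # edge-modulated message
    return np.max(np.stack(msgs), axis=0)  # max pooling over neighbors

rng = np.random.default_rng(0)
h = [rng.standard_normal(4) for _ in range(2)]   # two neighbor node features
e = [rng.standard_normal(2) for _ in range(2)]   # the two edge features
W_m = rng.standard_normal((8, 4))
W_q = rng.standard_normal((8, 2))
out = update_factor_node(h, e, W_m, W_q)
```

Stacking L such updates, with each layer's output feeding the next, realizes the iterative mutual learning of node features described in this step.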
Learning of the interaction relationship can be achieved through steps 405 and 406, and the graph network model output node characteristics of the nodes are obtained.
Step 407, respectively encoding the output node characteristics based on the guidance of the group prompt word characteristics, and determining updated output node characteristics.
In this step, the output node features and the group prompt word features are input together into a trained group interaction encoder, which processes them and outputs the updated output node features. The group prompt word features guide the encoding process so that the output node features are consistent with them.
The group interaction encoder may take forms including, but not limited to, a Transformer structure or a CNN structure; within it, the feature extraction structure is aligned with the group prompt word features extracted by the language model.
When the group interaction encoder is trained, the output node features are compared with the group prompt word features and a loss function is calculated to update the parameters of the encoder, so that it learns to output node features consistent with the group prompt word features.
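A minimal sketch of such an alignment objective, assuming a cosine-similarity comparison between node features and the prompt-word feature (the text does not specify the exact loss form, so this is an assumption):

```python
import numpy as np

def cosine_alignment_loss(node_feats, prompt_feat):
    """Mean (1 - cosine similarity) between each output node feature
    and the group prompt word feature; minimizing it pulls the encoder
    outputs toward the prompt feature."""
    p = prompt_feat / np.linalg.norm(prompt_feat)
    sims = [float(f @ p) / np.linalg.norm(f) for f in node_feats]
    return 1.0 - float(np.mean(sims))

prompt = np.array([1.0, 0.0])                      # hypothetical prompt feature
aligned = [np.array([2.0, 0.0]), np.array([3.0, 0.0])]
orthogonal = [np.array([0.0, 1.0])]
loss_aligned = cosine_alignment_loss(aligned, prompt)   # near 0: well aligned
loss_orth = cosine_alignment_loss(orthogonal, prompt)   # near 1: misaligned
```

In training, this loss (or whichever comparison the encoder actually uses) would be backpropagated to update the group interaction encoder's parameters.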
This step mainly processes the output node features of the regular nodes, that is, the output node features of the target nodes and the interaction nodes.
The processing in steps 405 to 407 corresponds to the group interaction learning module in fig. 5: from the input individual features and interaction features, it produces the updated output node features of each node, which reflect both the interaction features between pairs of persons and the consistency features among different interactions.
Step 408, determining the probability of interaction relation between any two targets and the liveness of each target based on the updated output node characteristics, and outputting the key targets.
In this embodiment, the group division and identification may be performed in the manner of step 205, where the updated output node characteristics of the interaction node determined in step 407 are equivalent to the interaction characteristics in the flow shown in fig. 2.
In addition, in this embodiment, besides the group detection result (that is, the group division result), the key targets of the group may also be output. The group key output may take different forms: key targets, group key regions, and heat maps (i.e., the liveness information of individual targets).
The liveness of a target is the probability that the target is a key node. A target whose liveness exceeds a set threshold is output as a key target.
Specifically, the liveness of a target may be determined from the updated output node feature of its target node determined in step 407, as follows:

s_i = α(f_i^out)

where α is a linear function that computes a classification score, s_i indicates the probability that the individual target is a key target, and f_i^out is the updated output node feature of the target node.
In more detail, to perform group division, it is further necessary to determine the probability of an interaction relationship between any two targets from the updated output node features of the interaction nodes determined in step 407:

p_ij = β(f_ij^out)

where β is a linear function that computes a classification score for the interaction node, p_ij indicates the probability that an interaction relationship exists between the two targets, and f_ij^out is the updated output node feature of the interaction node.
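Both classification heads can be sketched as linear scores mapped to probabilities. The sigmoid squashing and the example weights are assumptions; the text only states that α and β are linear functions producing classification scores:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def liveness(f_i, w_alpha, b_alpha=0.0):
    """alpha: linear classification score on a target node's updated
    output feature, mapped to the probability of being a key node."""
    return sigmoid(float(w_alpha @ f_i) + b_alpha)

def interaction_prob(f_ij, w_beta, b_beta=0.0):
    """beta: linear classification score on an interaction node's updated
    output feature, mapped to the probability that two targets interact."""
    return sigmoid(float(w_beta @ f_ij) + b_beta)

# Hypothetical learned weights and node features.
w_a = np.array([1.0, -1.0])
w_b = np.array([0.5, 0.5])
p_key = liveness(np.array([3.0, 1.0]), w_a)            # score 2.0
p_int = interaction_prob(np.array([-2.0, -2.0]), w_b)  # score -2.0
```

A target is then output as a key target when its liveness exceeds the set threshold, and a pair of targets is treated as interacting when p_int meets the interaction threshold.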
In step 409, any two targets with probability greater than or equal to the interaction threshold are used as a pair of targets with interaction relationship, the targets with interaction relationship are divided into the same group, and the group category is determined.
As in the first embodiment, whether two targets interact is determined from the probability of an interaction between them; during group division, targets with mutual interaction relationships are placed in the same group. Classification is then performed on the updated output node features of the interaction nodes to determine the group category of the current output, and the determined category is taken as the category information of the divided group.
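The division rule (threshold the pairwise interaction probabilities, then put mutually interacting targets in the same group) amounts to finding connected components, which can be sketched with a small union-find structure; the example probabilities are hypothetical:

```python
def divide_groups(n_targets, pair_probs, threshold=0.5):
    """Union-find over target indices: any pair whose interaction
    probability meets the threshold is merged into one group."""
    parent = list(range(n_targets))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (i, j), p in pair_probs.items():
        if p >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n_targets):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Targets 0-1 and 2-3 interact; target 4 is a surrounding bystander.
probs = {(0, 1): 0.9, (2, 3): 0.8, (1, 2): 0.1, (3, 4): 0.2}
result = divide_groups(5, probs)  # -> [[0, 1], [2, 3], [4]]
```

Each resulting component is one detected group; its category would come from classifying the corresponding interaction node features.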
This completes the method flow shown in fig. 4. The processing of steps 403 to 409 involves several neural networks and parameters that must be determined in advance: the individual encoder, the interaction encoder, the feature extraction structures in the graph network, the function computing the inter-target interaction probability for group division, and the function computing the probability that a target is a key node. To determine these networks and parameters, the whole processing of steps 403 to 409 is trained in advance. In training, the training data is annotated with the group ground truth and the key targets within each group, and the group ground truth is converted into ground-truth interaction relationships among the targets in the group: factor nodes and interaction nodes between targets of the same group are labeled 1, and all other factor nodes and interaction nodes are labeled 0. The training loss is a similarity loss on the interaction adjacency matrix, and may further include a classification loss (such as cross-entropy) on whether each target is a key point. An element of the interaction adjacency matrix indicates whether an interaction exists between the corresponding pair of targets; for example, if targets I and J interact, the element at the corresponding position is 1. Through joint training of the whole of steps 403 to 409, the encoders, the feature extraction structures, and the group division cooperate, performing reasonable relation learning and group division under the guidance of the individual, interaction, and group prompt words.
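The joint objective above can be sketched as follows; the concrete similarity loss on the adjacency matrix is not specified in the text, so mean squared error is used here as an assumption, with binary cross-entropy for the key-point classification loss:

```python
import numpy as np

def joint_loss(pred_adj, true_adj, key_logits, key_labels):
    """Adjacency-matrix similarity loss (MSE, assumed) plus binary
    cross-entropy on per-target key-point classification."""
    adj_loss = float(np.mean((pred_adj - true_adj) ** 2))
    p = 1.0 / (1.0 + np.exp(-np.asarray(key_logits, dtype=float)))
    eps = 1e-9
    ce = float(-np.mean(key_labels * np.log(p + eps)
                        + (1 - key_labels) * np.log(1.0 - p + eps)))
    return adj_loss + ce

# Ground-truth adjacency: targets 0 and 1 interact; target 0 is the key.
true_adj = np.array([[0.0, 1.0], [1.0, 0.0]])
loss_good = joint_loss(true_adj, true_adj,
                       key_logits=np.array([10.0, -10.0]),
                       key_labels=np.array([1.0, 0.0]))
```

A perfect prediction drives both terms toward zero, which is what the joint training of steps 403 to 409 optimizes.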
The above is a specific implementation of the group detection method in the second embodiment. The method performs reasonable group division using the pre-trained encoders, graph network model, feature extraction structures, and group division scheme. Meanwhile, multi-level prompt words are introduced into interaction learning for multi-modal interaction relation learning; by guiding the feature learning of individuals, interactions, and groups, the method alleviates the problems of single features and complex clustering-parameter tuning found in common methods, distinguishes interaction relationships with similar features, excludes surrounding bystanders, and improves the accuracy of group detection.
The application also provides a group detection device, which can be used to implement the group detection method described above. Fig. 8 is a schematic diagram of the basic structure of the group detection device of the present application. As shown in fig. 8, the device comprises: a feature determining unit, a first language model processing unit, an interaction feature encoder, and a group dividing unit.
The characteristic determining unit is used for determining individual characteristics of each target in the image to be detected;
the first language model processing unit is used for extracting characteristics of externally input interaction prompt words by utilizing the first language model to obtain interaction prompt word characteristics;
The interactive feature encoder is used for encoding the individual features based on the interactive prompt word features by utilizing a pre-trained interactive feature encoding model to obtain interactive features for indicating the interactive relationship between any two targets in each target;
And the group dividing unit is used for dividing and identifying the groups based on the interaction characteristics.
The interactive feature coding model is a Transformer model;
in the interactive feature encoder, the operation of encoding the individual feature based on the interactive prompt word feature may specifically include:
position coding is carried out based on the position relation among the targets, a position coding result, interactive prompt word characteristics and individual characteristics are fused, the fused characteristics are used as input of an interactive characteristic coding model, and coding processing is carried out by utilizing the interactive characteristic coding model; or alternatively
Position coding is carried out based on the position relation among the targets, the position coding result and the individual features are fused, and the fused features are processed as the input of the interactive feature coding model; before self-attention processing is carried out in the interactive feature coding model, the individual features are grouped based on the interactive prompt word features, and the self-attention processing in the interactive feature coding model is carried out based on the grouping result;
and/or,
The device can further comprise a graph network processing unit and a group coding unit;
A graph network processing unit for constructing a graph network model; the method is also used for determining initial characteristics of each node in the graph network model based on the individual characteristics and the interaction characteristics, carrying out iterative updating on the characteristics of each node in the graph network model, and determining the output node characteristics of each node so as to enable the nodes to learn each other; the nodes of the graph network model comprise targets, interaction relations between any two targets and group relations, wherein the group relations are used for representing consistency relations among different interaction relations;
the group coding unit is used for coding the output node characteristics based on the guidance of externally input group prompt words and determining updated output node characteristics;
And the group dividing unit is used for dividing and identifying the groups based on the updated interaction characteristics included in the updated output node characteristics.
Optionally, when the interactive feature coding model is a Transformer model, in the group dividing unit, the processing for performing group division and identification based on the interaction features may specifically include:
Inputting interaction characteristics into a pre-trained behavior inference model, classifying the interaction characteristics by using the behavior inference model, predicting to obtain a group classification result, predicting the probability of interaction relation between any two targets based on the interaction characteristics, and taking any two targets with the probability larger than or equal to an interaction threshold as a pair of targets with interaction relation;
and carrying out group division based on all the determined interaction relations, dividing targets with interaction relations into the same group, and taking a group classification result as a classification recognition result of the divided group.
Optionally, the device further includes a second language model processing unit, configured to perform feature extraction on the individual prompt word by using the second language model, so as to obtain an individual prompt word feature;
The device further comprises an individual feature encoder, wherein the individual feature encoder is used for encoding the individual features based on the guidance of the individual prompt word features to obtain updated individual features;
When the interaction feature encoder uses the interactive feature coding model to encode the individual features, the processing is performed based on the updated individual features;
When determining initial characteristics of each node in the graph network model in the graph network processing unit, the method is performed based on the updated individual characteristics.
Optionally, when training the individual feature encoder, comparing the updated individual feature of the encoded output with the individual prompt word feature in similarity for determining a loss function of the individual feature encoder;
When the interactive feature encoder is trained, the updated interactive features output by the encoding are compared with the interactive prompt word features in similarity for determining the loss function of the interactive feature encoder.
Optionally, the graph network model is a factor graph model;
the nodes of the factor graph comprise conventional nodes and factor nodes;
the conventional nodes comprise target nodes for representing targets and interaction nodes for representing interaction relations between any two targets;
The factor nodes comprise first-class factor nodes and second-class factor nodes; a factor node corresponds to a triplet of three regular nodes. For a first-class factor node, the triplet comprises any two targets and the interaction relationship between them; for a second-class factor node, the triplet comprises the pairwise interaction relationships among any three targets;
Setting edges between the factor nodes and conventional nodes included by the triplets, and setting edges between the interaction nodes and two designated target nodes; the designated target node is a target node of which the interaction relationship represented by the interaction node corresponds to two targets.
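The node and edge construction described above can be sketched by enumeration; the tuple-based node identifiers ("F1"/"F2" tags, pair and triple keys) are purely illustrative:

```python
from itertools import combinations

def build_factor_graph(n_targets):
    """Enumerate the factor graph for n targets: target nodes, pairwise
    interaction nodes, first-class factor nodes (two targets plus their
    interaction) and second-class factor nodes (the three interactions
    among any three targets), with the edges described in the text."""
    targets = list(range(n_targets))
    interactions = list(combinations(targets, 2))
    # Each interaction node is connected to its two designated target nodes.
    edges = [((i, j), i) for (i, j) in interactions]
    edges += [((i, j), j) for (i, j) in interactions]
    first_factors, second_factors = [], []
    for (i, j) in interactions:              # first-class triplets
        c = ("F1", i, j)
        first_factors.append(c)
        edges += [(c, i), (c, j), (c, (i, j))]
    for (i, j, k) in combinations(targets, 3):   # second-class triplets
        c = ("F2", i, j, k)
        second_factors.append(c)
        edges += [(c, (i, j)), (c, (i, k)), (c, (j, k))]
    return targets, interactions, first_factors, second_factors, edges

t, inter, f1, f2, e = build_factor_graph(3)
```

For three targets this yields 3 interaction nodes, 3 first-class and 1 second-class factor node, and 18 edges in total.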
Optionally, in the graph network processing unit, determining initial features of each node in the graph network model based on the individual features and the interaction features includes:
The initial characteristics of the target node are the individual characteristics of the target corresponding to the node;
the initial characteristics of the interaction nodes are interaction characteristics of the corresponding interaction relations of the nodes;
the initial characteristics of a factor node are the average initial characteristics of three nodes included in the triplet of factor nodes.
Optionally, in the graph network processing unit, when the feature of each node in the graph network model is iteratively updated,
Determining the characteristics of corresponding factor nodes after the current iteration based on the factor node characteristics determined after the previous iteration of each factor node and the node characteristics determined after the previous iteration of each conventional node with edges between the factor node and the factor node;
and determining the characteristics of the corresponding regular nodes after the current iteration based on the node characteristics determined after the previous iteration of each regular node and the node characteristics determined after the previous iteration of each factor node with edges between the factor nodes and the regular nodes.
Optionally, when the apparatus further includes a graph network processing unit and a group coding unit, the group dividing unit is further configured to determine liveness of each target based on the updated output node characteristics of the target nodes, and determine a key target based on the liveness.
Optionally, the device further comprises a joint training unit, which is used for determining parameters for updating the characteristics of the graph network model, parameters for encoding the characteristics of the output nodes and parameters for dividing the group in the group detection model in advance through a training process.
Optionally, the loss function of the training process is determined based on a similarity loss of the interaction adjacency matrix and a classification loss of whether each object is a keypoint, wherein an element in the interaction adjacency matrix indicates whether there is an interaction between any two objects.
The present application also provides a computer-readable storage medium storing instructions which, when executed by a processor, perform the steps of the group detection method described above. In practice, the computer-readable medium may be included in the apparatus/device/system of the above embodiments, or may exist separately without being incorporated into that apparatus/device/system.
According to an embodiment of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, including, for example but not limited to: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing, without limiting the scope of the application. In the disclosed embodiments, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Fig. 9 is a schematic diagram of an electronic device according to the present application. As shown in fig. 9, a schematic structural diagram of an electronic device according to an embodiment of the present application is shown, specifically:
The electronic device can include a processor 901 of one or more processing cores, a memory 902 of one or more computer-readable storage media, and a computer program stored on the memory and executable on the processor. When the program of the memory 902 is executed, a group detection method can be implemented.
Specifically, in practical applications, the electronic device may further include a power supply 903, an input/output unit 904, and other components. It will be appreciated by those skilled in the art that the structure of the electronic device shown in fig. 9 is not limiting of the electronic device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components. Wherein:
The processor 901 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of a server and processes data by running or executing software programs and/or modules stored in the memory 902 and calling data stored in the memory 902, thereby performing overall monitoring of the electronic device.
The memory 902 may be used to store software programs and modules, i.e., the computer-readable storage media described above. The processor 901 executes various functional applications and data processing by executing software programs and modules stored in the memory 902. The memory 902 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, application programs required for at least one function, and the like; the storage data area may store data created according to the use of the server, etc. In addition, the memory 902 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 902 may also include a memory controller to provide access to the memory 902 by the processor 901.
The electronic device further comprises a power supply 903 for supplying power to the respective components, and may be logically connected to the processor 901 through a power management system, so that functions of managing charging, discharging, power consumption management, etc. are implemented through the power management system. The power supply 903 may also include one or more of any components, such as a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may also include an input/output unit 904, which may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, or optical signal inputs related to user settings and function control. The input/output unit 904 may also be used to display information entered by or provided to the user, as well as various graphical user interfaces, which may be composed of graphics, text, icons, video, and any combination thereof.
The foregoing describes preferred embodiments of the invention and is not intended to limit it; any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (18)
1. A population detection method, comprising:
Determining individual characteristics of each target in the image to be detected;
extracting features of externally input interaction prompt words by using a first language model to obtain interaction prompt word features;
Coding the individual features based on the interaction prompt word features by utilizing a pre-trained interaction feature coding model to obtain interaction features for indicating interaction relations between any two targets in the targets;
Group division and identification are carried out based on the interaction characteristics;
when the interactive feature coding model is a Transformer model, the coding processing of the individual features based on the interactive prompt word features includes:
Performing position coding based on the position relation among the targets, fusing a position coding result, the interaction prompt word characteristics and the individual characteristics, taking the fused characteristics as input of the interaction characteristic coding model, and performing coding processing by using the interaction characteristic coding model; or alternatively
Performing position coding based on the position relation among the targets, fusing a position coding result and the individual features, and processing the fused features as the input of the interactive feature coding model; before self-attention processing is performed in the interactive feature coding model, grouping the individual features based on the interactive prompt word features, wherein the self-attention processing in the interactive feature coding model is performed based on the grouping result;
and/or,
After the interactive features are obtained and before the group division and the identification are performed based on the interactive features, the method further comprises the following steps:
Constructing a graph network model; the nodes of the graph network model comprise the targets, interaction relations among any two targets and group relations, wherein the group relations are used for representing consistency relations among different interaction relations;
Determining initial characteristics of each node in the graph network model based on the individual characteristics and the interaction characteristics, carrying out iterative updating on the characteristics of each node in the graph network model, and determining output node characteristics of each node so as to enable the nodes to learn each other;
encoding the output node characteristics based on the guidance of externally input group prompt words, and determining updated output node characteristics;
the group division and identification based on the interaction characteristics comprises the following steps: and carrying out group division and identification based on the updated interaction characteristics included in the updated output node characteristics.
2. The method of claim 1, wherein the grouping and identifying based on the interaction characteristics comprises:
inputting the interaction characteristics into a pre-trained behavior inference model, classifying the interaction characteristics by using the behavior inference model, predicting to obtain a group classification result, and predicting the probability of interaction relationship between any two targets based on the interaction characteristics;
taking any two targets with the probability larger than or equal to the interaction threshold value as a pair of targets with interaction relationship;
and carrying out group division based on all the determined interaction relations, dividing targets with interaction relations into the same group, and taking the group classification result as a classification recognition result of the divided group.
3. The method according to claim 1, characterized in that the method further comprises: receiving an externally input individual prompt word, and extracting characteristics of the individual prompt word by utilizing a second language model to obtain characteristics of the individual prompt word;
After acquiring the individual features, the method further comprises: based on the guidance of the individual prompt word characteristics, the trained individual characteristic encoder is utilized to encode the individual characteristics, and updated individual characteristics are obtained;
When the interactive feature encoder is used for encoding the individual features, the individual features are processed based on the updated individual features;
and when the initial characteristics of each node in the graph network model are determined, performing based on the updated individual characteristics.
4. A method according to claim 3, characterized in that the method further comprises:
And when the individual feature encoder is trained, comparing the updated individual features of the encoded output with the individual prompt word features in similarity for determining a loss function of the individual feature encoder.
5. The method of claim 1, wherein the graph network model is a factor graph model;
the nodes of the factor graph comprise conventional nodes and factor nodes;
the conventional nodes comprise target nodes for representing targets and interaction nodes for representing interaction relations between any two targets;
The factor nodes comprise a first-class factor node and a second-class factor node; a factor node corresponds to a triplet composed of three conventional nodes. For the first-class factor node, the triplet comprises any two targets and the interaction relationship between them; for the second-class factor node, the triplet comprises the pairwise interaction relationships among any three targets;
Setting edges between the factor nodes and conventional nodes included by the triplets, and setting edges between the interaction nodes and two designated target nodes; the designated target node is a target node of which the interaction relationship represented by the interaction node corresponds to two targets.
6. The method of claim 5, wherein the determining initial features for each node in the graph network model based on the individual features and the interaction features comprises:
The initial characteristics of the target node are the individual characteristics of the target corresponding to the node;
the initial characteristics of the interactive nodes are interactive characteristics of the corresponding interactive relations of the nodes;
The initial characteristic of the factor node is the average initial characteristic of three nodes included in the triplet of the factor node.
7. The method of claim 5, wherein, when iteratively updating the characteristics of each node in the graph network model,
Determining the characteristics of corresponding factor nodes after the current iteration based on the factor node characteristics determined after the previous iteration of each factor node and the node characteristics determined after the previous iteration of each conventional node with edges between the factor node and the factor node;
and determining the characteristics of the corresponding regular nodes after the current iteration based on the node characteristics determined after the previous iteration of each regular node and the node characteristics determined after the previous iteration of each factor node with edges between the factor nodes and the regular nodes.
8. The method of claim 5, wherein the updated interaction characteristics are updated output node characteristics of the interaction node;
The method further comprises the steps of:
and determining the liveness of each target based on the updated output node characteristics of the target nodes, and determining key targets based on the liveness.
9. The method according to claim 1, characterized in that the method further comprises:
and determining parameters for updating the characteristics of the graph network model, parameters for encoding the characteristics of the output nodes and parameters for group division and identification through a training process in advance.
10. The method of claim 9, wherein the loss function of the training process is determined based on a similarity loss of an interaction adjacency matrix and a classification loss of whether each object is a keypoint, wherein an element in the interaction adjacency matrix represents whether there is interaction between the any two objects.
11. A population detection apparatus, the apparatus comprising: the device comprises a feature determining unit, a first language model processing unit, an interactive feature encoder and a group dividing unit;
The characteristic determining unit is used for determining individual characteristics of each target in the image to be detected;
the first language model processing unit is used for extracting characteristics of externally input interaction prompt words by utilizing the first language model to obtain interaction prompt word characteristics;
the interaction feature encoder is used for encoding the individual features based on the interaction prompt word features by utilizing a pre-trained interaction feature encoding model to obtain interaction features for indicating interaction relations between any two targets in the targets;
the group dividing unit is used for dividing and identifying groups based on the interaction characteristics;
wherein the interactive feature coding model is a Transformer model;
in the interactive feature encoder, the encoding the individual feature based on the interactive prompt word feature includes:
Performing position coding based on the position relation among the targets, fusing a position coding result, the interaction prompt word characteristics and the individual characteristics, taking the fused characteristics as input of the interaction characteristic coding model, and performing coding processing by using the interaction characteristic coding model; or alternatively
Performing position coding based on the position relation among the targets, fusing a position coding result and the individual features, and processing the fused features as the input of the interactive feature coding model; before self-attention processing is performed in the interactive feature coding model, grouping the individual features based on the interactive prompt word features, wherein the self-attention processing in the interactive feature coding model is performed based on the grouping result;
and/or,
The apparatus further comprises a graph network processing unit and a group encoding unit;
The graph network processing unit is used for constructing a graph network model; the unit is also used for determining initial features of each node in the graph network model based on the individual features and the interaction features, iteratively updating the features of each node in the graph network model, and determining output node features of each node, so that the nodes learn from each other; the nodes of the graph network model comprise the targets, interaction relations between any two targets, and group relations, wherein the group relations are used for representing consistency relations between different interaction relations;
The group coding unit is used for coding the output node features under the guidance of externally input group prompt words and determining updated output node features;
the group dividing unit is used for dividing and identifying groups based on updated interaction features included in the updated output node features.
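As an illustrative aside (not part of the claims), the first fusion alternative of claim 11 — position coding of the targets, fusion with the interaction prompt word features and the individual features, then Transformer-style self-attention — could be sketched as follows. The sinusoidal position code, the additive fusion, and the single unparameterized attention layer are all simplifying assumptions; the claim does not fix any of them.

```python
import numpy as np

def sinusoidal_pos_encoding(positions, dim):
    # positions: (N, 2) center coordinates of each target's bounding box.
    # Standard sinusoidal encoding per coordinate, concatenated; dim must be divisible by 4.
    enc = []
    for p in positions:
        parts = []
        for coord in p:
            freqs = 1.0 / (10000 ** (np.arange(dim // 4) * 4.0 / dim))
            parts.append(np.sin(coord * freqs))
            parts.append(np.cos(coord * freqs))
        enc.append(np.concatenate(parts))
    return np.stack(enc)  # (N, dim)

def fuse_and_encode(individual_feats, prompt_feat, positions):
    # individual_feats: (N, D); prompt_feat: (D,); positions: (N, 2)
    N, D = individual_feats.shape
    pos_enc = sinusoidal_pos_encoding(positions, D)
    fused = individual_feats + pos_enc + prompt_feat  # additive fusion (one option)
    # A single self-attention pass standing in for the Transformer encoder.
    scores = fused @ fused.T / np.sqrt(D)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ fused  # (N, D) interaction-aware features
```

A real implementation would use a learned multi-layer Transformer encoder; this sketch only shows how the three feature sources can be combined before self-attention.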
12. The apparatus of claim 11, wherein when the interaction feature coding model is a Transformer model, the performing, in the group dividing unit, of group division and identification based on the interaction features comprises:
Inputting the interaction features into a pre-trained behavior inference model, and classifying the interaction features with the behavior inference model to predict a group classification result; predicting, based on the interaction features, the probability that an interaction relation exists between any two targets, and taking any two targets whose probability is greater than or equal to an interaction threshold as a pair of targets having an interaction relation;
dividing groups based on all the determined interaction relations, placing targets having an interaction relation into the same group, and taking the group classification result as the classification and recognition result of the divided groups;
and/or,
When the apparatus further comprises the graph network processing unit and the group coding unit, the group dividing unit is further used for determining the activity level of each target based on the updated output node features of the target nodes, and determining key targets based on the activity levels.
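The thresholding-and-grouping step of claim 12 — any pair whose predicted interaction probability reaches the threshold shares an interaction relation, and connected targets fall into the same group — can be illustrated with a small union-find sketch. The function name and the 0.5 default threshold are assumptions, not taken from the claims.

```python
def group_by_interactions(probs, threshold=0.5):
    # probs: (N, N) symmetric matrix of predicted pairwise interaction probabilities.
    n = len(probs)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            if probs[i][j] >= threshold:  # pair with an interaction relation
                pairs.append((i, j))
                union(i, j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return pairs, list(groups.values())
```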
13. The apparatus of claim 11, further comprising a second language model processing unit configured to perform feature extraction on the individual prompt words using the second language model to obtain individual prompt word features;
The device further comprises an individual feature encoder, wherein the individual feature encoder is used for encoding the individual feature based on the guidance of the individual prompt word feature to obtain updated individual feature;
When the interactive feature encoder codes the individual features with the interaction feature coding model, the coding is performed based on the updated individual features;
when the initial features of each node in the graph network model are determined in the graph network processing unit, the determination is performed based on the updated individual features;
and/or,
And when the individual feature encoder is trained, the similarity between the updated individual features output by the encoder and the individual prompt word features is compared to determine a loss function of the individual feature encoder.
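The similarity comparison in claim 13's training loss is commonly realised as a cosine-similarity loss between the encoder output and the prompt word features; the sketch below assumes that formulation, which the claim itself does not fix.

```python
import numpy as np

def prompt_alignment_loss(updated_feats, prompt_feats):
    # updated_feats, prompt_feats: (N, D), one prompt feature per target.
    # Loss is 1 - cosine similarity, averaged over targets: 0 when perfectly aligned.
    a = updated_feats / np.linalg.norm(updated_feats, axis=1, keepdims=True)
    b = prompt_feats / np.linalg.norm(prompt_feats, axis=1, keepdims=True)
    cos = np.sum(a * b, axis=1)
    return float(np.mean(1.0 - cos))
```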
14. The apparatus of claim 11, wherein the graph network model is a factor graph model;
the nodes of the factor graph comprise conventional nodes and factor nodes;
the conventional nodes comprise target nodes for representing targets and interaction nodes for representing interaction relations between any two targets;
The factor nodes comprise first-type factor nodes and second-type factor nodes; each factor node corresponds to a triplet of three conventional nodes: for a first-type factor node, the triplet comprises any two targets and the interaction relation between them; for a second-type factor node, the triplet comprises the pairwise interaction relations among any three targets;
Setting edges between each factor node and the conventional nodes included in its triplet, and setting edges between each interaction node and its two designated target nodes; the designated target nodes are the target nodes corresponding to the two targets of the interaction relation represented by the interaction node.
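The node and edge layout of the factor graph in claim 14 can be enumerated directly. The tuple-based node naming below ("t" for target, "i" for interaction, "f1"/"f2" for the two factor types) is an illustrative convention, not part of the claim.

```python
from itertools import combinations

def build_factor_graph(n_targets):
    target_nodes = [("t", i) for i in range(n_targets)]
    inter_nodes = [("i", (a, b)) for a, b in combinations(range(n_targets), 2)]
    edges = []
    # Each interaction node connects to its two designated target nodes.
    for _, (a, b) in inter_nodes:
        edges += [(("i", (a, b)), ("t", a)), (("i", (a, b)), ("t", b))]
    factors = []
    # First-type factor: two targets plus the interaction between them.
    for a, b in combinations(range(n_targets), 2):
        f = ("f1", (a, b))
        factors.append(f)
        for m in (("t", a), ("t", b), ("i", (a, b))):
            edges.append((f, m))
    # Second-type factor: the three pairwise interactions among three targets.
    for a, b, c in combinations(range(n_targets), 3):
        f = ("f2", (a, b, c))
        factors.append(f)
        for m in (("i", (a, b)), ("i", (a, c)), ("i", (b, c))):
            edges.append((f, m))
    return target_nodes, inter_nodes, factors, edges
```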
15. The apparatus of claim 14, wherein, in the graph network processing unit,
The determining initial features of each node in the graph network model based on the individual features and the interaction features includes:
the initial features of a target node are the individual features of the target corresponding to that node;
the initial features of an interaction node are the interaction features of the interaction relation corresponding to that node;
the initial features of a factor node are the average of the initial features of the three nodes included in its triplet;
and/or, when the characteristics of each node in the graph network model are updated iteratively,
determining the features of each factor node after the current iteration based on the factor node features determined after the previous iteration and the node features, determined after the previous iteration, of each conventional node having an edge with that factor node;
and determining the features of each conventional node after the current iteration based on the node features of that conventional node determined after the previous iteration and the node features, determined after the previous iteration, of each factor node having an edge with that conventional node.
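The iterative update of claim 15 — each node refreshed from its own previous features and those of its edge-connected neighbours — can be sketched with plain feature averaging standing in for the learned update functions; the averaging rule and the function name are assumptions.

```python
import numpy as np

def message_passing(init_feats, neighbors, n_iters=2):
    # init_feats: dict node -> feature vector; neighbors: dict node -> adjacent nodes.
    # Each iteration computes, for every node, the mean of its own previous
    # feature and its neighbours' previous features (all reads use the
    # previous iteration's values, matching the claim).
    feats = {k: np.asarray(v, dtype=float) for k, v in init_feats.items()}
    for _ in range(n_iters):
        new = {}
        for node, f in feats.items():
            msgs = [feats[m] for m in neighbors.get(node, [])]
            new[node] = np.mean([f] + msgs, axis=0) if msgs else f
        feats = new
    return feats
```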
16. The apparatus according to claim 11, further comprising a joint training unit for determining parameters for feature updating of the graph network model, parameters for encoding the output node features, and parameters for group classification in advance through a training process;
The loss function of the training process is determined based on a similarity loss of the interaction adjacency matrix and a classification loss of whether each target is a key target, wherein an element of the interaction adjacency matrix represents whether an interaction relation exists between the corresponding two targets.
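The joint training loss of claim 16 combines a similarity loss on the interaction adjacency matrix with a classification loss on whether each target is a key target. The sketch below uses binary cross-entropy for both terms and equal weighting; both choices are assumptions the claim does not specify.

```python
import numpy as np

def joint_loss(pred_adj, true_adj, pred_key, true_key, eps=1e-7):
    # pred_adj: (N, N) predicted interaction probabilities; true_adj: 0/1 matrix.
    # pred_key: (N,) predicted probability that each target is a key target.
    def bce(p, y):
        p = np.clip(p, eps, 1 - eps)  # avoid log(0)
        return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))
    return bce(pred_adj, true_adj) + bce(pred_key, true_key)
```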
17. A computer readable storage medium having stored thereon computer instructions, which when executed by a processor, implement the group detection method of any of claims 1 to 10.
18. An electronic device comprising at least a computer-readable storage medium and a processor;
The processor is configured to read executable instructions from the computer-readable storage medium and execute the instructions to implement the group detection method of any one of claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410881274.XA CN118429897B (en) | 2024-07-03 | 2024-07-03 | Group detection method, group detection device, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118429897A (en) | 2024-08-02
CN118429897B (en) | 2024-10-22
Family
ID=92310840
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410881274.XA Active CN118429897B (en) | 2024-07-03 | 2024-07-03 | Group detection method, group detection device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118429897B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118734197A (en) * | 2024-09-03 | 2024-10-01 | 优视科技有限公司 | Information processing method and device and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116702094A (en) * | 2023-08-01 | 2023-09-05 | 国家计算机网络与信息安全管理中心 | Group application preference feature representation method |
CN117789094A (en) * | 2023-12-29 | 2024-03-29 | 内蒙古大学 | Group behavior detection and recognition method and system based on deep learning |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11042814B2 (en) * | 2017-03-17 | 2021-06-22 | Visa International Service Association | Mixed-initiative machine learning systems and methods for determining segmentations |
CN111881968B (en) * | 2020-07-22 | 2024-04-09 | 平安科技(深圳)有限公司 | Multi-task classification method and device and related equipment |
CN112667800A (en) * | 2020-12-21 | 2021-04-16 | 深圳壹账通智能科技有限公司 | Keyword generation method and device, electronic equipment and computer storage medium |
CN113962315B (en) * | 2021-10-28 | 2023-12-22 | 北京百度网讯科技有限公司 | Model pre-training method, device, equipment, storage medium and program product |
CN114842411A (en) * | 2022-04-02 | 2022-08-02 | 深圳先进技术研究院 | Group behavior identification method based on complementary space-time information modeling |
CN116704433A (en) * | 2023-05-25 | 2023-09-05 | 中科(黑龙江)数字经济研究院有限公司 | Self-supervision group behavior recognition method based on context-aware relationship predictive coding |
CN117421390A (en) * | 2023-10-16 | 2024-01-19 | 天翼数字生活科技有限公司 | Target group determination method, system, device and storage medium |
CN117523671A (en) * | 2023-11-23 | 2024-02-06 | 深圳大学 | Group behavior recognition method and system based on deep learning |
CN118194161A (en) * | 2024-03-29 | 2024-06-14 | 浙江工业大学 | Group activity recognition method based on frequency transducer encoder |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||