Disclosure of Invention
The embodiment of the application mainly aims to provide a target detection method, a model training method, a device, equipment and a storage medium, so as to improve the detection accuracy of a target detection model.
To achieve the above object, a first aspect of the embodiments of the present application provides a target detection method implemented by a target detection model, where the target detection model includes a sensing network, a fusion network and a detection network, the sensing network includes a preset number of data set sensing units connected in sequence, the fusion network includes a preset number of up-sampling units corresponding to the data set sensing units, the detection network includes a preset number of detection heads, at least part of the training process of the target detection model is obtained by collaborative training on data from a plurality of data sets, and different data set sensing units are used to sense image features of different data sets; the method includes:
acquiring an image to be detected;
inputting the image to be detected into the sensing network to sequentially obtain the sensing features of each data set sensing unit; the data set sensing unit comprises a number of sensing layers equal to the number of the data sets, and the sensing layers are used for generating the sensing features;
the upsampling unit is utilized to obtain upsampling output characteristics of the sensing characteristics, and the sensing characteristics and the upsampling output characteristics are subjected to characteristic fusion to obtain fusion characteristics corresponding to each sensing characteristic;
and respectively inputting the fusion characteristics into the detection heads to obtain corresponding detection results, and obtaining the prediction data of the image to be detected according to a preset number of detection results.
In some embodiments, the sensing layer is a batch normalization layer, and the data set sensing unit comprises: a two-dimensional convolution layer, an activation layer, and a number of batch normalization layers equal to the number of data sets; inputting the image to be detected into the sensing network and sequentially obtaining the sensing features of each data set sensing unit includes:
Acquiring input features of input data by using a two-dimensional convolution layer, and calculating similarity values of the input features and each data set;
selecting the data set with the maximum similarity value as a target data set;
inputting the input data into the batch normalization layer corresponding to the target data set to obtain normalization features;
and inputting the normalized features into the activation layer to obtain the sensing features, wherein the input data of a first data set sensing unit is the image to be detected, and the input data of other data set sensing units is the sensing features of a previous data set sensing unit.
In some embodiments, the sensing network further comprises a pyramid pooling layer, and the data set sensing units are connected with the pyramid pooling layer after being cascaded according to the hierarchy; the up-sampling unit in the fusion network is correspondingly associated with the data set sensing unit; the step of obtaining the upsampled output features of the perceptual features by using the upsampling unit, and performing feature fusion on the perceptual features and the upsampled output features to obtain fusion features corresponding to each perceptual feature, includes:
Inputting the perception features output by the data set perception unit of the last level into the pyramid pooling layer to obtain pooling features;
Inputting the pooled features into the up-sampling unit of the last level to obtain the up-sampling output features of the last level;
fusing the sensing features of the same level with the corresponding up-sampling output features to obtain fused features; the fusion feature of the next level is the input of the up-sampling unit of the previous level, and the output of the up-sampling unit is the up-sampling output feature.
In some embodiments, the image to be detected includes an object to be detected; the detection head includes: a classifier and a bounding box regressor; the step of inputting the fusion features into the detection heads to obtain corresponding detection results, and obtaining the predicted data of the image to be detected according to the preset number of detection results, comprises the following steps:
inputting the fusion characteristics into the classifier to obtain a classification predicted value;
inputting the fusion characteristics into the boundary box regressor to obtain a boundary box predicted value;
voting on a preset number of the classification predicted values, and taking the classification predicted value with the largest number of votes as the category information of the target to be detected;
obtaining the intersection of a preset number of boundary box predicted values to obtain an intersection box, and taking the position corresponding to the intersection box as the position information of the target to be detected;
and obtaining the prediction data according to the category information and the position information.
In some embodiments, at least part of the training process of the target detection model is obtained by data co-training of a plurality of data sets, including:
acquiring the original data and the original labels of a plurality of data sets to generate a label tree, and acquiring training samples and sample labels according to the label tree; the sample tag includes: classification labels and bounding box labels;
Inputting the training sample into the target detection model for data processing to obtain fusion training characteristics, and inputting the fusion training characteristics into a detection network to obtain a classification training value and a bounding box training value;
Calculating to obtain a first loss value according to the classification training value and the classification label, and calculating to obtain a confidence loss value and a positioning loss value according to the boundary box training value and the boundary box label;
obtaining a total loss value according to the first loss value, the confidence loss value and the positioning loss value;
and adjusting the model weight of the target detection model according to the total loss value until reaching the iteration termination condition to obtain the trained target detection model.
In some embodiments, the calculating the first loss value according to the classification training value and the classification label includes:
performing random Fourier transform on the fusion training characteristics to obtain a random Fourier characteristic diagram;
calculating sample weight according to the random Fourier feature map;
And obtaining a first loss value according to the classification training value and the classification label, and updating the first loss value according to the sample weight.
In some embodiments, the original label includes an original classification label and an original bounding box label; the step of obtaining the original data and the original labels of a plurality of data sets to generate a label tree, and obtaining training samples and sample labels according to the label tree comprises the following steps:
classifying the original data to obtain classification results of preset levels, and constructing the label tree according to the classification results, wherein the label tree comprises parent and child nodes corresponding to the preset levels, the parent and child nodes comprise parent nodes and child nodes, and the attribute of a child node is a subset of the attribute of its parent node;
selecting a target node from the label tree to generate the training sample;
acquiring a node path from the target node corresponding to the training sample to a root node of the label tree, and calculating to obtain the classification labels of the training sample according to the original classification labels of all nodes on the node path;
And taking the original boundary box label of the target node as the boundary box label of the training sample.
To achieve the above object, a second aspect of the embodiments of the present application provides a training method of an object detection model, where the object detection model includes a sensing network, a fusion network, and a detection network, and the method includes:
acquiring the original data and the original labels of a plurality of data sets to generate a label tree, and acquiring training samples and sample labels according to the label tree; the sample tag includes: classification labels and bounding box labels;
Inputting the training samples into the sensing network and the fusion network for data processing to obtain fusion training characteristics, and inputting the fusion training characteristics into the detection network to obtain a classification training value and a bounding box training value;
Calculating to obtain a first loss value according to the classification training value and the classification label, and calculating to obtain a confidence loss value and a positioning loss value according to the boundary box training value and the boundary box label;
obtaining a total loss value according to the first loss value, the confidence loss value and the positioning loss value;
and adjusting the model weight of the target detection model according to the total loss value until reaching the iteration termination condition to obtain the trained target detection model.
In some embodiments, the original label includes an original classification label and an original bounding box label; the step of obtaining the original data and the original labels of a plurality of data sets to generate a label tree, and obtaining training samples and sample labels according to the label tree comprises the following steps:
classifying the original data to obtain classification results of preset levels, and constructing the label tree according to the classification results, wherein the label tree comprises parent and child nodes corresponding to the preset levels, the parent and child nodes comprise parent nodes and child nodes, and the attribute of a child node is a subset of the attribute of its parent node;
selecting a target node from the label tree to generate the training sample;
acquiring a node path from the target node corresponding to the training sample to a root node of the label tree, and calculating to obtain the classification labels of the training sample according to the original classification labels of all nodes on the node path;
And taking the original boundary box label of the target node as the boundary box label of the training sample.
To achieve the above object, a third aspect of the embodiments of the present application provides an object detection device, which is implemented by an object detection model, where the object detection model includes a sensing network, a fusion network, and a detection network, the sensing network includes a preset number of data set sensing units connected in sequence, the fusion network includes a preset number of up-sampling units corresponding to the data set sensing units, the detection network includes a preset number of detection heads, and different data set sensing units are used to sense image features of different data sets, and the device includes:
an image acquisition module, configured to acquire an image to be detected;
a sensing module, configured to input the image to be detected into the sensing network to sequentially obtain the sensing features of each data set sensing unit;
a fusion module, configured to obtain up-sampling output features of the sensing features by using the up-sampling units, and to perform feature fusion on the sensing features and the up-sampling output features to obtain a fusion feature corresponding to each sensing feature;
and a detection module, configured to respectively input the fusion features into the detection heads to obtain corresponding detection results, and to obtain the prediction data of the image to be detected according to a preset number of detection results.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes an electronic device, including a memory storing a computer program and a processor implementing the method according to the first aspect or the second aspect when the processor executes the computer program.
To achieve the above object, a fifth aspect of the embodiments of the present application proposes a storage medium, which is a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method of the first aspect or the second aspect.
According to the target detection method, the model training method, the device, the equipment and the storage medium, the image to be detected is acquired and is input into the sensing network to obtain the sensing characteristics of each data set sensing unit, then the sensing characteristics and the output characteristics of the corresponding up-sampling unit are subjected to characteristic fusion to obtain fusion characteristics corresponding to each sensing characteristic, finally the fusion characteristics are respectively input into the detection heads to obtain corresponding detection results, and then the prediction data of the image to be detected is obtained according to the preset number of detection results. According to the embodiment of the application, at least part of the training process of the target detection model is obtained by data collaborative training of a plurality of data sets, and different data set sensing units are utilized to sense the image characteristics of different data sets, so that the generalization capability of the target detection model can be improved by comprehensively utilizing the information of different data sets, and better detection performance can be obtained on a plurality of target data sets.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
First, several terms involved in the present invention are explained:
Artificial Intelligence (AI): a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. Artificial intelligence is also a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Object detection is one of the important and challenging tasks in computer vision; its goal is to enable a machine to identify and locate objects in images or videos, and it plays an important role in fields such as surveillance security, face recognition and automatic driving. Because a single target detection data set often contains an insufficient number of samples, training across data sets is required to increase the number of samples when training the target detection model.
The related art combines data sets of similar scenes into a hybrid data set to train a deep learning model, relying on the characteristics of the data itself to resolve category duplicates and conflicts between different data sets. However, different data sets may come from different scenes, environments or devices; owing to differences in camera intrinsic parameters, extrinsic parameters, scenes and other factors, their data distributions may differ greatly. After mixing, these differences may introduce data distribution bias, so the accuracy on the hybrid data set is poor, which negatively affects the performance of the model and its generalization capability.
Based on this, the embodiment of the invention provides a target detection method, a model training method, a device, equipment and a storage medium, at least part of the training process of a target detection model is obtained by data collaborative training of a plurality of data sets, and different data set sensing units are utilized to sense image features of different data sets, so that the embodiment can comprehensively utilize information of different data sets to improve generalization capability of the target detection model, and can obtain better detection performance on a plurality of target data sets.
The embodiments of the invention provide a target detection method, a model training method, a device, equipment and a storage medium, which are specifically explained through the following embodiments; the target detection method in the embodiments of the invention is described first.
The embodiment of the invention can acquire and process the related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is thus the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level technologies and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The embodiment of the invention provides a target detection method, which relates to the technical field of artificial intelligence, and particularly to the technical field of data mining. The target detection method provided by the embodiment of the invention can be applied to a terminal, a server, or a computer program running in the terminal or the server. For example, the computer program may be a native program or a software module in an operating system; it may be a native application (APP), i.e. a program that needs to be installed in an operating system to run, such as a client supporting target detection; it may also be an applet, i.e. a program that only needs to be downloaded into a browser environment to run, or an applet that can be embedded in any APP. In general, the computer program described above may be any form of application, module or plug-in. The terminal communicates with the server through a network. The target detection method may be performed by a terminal or a server, or by a terminal and a server in cooperation.
In some embodiments, the terminal may be a smart phone, a tablet, a notebook, a desktop computer, a smart watch or the like. The server may be an independent server, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), and basic cloud computing services such as big data and artificial intelligence platforms; it may also be a service node in a blockchain system, where the service nodes form a peer-to-peer (P2P) network, and the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP). The server may host the target detection system and interact with the terminal through it; for example, the server may be provided with corresponding software, which may be an application implementing the target detection method, but is not limited to the above forms. The terminal and the server may be connected through Bluetooth, USB (Universal Serial Bus), a network or other communication connection modes, which is not limited in this embodiment.
The invention is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In the embodiments of the present application, when related processing is performed according to user information, user behavior data, user history data, user location information, and other data related to user identity or characteristics, permission or consent of the user is obtained first, and the collection, use, processing, and the like of the data comply with related laws and regulations and standards of related countries and regions. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through popup or jump to a confirmation page and the like, and after the independent permission or independent consent of the user is definitely acquired, the necessary relevant data of the user for enabling the embodiment of the application to normally operate is acquired.
First, a target detection model in an embodiment of the present invention will be described.
Referring to fig. 1, a schematic structural diagram of an object detection model according to an embodiment of the present application is shown.
In one embodiment, the object detection model is used to implement an object detection method, wherein the object detection model 10 includes: the perception network 100, the fusion network 200, and the detection network 300, wherein at least a portion of the training process of the object detection model 10 is derived from data co-training of multiple data sets. The sensing network 100 is configured to receive an image to be detected, and the detecting network 300 outputs a detection result.
Specifically, referring to fig. 2, the sensing network 100 includes: a preset number of data set sensing units 110 connected in sequence and a pyramid pooling layer 120. The preset number is denoted N; in the figure, N=3 is taken as an example, set to detect feature information at three different levels of the image, namely near view, middle view and far view. It can be understood that the larger the preset number, the finer the detection granularity in the target detection process. This embodiment does not limit the preset number, which may be set according to the operation requirements and the performance of the computing device.
In an embodiment, referring to fig. 2, the data set sensing unit 110 includes a two-dimensional convolution layer 111, M sensing layers and an activation layer 113, where the sensing layers are specifically batch normalization layers 112, and M is the number of data sets that need to be processed cooperatively in the embodiment of the present application. The size of M is not limited; M is independent of the preset number N (they may be equal or different), and M only needs to be an integer greater than 1. Data entering the data set sensing unit 110 selects one of the batch normalization layers 112, and the output of that batch normalization layer 112 is then input into the activation layer 113. In this embodiment, the two-dimensional convolution layer 111 is first used to extract isomorphic convolution network parameters of the input data; the batch normalization layer 112 then standardizes them, while introducing learnable parameters to adjust the mean and variance of the data. The batch normalization layer 112 calculates the mean and variance of each small batch of training samples as the normalization reference during training, which can alleviate gradient vanishing or gradient explosion and accelerate model training. In one embodiment, the activation layer is a SiLU activation function that non-linearly transforms the input during forward propagation.
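For illustration only, the following PyTorch-style sketch shows one possible way to organize such a data set sensing unit: a shared two-dimensional convolution, M parallel batch normalization layers (one per data set) and a SiLU activation. The class name, the channel arguments and the externally supplied dataset_idx are assumptions made for this sketch and are not taken from the application.

```python
import torch
import torch.nn as nn

class DatasetSensingUnit(nn.Module):
    """Hypothetical sketch of a data set sensing unit (names and arguments assumed)."""
    def __init__(self, in_channels: int, out_channels: int, num_datasets: int):
        super().__init__()
        # Isomorphic (shared) convolution parameters.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1)
        # One batch normalization (sensing) layer per data set.
        self.bn_layers = nn.ModuleList([nn.BatchNorm2d(out_channels) for _ in range(num_datasets)])
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor, dataset_idx: int) -> torch.Tensor:
        feat = self.conv(x)                       # extract input features
        feat = self.bn_layers[dataset_idx](feat)  # data-set-specific normalization
        return self.act(feat)                     # sensing feature
```

In this sketch the index of the target data set is supplied externally; the similarity-based selection described in steps S121 to S122 below would compute it from the input features.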
In one embodiment, referring to fig. 3, the fusion network 200 includes: N up-sampling units 210 corresponding to the data set sensing units 110, where each data set sensing unit 110 is associated with a corresponding up-sampling unit 210. As shown in fig. 1, the 1st data set sensing unit 110 is configured to receive the image to be detected and sense it to obtain a sensing feature, which is input into the next data set sensing unit 110, until the Nth sensing feature is output by the Nth data set sensing unit 110. The Nth sensing feature enters the pyramid pooling layer 120 to obtain a pooling feature, and the pooling feature output by the pyramid pooling layer 120 enters the Nth up-sampling unit 210 of the fusion network 200 for up-sampling to obtain the Nth up-sampling output feature. The Nth sensing feature is then fused with the Nth up-sampling output feature to obtain the Nth fusion feature, which serves as the input of the (N-1)th up-sampling unit 210; the computation proceeds level by level until the 1st up-sampling output feature output by the 1st up-sampling unit 210 is obtained.
In one embodiment, the detection network 300 includes: N detection heads 310. The N fusion features obtained in fig. 3 are each input into a corresponding detection head 310 to perform target detection at different levels, including near view, middle view and far view; the detection result of each detection head is obtained, and the prediction data of the image to be detected is obtained according to the detection results of the detection heads.
From the above, the object detection model in the embodiment of the present application can comprehensively utilize information of different data sets to improve generalization capability of the object detection model, so that better detection performance can be obtained on a plurality of object data sets.
The following describes a target detection method in the embodiment of the present invention.
Fig. 4 is an alternative flowchart of a target detection method according to an embodiment of the present invention; the method in fig. 4 may include, but is not limited to, steps S110 to S140. It should be understood that the order of steps S110 to S140 in fig. 4 is not particularly limited, and the order of steps may be adjusted, or some steps may be reduced or added, according to actual requirements.
Step S110: and acquiring an image to be detected.
In an embodiment, the image to be detected includes a target to be detected; the image is input into the target detection model of the embodiment of the present application to obtain a detection result of the target to be detected, and the type of the target to be detected is not limited. The image to be detected may be uploaded to the processing device by a user for detection, or obtained from a public database or a web page through data capture; this embodiment does not limit the method of obtaining the image to be detected.
Step S120: inputting the image to be detected into a sensing network to sequentially obtain the sensing characteristics of each data set sensing unit.
In an embodiment, the data set sensing unit comprises a number of sensing layers corresponding to the number of data sets, the sensing layers being arranged to generate the sensing features. After the data enters the data set sensing unit, the closest sensing layer is selected from the sensing layers to be input. Referring to fig. 5, step S120 includes the steps of:
Step S121: and acquiring input characteristics of the input data by using the two-dimensional convolution layer, and calculating similarity values of the input characteristics and each data set.
In one embodiment, after the input data is obtained, the two-dimensional convolution layer performs feature extraction on it through a series of convolution operations to obtain the input features. For example, a convolution kernel matrix is slid over the input data to perform the convolution operation; the kernel is generally square, such as 3x3 or 5x5. Specifically, the kernel is slid from the upper-left corner of the input data, moving a fixed step length each time; at each position, the kernel is multiplied element-wise with the pixel values of the corresponding area of the input data and the results are summed to obtain a single value, and the matrix formed by these values at the corresponding positions constitutes the output, i.e. the input features.
In an embodiment, each dataset includes a plurality of image data, and for each dataset, the extracted portion of representative feature vectors form a set, and a feature matching algorithm is used to compare the input data with the feature vectors of the feature vector set in each dataset, and a similarity value for each dataset is obtained according to the matching result. The similarity value is mainly used for measuring the similarity between the distribution of the input data and the distribution of the feature vectors in the data set.
Step S122: and selecting the data set with the maximum similarity value as a target data set.
Step S123: and inputting the input data into a batch normalization layer corresponding to the target data set to obtain normalization features.
Step S124: and inputting the normalized features into an activation layer to obtain the perception features.
In one embodiment, the data set with the highest similarity value, that is, the data set to which the input data is most likely to belong, is selected as the target data set. It can be understood that, since the data set sensing units are cascaded, the input data of the first data set sensing unit is the image to be detected, and the input data of the other data set sensing units is the sensing characteristic of the previous data set sensing unit.
In an embodiment, referring to fig. 6, it is assumed that there are 4 cascaded data set sensing units, which are in turn data set sensing unit S1, data set sensing unit S2, data set sensing unit S3 and data set sensing unit S4; each data set sensing unit has 1 two-dimensional convolution layer, 3 batch normalization layers and one activation layer, and the 3 batch normalization layers correspond to data set G1, data set G2 and data set G3, respectively. Firstly, the image to be detected is input into the data set sensing unit S1, image features of the image to be detected are obtained through the two-dimensional convolution layer, and the similarity values indicate that the distribution of these features is closest to data set G1; therefore data set G1 is taken as the target data set in the data set sensing unit S1, and the features are input into the corresponding batch normalization layer to obtain the sensing feature p1 of the data set sensing unit S1. The sensing feature p1 is then input into the data set sensing unit S2; the similarity values indicate that the distribution of p1 is closest to data set G2, so data set G2 is taken as the target data set in the data set sensing unit S2, and p1 is input into the corresponding batch normalization layer to obtain the sensing feature p2 of the data set sensing unit S2. The inputs proceed in sequence, with data set G1 taken as the target data set in the data set sensing unit S3 and data set G3 taken as the target data set in the data set sensing unit S4. Therefore, the target data sets in different data set sensing units in the embodiment of the present application are not necessarily identical, and are determined according to the distribution of the input data.
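A minimal sketch of the similarity-based selection in steps S121 and S122 is given below. The per-data-set prototype vectors and the use of global average pooling with cosine similarity are assumptions; the application only requires some similarity measure between the input features and the feature vectors of each data set.

```python
import torch
import torch.nn.functional as F

def select_target_dataset(input_feat: torch.Tensor, prototypes: torch.Tensor) -> int:
    """input_feat: (C, H, W) output of the two-dimensional convolution layer;
    prototypes: (M, C) one representative feature vector per data set (assumed)."""
    pooled = input_feat.mean(dim=(1, 2))                                # (C,)
    sims = F.cosine_similarity(pooled.unsqueeze(0), prototypes, dim=1)  # similarity value per data set
    return int(torch.argmax(sims).item())                               # index of the target data set
```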
In one embodiment, the sensing feature output by the batch normalization layer is expressed as:

$$\hat{z}^{(l)} = \gamma \odot \frac{z^{(l)} - \mu^{(l)}}{\sqrt{\left(\sigma^{(l)}\right)^{2} + \epsilon}} + \beta$$

where $z^{(l)}$ denotes the input of the $l$-th batch normalization layer, $\mu^{(l)}$ and $\left(\sigma^{(l)}\right)^{2}$ are the mean and variance of $z^{(l)}$, $\epsilon$ is a small constant for numerical stability, and $\gamma$, $\beta$ denote the scaling and translation parameter vectors, respectively.
The activation layer is a SiLU function, expressed as:

$$\mathrm{SiLU}(x) = \frac{x}{1 + e^{-cx}}$$

where $x$ denotes the input of the activation layer and $c$ denotes an adjustable constant.
Step S130: and obtaining an up-sampling output characteristic of the perception characteristic by utilizing the up-sampling unit, and carrying out characteristic fusion on the perception characteristic and the up-sampling output characteristic to obtain a fusion characteristic corresponding to each perception characteristic.
In one embodiment, referring to fig. 7, step S130 includes the steps of:
Step S131: and inputting the sensing characteristics output by the data set sensing unit of the last hierarchy into a pyramid pooling layer to obtain pooling characteristics.
Step S132: and inputting the pooled features into an up-sampling unit of the last level to obtain up-sampling output features of the last level.
Step S133: and fusing the perception features of the same level with the corresponding up-sampling output features to obtain fusion features.
In an embodiment, referring to fig. 3, the data set sensing units comprise N levels, each level of data set sensing unit being associated with an up-sampling unit whose output is an up-sampling output feature, i.e. each level comprises one data set sensing unit and one up-sampling unit. The 1st data set sensing unit receives the image to be detected and senses it to obtain sensing feature 1; the sensing feature of the 1st data set sensing unit is then input into the 2nd data set sensing unit to obtain sensing feature 2, and the sensing features are input level by level until sensing feature N of the Nth data set sensing unit is obtained. Sensing feature N is input into the pyramid pooling layer to obtain pooling feature C, and pooling feature C is then input into the up-sampling unit of the Nth level to obtain the up-sampling output feature DN of the Nth level. Within the same level, the sensing feature and the corresponding up-sampling output feature are fused to obtain the fusion feature. For example, the fusion feature of the Nth level is obtained by adding sensing feature N and up-sampling output feature DN; the fusion feature of the ith level is obtained by adding sensing feature i of the ith data set sensing unit and up-sampling output feature Di of the up-sampling unit of the ith level, and so on, so that the fusion feature corresponding to each level can be obtained. Meanwhile, the fusion feature of the next level is the input of the up-sampling unit of the previous level; for example, the input of the up-sampling unit of the (N-1)th level is the fusion feature of the Nth level, and so on, and the input of the up-sampling unit of the 1st level is the fusion feature of the 2nd level.
Through the process, the fusion characteristics of each level can be obtained, and the fusion characteristics comprise information obtained through perception and up-sampling.
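The top-down fusion described above can be sketched as follows (illustrative only; the internal structure of the up-sampling units is not specified in this section, so they are treated as given callables, and nearest-neighbour interpolation is used only to guarantee matching spatial sizes).

```python
import torch.nn.functional as F

def fuse_features(sensing_feats, pooled_feat, upsample_units):
    """sensing_feats: [p1, ..., pN] from level 1 to level N;
    pooled_feat: output of the pyramid pooling layer;
    upsample_units: [u1, ..., uN], one up-sampling unit per level."""
    n = len(sensing_feats)
    fused = [None] * n
    x = pooled_feat                                   # input of the Nth up-sampling unit
    for level in range(n - 1, -1, -1):                # level N down to level 1
        up = upsample_units[level](x)                 # up-sampling output feature of this level
        up = F.interpolate(up, size=sensing_feats[level].shape[-2:], mode="nearest")
        fused[level] = sensing_feats[level] + up      # fusion feature of this level
        x = fused[level]                              # input of the up-sampling unit one level up
    return fused
```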
Step S140: and respectively inputting the fusion characteristics into the detection heads to obtain corresponding detection results, and obtaining prediction data of the image to be detected according to a preset number of detection results.
In an embodiment, the image to be detected includes an object to be detected, referring to fig. 8, step S140 includes the following steps:
step S141: and inputting the fusion characteristics into a classifier to obtain a classification predicted value.
In one embodiment, the purpose of the classifier is to determine the category of the target to be detected in the fusion feature, such as specific categories like "Border Collie" or "bicycle". The classification prediction value is a specific class label or a probability distribution, where the probability distribution represents the likelihood of each class.
Step S142: and inputting the fusion characteristics into a boundary box regressor to obtain a boundary box predicted value.
In an embodiment, the purpose of the bounding box regressor is to determine information such as the position and the size of the object to be measured in the fusion feature, and the bounding box predicted value may be a coordinate representation of the upper left corner and the lower right corner of the bounding box, or a center point coordinate, a width, and a height representation of the bounding box.
It is to be understood that the order between the above-described step S141 and step S142 is not particularly limited. Similarly, in this embodiment, the detection heads may correspond to different detection levels including near view, middle view and far view; for example, if the preset number is set to three, three detection heads are set to correspond to the near view, the middle view and the far view respectively. The detection head for the near view obtains a detection result of the near view, which includes a bounding box prediction value and a classification prediction value of the near view; the detection head for the middle view obtains a detection result of the middle view, which includes a bounding box prediction value and a classification prediction value of the middle view; and the detection head for the far view obtains a detection result of the far view, which includes a bounding box prediction value and a classification prediction value of the far view.
Step S143: voting is carried out on a preset number of classification predicted values, and the classification predicted value with the largest number of votes is used as the class information of the object to be measured.
In an embodiment, the classification prediction value with the largest number of times is selected from the different classification prediction values as the class information of the object to be measured. For example, the classification prediction values corresponding to the close view, the middle view and the far view all judge that the object to be detected is a cat, and then the class information of the object to be detected is the cat.
Step S144: and solving intersections of the predicted values of the preset number of bounding boxes to obtain intersection frames, and taking the positions corresponding to the intersection frames as position information of the target to be detected.
In an embodiment, range intersections are obtained from the predicted values of different bounding boxes corresponding to the close range, the middle range and the far range, and an intersection frame is obtained, so that position information of the target to be detected is obtained according to the positions corresponding to the intersection frame. It can be appreciated that the bounding box predictors of different detection levels need to be normalized and then the intersection is found.
Step S145: and obtaining prediction data according to the category information and the position information.
In an embodiment, the order between step S143 and step S144 is not specifically limited. The plurality of detection results are fused to obtain the prediction data of the image to be detected. The result fusion comprises two parts: voting is carried out on the classification prediction values, and the classification prediction value with the largest number of votes is taken as the category of the target to be detected; the intersection of the bounding box prediction values is obtained, and the position corresponding to the intersection box is taken as the position of the target to be detected. The detection result is obtained according to the category, position, size and other information of the target to be detected and is marked in the image to be detected, for example by annotating the position, size and category of the target to be detected.
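A compact sketch of the result fusion in steps S143 to S145 is shown below; the (x1, y1, x2, y2) box format and the helper names are assumptions.

```python
from collections import Counter

def fuse_detections(class_preds, boxes):
    """class_preds: one class label per detection head; boxes: one (x1, y1, x2, y2) box per head."""
    # Category information: the class predicted by the largest number of detection heads.
    category = Counter(class_preds).most_common(1)[0][0]
    # Position information: the intersection of all predicted boxes.
    x1 = max(b[0] for b in boxes)
    y1 = max(b[1] for b in boxes)
    x2 = min(b[2] for b in boxes)
    y2 = min(b[3] for b in boxes)
    position = (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None  # None if the boxes do not overlap
    return {"category": category, "position": position}
```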
The above process describes a process of performing object detection by using information of a plurality of data sets in the embodiment of the present application, and a training process of an object detection model in the embodiment of the present application is described below.
In one embodiment, referring to FIG. 9, at least a portion of the training process of the object detection model is derived from data co-training of multiple data sets, comprising:
step S910: and obtaining the original data of the plurality of data sets and the original labels to generate a label tree, and obtaining training samples and sample labels according to the label tree.
In one embodiment, the original label includes an original classification label and an original bounding box label, and the sample label includes: classification labels and bounding box labels. Referring to fig. 10, step S910 includes the steps of:
Step S911: classifying the original data to obtain a classification result of a preset level, and constructing a tag tree according to the classification result.
If the label sets have the same semantic hierarchy, the data sets can be fused easily. However, the label sets of most data sets have different semantic hierarchies and cannot be fused directly. For example, the Kaggle cat-dog identification data set is labeled only with "cat" or "dog", while the Stanford Dogs data set is labeled with 120 dog categories.
In one embodiment, it is assumed that the raw data come from a plurality of different data sets, such as a person detection data set, an animal detection data set, a vehicle detection data set and a flame detection data set; the animal detection data set in turn includes a cat detection data set, a dog detection data set and so on, and the dog detection data set in turn includes a coarse-granularity label data set and a fine-granularity label data set, where a coarse-granularity label means the original label is "dog", and a fine-granularity label means the original label is a fine classification of dogs, such as "Pomeranian" or "Chihuahua". It can be seen that each data set has its own labels, and the labels are not necessarily mutually exclusive; for example, the original label "dog" and the original label "Teddy" have an inclusion relationship, so the labels of different data sets cannot be directly merged. Therefore, the embodiment of the present application provides a label tree for merging labels across data sets.
In one embodiment, the raw data are classified; each piece of raw data includes an original label, and the classification is performed according to the original classification label. For example, if the data sets include people, cats, dogs, boats, airplanes, cars, flames, tornadoes and so on, they can be divided into the following preset levels:
root level: physical objects (which are real-world objects, corresponding to virtual objects, such as pictorial characters, belong to a virtual object).
First level: animal |artifact| phenomenon, each element in the first hierarchy may include a second hierarchy.
Second level: animals (human |cat| dogs), artifacts (ship|airplane| car), phenomena (flame|tornado), each element in the second level may comprise a third level, which is illustrated in fig. 12 by way of example for cats, dogs, and cars in the second level, and does not represent that only cats, dogs, and cars comprise the third level.
Third level: … cat (blue cat |Burmesne cat |siamese cat| dog (Tadi|Ji doll|Bim|Bomei), …, car (bicycle|truck|automobile|motorcycle) …, and the like.
It can be appreciated that the preset level may be set according to actual requirements, and the embodiment is not limited thereto. While the labels represented by each node may correspond to multiple raw data.
Referring to fig. 11, the elements of each level are represented as nodes, and the attribute of each node is the content of the element; for example, the attribute of the node "bicycle" in the third level is "bicycle". The label tree is drawn according to the preset level relationship. The nodes in the label tree comprise a root node and other parent and child nodes; the content of the root level is the root node, the parent and child nodes comprise parent nodes and child nodes, and the attribute of a child node is a subset of the attribute of its parent node. It is understood that a parent node may itself be a child node of the level above.
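As a minimal illustration of this structure (the class name and the chosen subset of nodes are assumptions), the label tree can be represented as nodes that store an attribute, a parent and a list of children:

```python
class LabelNode:
    """Hypothetical node of the label tree: the attribute of a child node is a subset of its parent's."""
    def __init__(self, attribute, parent=None):
        self.attribute = attribute
        self.parent = parent
        self.children = []  # child LabelNode objects
        if parent is not None:
            parent.children.append(self)

# Partial reconstruction of the example tree above.
root = LabelNode("physical object")
animal = LabelNode("animal", root)
artifact = LabelNode("artifact", root)
dog = LabelNode("dog", animal)
vehicle = LabelNode("vehicle", artifact)
bicycle = LabelNode("bicycle", vehicle)
border_collie = LabelNode("Border Collie", dog)
```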
Step S912: and selecting a target node from the label tree to generate a training sample.
In an embodiment, any number of nodes are selected from the label tree as target nodes, one or more pieces of original data corresponding to labels of each target node are obtained, and the original data are used as training samples.
Step S913: and obtaining a node path from the target node corresponding to the training sample to the root node of the label tree, and calculating to obtain the classification label of the training sample according to the original classification labels of all the nodes on the node path.
In one embodiment, the classification label is the product of the conditional probabilities corresponding to the original classification labels of all nodes on the node path. Referring to fig. 12, assuming that the selected target node is "bicycle", the node path from the target node to the root node is: "bicycle" - "vehicle" - "artifact" - "physical object". Since each original classification label corresponds to a conditional probability, the classification label of the training sample "bicycle" is computed as:

$$p_{\mathrm{bicycle}} = p(\mathrm{bicycle}\mid\mathrm{vehicle}) \times p(\mathrm{vehicle}\mid\mathrm{artifact}) \times p(\mathrm{artifact}\mid\mathrm{physical\ object}) \times p(\mathrm{physical\ object})$$

where p(bicycle|vehicle) is the probability of the bicycle class given that the object is judged to be a vehicle; p(vehicle|artifact) is the probability of the vehicle class given that the object is judged to be an artifact; p(artifact|physical object) is the probability of the artifact class given that a physical object is present; and p(physical object) is the probability that a physical object is contained.
For another example, taking the target node to be "Border Collie", the node path from the target node to the root node is: "Border Collie" - "dog" - "animal" - "physical object". The classification label of the training sample "Border Collie" is therefore computed as:

$$p_{\mathrm{Border\ Collie}} = p(\mathrm{Border\ Collie}\mid\mathrm{dog}) \times p(\mathrm{dog}\mid\mathrm{animal}) \times p(\mathrm{animal}\mid\mathrm{physical\ object}) \times p(\mathrm{physical\ object})$$

where p(Border Collie|dog) is the probability of the Border Collie class given that the object is judged to be a dog; p(dog|animal) is the probability of the dog class given that the object is judged to be an animal; p(animal|physical object) is the probability of the animal class given that a physical object is present; and p(physical object) is the probability that a physical object is contained.
Therefore, the embodiment of the application correlates different original classification labels through the label tree, so that different data sets with different label sets can be organically fused.
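Reusing the LabelNode sketch above, the path-product computation of step S913 can be illustrated as follows; the conditional probabilities are placeholders, since their actual values come from the original classification labels.

```python
def class_label(node, cond_prob):
    """cond_prob maps (child_attribute, parent_attribute) -> p(child | parent);
    the root probability is stored under (root_attribute, None)."""
    prob = 1.0
    while node is not None:
        parent_attr = node.parent.attribute if node.parent is not None else None
        prob *= cond_prob[(node.attribute, parent_attr)]
        node = node.parent
    return prob

# Example: p_bicycle = p(bicycle|vehicle) * p(vehicle|artifact) * p(artifact|physical object) * p(physical object)
```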
Step S914: and taking the original bounding box label of the target node as a bounding box label of the training sample.
In an embodiment, the original bounding box label of the target node is used as a bounding box label of the training sample, and the sample label of the training sample is obtained by combining the classification labels.
Step S920: and inputting the training sample into a target detection model for data processing to obtain fusion training characteristics, and inputting the fusion training characteristics into a detection network to obtain a classification training value and a bounding box training value.
In one embodiment, referring to fig. 12, the detection head 310 includes: a classifier 311, a bounding box regressor 312 and a sample weight learning module 313. The training samples first pass sequentially through the two-dimensional convolution layer, the sensing network and the fusion network to obtain fusion training features; the fusion training features are then input one by one into the classifier 311 and the bounding box regressor 312 in fig. 13 to obtain the classification training value output by the classifier 311 and the bounding box training value output by the bounding box regressor 312.
Step S930: and calculating according to the classification training value and the classification label to obtain a first loss value, and calculating according to the boundary frame training value and the boundary frame label to obtain a confidence loss value and a positioning loss value.
In one embodiment, the data sets may suffer from class imbalance, i.e. some classes have fewer training samples and other classes have more. Therefore, during training, in order to remove the correlation between image features, the sample weight learning module learns dimension-independent sample weights by means of a random Fourier transform to adjust the training process of the model. The weights may be class weights, with lower-frequency classes being given higher weights so that more attention is paid to these classes. Referring to fig. 13, when the first loss value is calculated from the classification training value and the classification label, a random Fourier transform is also performed on the fusion training features to obtain a random Fourier feature map, which maps the fusion training features from the original space to a high-dimensional feature space. The random Fourier feature map is then input into the sample weight learning module 313 to calculate the sample weights, and the first loss value is multiplied by the sample weights to update it. The first loss value thus contains information related to the sample weights, and when optimizing the first loss value the target detection model adjusts its parameters to minimize the corresponding loss function. Through sample weighting, the target detection model pays more attention to samples with higher weights, thereby better handling unbalanced data sets.
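The application does not spell out the random Fourier transform used here, so the following sketch only shows one common construction of a random Fourier feature map, z(x) = sqrt(2/D)·cos(Wx + b) with Gaussian projections W and uniform phases b; how the sample weight learning module turns this map into weights is left abstract.

```python
import math
import torch

def random_fourier_features(x: torch.Tensor, num_features: int = 256) -> torch.Tensor:
    """x: (batch, dim) flattened fusion training features; returns (batch, num_features)."""
    dim = x.shape[1]
    w = torch.randn(dim, num_features)            # random projection directions (Gaussian)
    b = torch.rand(num_features) * 2 * math.pi    # random phases in [0, 2*pi)
    return (2.0 / num_features) ** 0.5 * torch.cos(x @ w + b)

# The sample weights produced from this map are then multiplied into the first loss value:
# weighted_first_loss = sample_weight * first_loss
```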
In one embodiment, the confidence loss value and the location loss value are calculated from the bounding box training values and the bounding box labels. In the bounding box prediction process, the present embodiment uses the confidence loss value and the location loss value to evaluate the performance of the target detection model. Firstly, a confidence loss value is used for measuring the prediction accuracy of a target detection model on the existence of a target in a training sample, and a confidence score is used for indicating whether the target exists in the frame. The location loss value is then used to measure the accuracy of the object detection model to the bounding box position prediction, which includes coordinate information of the bounding box (e.g., the center coordinates, width, and height of the bounding box). By combining the confidence loss value and the positioning loss, the regression result of the boundary frame is more accurate, the target detection model is helped to more accurately position the target object, and the accuracy of the detection task is improved.
In one embodiment, to solve the problem of imbalance between positive and negative samples, embodiments of the present application use a dynamic allocation strategy to select topK training samples as positive samples and the rest are negative samples. This embodiment utilizes positive sample alignment scores to help evaluate the accuracy and recall of the target detection model on positive sample classification, with higher positive sample alignment scores indicating better performance of the model in identifying and predicting positive samples.
The positive sample alignment score $t$ is calculated as:

$$t = s^{\alpha} \times \mathrm{IoU}^{\beta}$$

where $s$ is the predicted probability that the classification training value equals the true value, $\mathrm{IoU}$ is the complete intersection over union between the bounding box training value and the bounding box label, and $\alpha$ and $\beta$ are hyper-parameters adjusted during training. The complete intersection over union is calculated as follows: the intersection area of the bounding box training value and the bounding box label is first divided by the union area of the two boxes, the base-2 logarithm of the quotient is then taken, and the logarithm is multiplied by its sign to obtain the complete intersection over union.
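A short sketch of the dynamic allocation strategy and the alignment score is given below; the default values of alpha, beta and K are assumptions (they are hyper-parameters adjusted during training).

```python
import torch

def assign_positives(cls_score: torch.Tensor, iou: torch.Tensor,
                     alpha: float = 1.0, beta: float = 6.0, k: int = 10) -> torch.Tensor:
    """cls_score: (num_candidates,) predicted probability of the true class;
    iou: (num_candidates,) intersection over union with the ground-truth box.
    Returns a boolean mask marking the top-K candidates as positive samples."""
    t = cls_score.pow(alpha) * iou.pow(beta)               # positive sample alignment score
    topk = torch.topk(t, k=min(k, t.numel())).indices
    mask = torch.zeros_like(t, dtype=torch.bool)
    mask[topk] = True                                      # positives; the rest are negatives
    return mask
```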
Step S940: and obtaining a total loss value according to the first loss value, the confidence loss value and the positioning loss value.
In one embodiment, the first loss value $L_{VFL}$ is expressed as:

$$L_{VFL}(p,q) = \begin{cases} -q\big(q\log p + (1-q)\log(1-p)\big), & q > 0 \\ -p^{\gamma}\log(1-p), & q = 0 \end{cases}$$

where $p$ is the classification training value and $q$ is determined by the bounding box training value: when the training sample is a positive sample, $q$ is the complete intersection over union; when the training sample is a negative sample, $q$ is zero; $\gamma$ is a training parameter.
The confidence loss value $L_{DFL}$ is expressed as:

$$L_{DFL}(S_i, S_{i+1}) = -\big((y_{i+1} - y)\log S_i + (y - y_i)\log S_{i+1}\big)$$

where $y$ denotes the bounding box label, $y_i$ and $y_{i+1}$ are the discrete values immediately to the left and right of $y$, and $S_i$, $S_{i+1}$ are the corresponding predicted probabilities in the bounding box training value.
The positioning loss value L CIoU is expressed as:
Wherein ρ²(b, b_gt) is the square of the Euclidean distance between the center point b of the bounding box training value and the center point b_gt of the bounding box label, c is the diagonal length of the minimum enclosing rectangle of the bounding box label and the bounding box training value, v represents the value associated with the bounding box aspect ratio, w and h represent the width and height of the bounding box training value, respectively, and w_gt and h_gt represent the width and height of the bounding box label, respectively.
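For illustration, the sketch below assumes the standard Complete-IoU loss, L_CIoU = 1 − IoU + ρ²(b, b_gt)/c² + αv, which matches the quantities defined above; the weighting factor of the aspect-ratio term and the (x1, y1, x2, y2) box format are assumptions of this example.

```python
import math
import torch

def ciou_loss(pred, gt, eps=1e-7):
    """Assumed Complete-IoU positioning loss for boxes given as (x1, y1, x2, y2), shape (N, 4)."""
    # Intersection-over-union
    lt = torch.max(pred[:, :2], gt[:, :2])
    rb = torch.min(pred[:, 2:], gt[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)
    # Squared centre distance rho^2 and squared diagonal c^2 of the minimum enclosing rectangle
    rho2 = (((pred[:, :2] + pred[:, 2:]) - (gt[:, :2] + gt[:, 2:])) ** 2).sum(dim=1) / 4
    enclose_wh = torch.max(pred[:, 2:], gt[:, 2:]) - torch.min(pred[:, :2], gt[:, :2])
    c2 = (enclose_wh ** 2).sum(dim=1) + eps
    # Aspect-ratio consistency term v and its weight
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_g, h_g = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_g / (h_g + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return (1 - iou + rho2 / c2 + alpha * v).mean()
```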
The total loss value L_tot is expressed as:
Where L represents the number of targets detected, λ_cls represents the weight of the first loss value L_VFL, λ_coord represents the weight of the confidence loss value L_DFL, and λ_bbox represents the weight of the positioning loss value L_CIoU.
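The exact expression is given as a figure; a plausible reading, consistent with the weights defined above, is a weighted sum of the three terms normalised by the number of detected targets. The sketch below makes that assumption explicit; the normalisation and the default weight values are not taken from the specification.

```python
def total_loss(l_vfl, l_dfl, l_ciou, num_targets,
               lambda_cls=1.0, lambda_coord=1.0, lambda_bbox=1.0):
    """Assumed combination: (lambda_cls*L_VFL + lambda_coord*L_DFL + lambda_bbox*L_CIoU) / L."""
    return (lambda_cls * l_vfl + lambda_coord * l_dfl + lambda_bbox * l_ciou) / max(num_targets, 1)
```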
Step S950: and adjusting the model weight of the target detection model according to the total loss value until the iteration termination condition is reached, so as to obtain the trained target detection model.
In one embodiment, the iteration termination conditions here include: 1) the number of iterations reaches the preset iteration number: a preset iteration number is set for the operation of the target detection model, and training is stopped when this preset iteration number is reached; 2) the loss value reaches a loss threshold: when the loss value falls to or approaches a preset loss threshold, the target detection model is considered to have converged and training is stopped; 3) the target detection model reaches preset performance parameters. The present embodiment does not limit the iteration termination condition.
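For illustration only, the sketch below combines the three termination conditions into a single check; the threshold values and the way the performance condition is evaluated are hypothetical.

```python
def should_stop(epoch, loss_value, performance_ok, max_epochs=300, loss_threshold=0.05):
    """Hypothetical iteration termination check for the training loop."""
    if epoch >= max_epochs:            # 1) preset iteration number reached
        return True
    if loss_value <= loss_threshold:   # 2) loss value reaches the loss threshold
        return True
    if performance_ok:                 # 3) preset performance parameters reached
        return True
    return False
```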
In one embodiment, the preset performance parameter may be that the number of floating point operations for target detection on a single picture is less than or equal to 294 GFLOPs and the running time is less than or equal to 70 ms. In this embodiment, the mean average precision (mAP) is used as the optimization index, which is expressed as:
Wherein K is the number of categories of the classification labels in the sample labels, and AP represents the accuracy of a single category, namely the area enclosed by the interpolated precision-recall (PR) curve and the X axis. r_1, r_2, …, r_{n-1} are the recall values, arranged in ascending order, corresponding to the first interpolation positions p_interp(r_{i+1}) of the precision interpolation segments, and n represents the number of interpolation points.
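The mAP expression is reproduced as a figure; the sketch below assumes the usual definition, in which the per-category AP is the area under the right-interpolated precision-recall curve and mAP averages the APs over the K categories. Function names are illustrative.

```python
import numpy as np

def average_precision(precisions, recalls):
    """AP of one category: area between the interpolated PR curve and the recall (X) axis."""
    r = np.concatenate(([0.0], np.asarray(recalls), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precisions), [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]          # p_interp(r) = max precision at recall >= r
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))    # sum of (r_{i+1} - r_i) * p_interp(r_{i+1})

def mean_average_precision(per_class_pr):
    """per_class_pr: list of (precisions, recalls) pairs, one per category; mAP is their mean."""
    return float(np.mean([average_precision(p, r) for p, r in per_class_pr]))
```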
In one embodiment, the training flow of the target detection method is summarized as follows: since the labels of the data sets used for similar object detection tasks are semantically related within the domain knowledge, this embodiment first constructs a unified label graph according to the domain knowledge, and then merges multiple data sets into a hybrid data set within this label space. Next, a two-dimensional convolution layer is used to obtain convolution network parameters that are isomorphic across the different data sets, and heterogeneous batch normalization network parameters, namely the sensing characteristics, are obtained by cascading a plurality of data set sensing units. Then, a multi-data-set collaborative optimization process is carried out by using the fusion network and the detection network to obtain the tuned target detection model. A test set divided from the hybrid data set is used to further test whether the accuracy and performance of the target detection model reach the preset standards; if not, the above steps are repeated.
According to the technical scheme provided by the embodiment of the application, the image to be detected is acquired and is input into the sensing network to obtain the sensing characteristics of each data set sensing unit, then the sensing characteristics and the output characteristics of the corresponding up-sampling unit are subjected to characteristic fusion to obtain the fusion characteristics corresponding to each sensing characteristic, finally the fusion characteristics are respectively input into the detection heads to obtain the corresponding detection results, and the prediction data of the image to be detected is obtained according to the preset number of detection results. According to the embodiment of the application, at least part of the training process of the target detection model is obtained by data collaborative training of a plurality of data sets, the label difference bottleneck in cross-data set fusion is solved by utilizing the label tree, the data set sensing unit comprises sensing layers which are consistent with the data sets in number and are used for generating sensing characteristics, and different sensing layers are used for extracting characteristic information of different data sets, so that the generalization capability of the target detection model can be improved by comprehensively utilizing the information of different data sets, and better detection performance can be obtained on a plurality of target data sets.
The embodiment of the invention also provides a training method of the target detection model. The target detection model comprises a perception network, a fusion network and a detection network, and the method comprises the following steps: acquiring original data of a plurality of data sets and original labels to generate a label tree, and acquiring training samples and sample labels according to the label tree; the sample tag includes: classification labels and bounding box labels; inputting the training samples into a perception network and a fusion network for data processing to obtain fusion training characteristics, and inputting the fusion training characteristics into a detection network to obtain a classification training value and a bounding box training value; calculating to obtain a first loss value according to the classification training value and the classification label, and calculating to obtain a confidence loss value and a positioning loss value according to the boundary frame training value and the boundary frame label; obtaining a total loss value according to the first loss value, the confidence loss value and the positioning loss value; and adjusting the model weight of the target detection model according to the total loss value until the iteration termination condition is reached, so as to obtain the trained target detection model.
In one embodiment, the original labels include an original classification label and an original bounding box label; obtaining the original data and the original labels of a plurality of data sets to generate a label tree, and obtaining training samples and sample labels according to the label tree, wherein the method comprises the following steps: classifying the original data to obtain a classification result of a preset level, and constructing a label tree according to the classification result, wherein the label tree comprises father-son nodes corresponding to the preset level, the father-son nodes comprise father nodes and son nodes, and the attribute of the son nodes is a subset of the attribute of the father nodes; selecting a target node from the tag tree to generate a training sample; obtaining a node path from a target node corresponding to the training sample to a root node of a label tree, and calculating to obtain a classification label of the training sample according to original classification labels of all nodes on the node path; and taking the original bounding box label of the target node as a bounding box label of the training sample.
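For illustration only, the sketch below shows one way the label tree and the path-based classification label could be represented; the multi-hot encoding of all nodes on the path from the target node to the root is an assumption of this example, not a requirement of the embodiment above.

```python
class LabelNode:
    """Node of the unified label tree; a child's attributes are a subset of its parent's."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def path_to_root(node):
    """Node path from a target node up to the root of the label tree."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path

def classification_label(target_node, label_index):
    """Hypothetical multi-hot classification label built from all nodes on the node path."""
    label = [0.0] * len(label_index)
    for node in path_to_root(target_node):
        label[label_index[node.name]] = 1.0
    return label

# Minimal usage example: a 'sedan' sample is also positive for 'car' and 'vehicle'.
root = LabelNode("vehicle")
car = LabelNode("car", root)
sedan = LabelNode("sedan", car)
print(classification_label(sedan, {"vehicle": 0, "car": 1, "sedan": 2}))   # [1.0, 1.0, 1.0]
```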
The specific implementation manner of the training method of the target detection model in this embodiment is substantially identical to the specific implementation manner of the target detection method described above, and will not be described herein.
The embodiment of the present invention further provides a target detection apparatus, which may implement the target detection method, and is implemented by a target detection model, where the target detection model includes a sensing network, a fusion network, and a detection network, the sensing network includes a preset number of data set sensing units connected in sequence, the fusion network includes a preset number of up-sampling units corresponding to the data set sensing units, the detection network includes a preset number of detection heads, and different data set sensing units are used to sense image features of different data sets, and referring to fig. 13, the apparatus includes:
image acquisition module 1310: for acquiring an image to be detected.
Perception module 1320: the method is used for inputting the image to be detected into the sensing network, and sensing characteristics of each data set sensing unit are obtained in sequence.
Fusion module 1330: the method is used for obtaining the up-sampling output characteristics of the perception characteristics by utilizing the up-sampling unit, and carrying out characteristic fusion on the perception characteristics and the up-sampling output characteristics to obtain fusion characteristics corresponding to each perception characteristic.
Detection module 1340: and the fusion characteristics are respectively input into the detection heads to obtain corresponding detection results, and the prediction data of the image to be detected is obtained according to the preset number of detection results.
The specific implementation manner of the target detection apparatus in this embodiment is substantially identical to the specific implementation manner of the target detection method described above, and will not be described herein.
The embodiment of the invention also provides electronic equipment, which comprises:
At least one memory;
at least one processor;
At least one program;
The program is stored in the memory, and the processor executes the at least one program to implement the target detection method described above. The electronic device may be any intelligent terminal, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a vehicle-mounted computer, and the like.
Referring to fig. 14, fig. 14 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:
The processor 1401 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs so as to implement the technical solution provided by the embodiments of the present invention;
The memory 1402 may be implemented in the form of a ROM (Read Only Memory), a static storage device, a dynamic storage device, or a RAM (Random Access Memory). The memory 1402 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present specification are implemented in software or firmware, the relevant program code is stored in the memory 1402 and invoked by the processor 1401 to perform the target detection method of the embodiments of the present invention;
An input/output interface 1403 for implementing information input and output;
the communication interface 1404 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.), or may implement communication in a wireless manner (e.g. mobile network, WIFI, bluetooth, etc.); and
The bus 1405 is used for transferring information between the components of the device (e.g., the processor 1401, the memory 1402, the input/output interface 1403, and the communication interface 1404);
wherein processor 1401, memory 1402, input/output interface 1403 and communication interface 1404 enable communication connections between each other within the device via bus 1405.
The embodiment of the application also provides a storage medium, which is a computer readable storage medium, and the storage medium stores a computer program, and the computer program realizes the target detection method when being executed by a processor.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
According to the target detection method, the target detection device, the electronic equipment and the storage medium, the image to be detected is acquired and is input into the sensing network to obtain the sensing characteristics of each data set sensing unit, then the sensing characteristics and the output characteristics of the corresponding up-sampling unit are subjected to characteristic fusion to obtain fusion characteristics corresponding to each sensing characteristic, finally the fusion characteristics are respectively input into the detection head to obtain corresponding detection results, and then the prediction data of the image to be detected are obtained according to the preset number of detection results. According to the embodiment of the application, at least part of the training process of the target detection model is obtained by data collaborative training of a plurality of data sets, the data set sensing unit comprises sensing layers which are consistent with the data sets in number and are used for generating sensing characteristics, and different sensing layers are used for extracting characteristic information of different data sets, so that the generalization capability of the target detection model can be improved by comprehensively utilizing the information of different data sets, and better detection performance can be obtained on a plurality of target data sets.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by persons skilled in the art that the embodiments of the application are not limited by the illustrations, and that more or fewer steps than those shown may be included, or certain steps may be combined, or different steps may be included.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the objects before and after it are in an "or" relationship. "At least one of" or similar expressions refer to any combination of the listed items, including any combination of single items or plural items. For example, "at least one of a, b or c" may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be single or plural.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method of the various embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory RAM), a magnetic disk, or an optical disk, or other various media capable of storing a program.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and are not thereby limiting the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.