WO2023273515A1

WO2023273515A1 - Target detection method, apparatus, electronic device and storage medium

Info

Publication number: WO2023273515A1
Application number: PCT/CN2022/086919
Authority: WO
Inventors: 郝瑞韬
Original assignee: 北京旷视科技有限公司; 北京迈格威科技有限公司
Priority date: 2021-06-28
Filing date: 2022-04-14
Publication date: 2023-01-05
Also published as: CN113591838A; CN113591838B

Abstract

The present disclosure relates to the technical field of image processing, and provided in embodiments thereof are a target detection method, an apparatus, an electronic device, and a storage medium, which can reduce the size of an image without needing to use a means of coding and decoding and without affecting the detection performance on a target object in a size-reduced image. The target detection method may comprise: performing color coding on an input original color image, and obtaining a plurality of YUV color space images; performing pixel region division on a target image among the plurality of images, and obtaining a plurality of pixel regions corresponding to the target image; performing discrete cosine transform on each pixel region, and obtaining transform features of the target image; selecting a target channel of a low frequency region among the transform features; and performing object detection according to frequency domain feature information of the target channel in the target image.

Description

Target detection method, device, electronic device and storage medium

Cross References to Related Applications

This disclosure claims priority to Chinese patent application No. 202110721797.4, titled "Target detection method, device, electronic device and storage medium", filed with the State Intellectual Property Office of China on June 28, 2021, the entire contents of which are incorporated herein by reference.

Technical field

The present disclosure relates to the technical field of image processing, and in particular to a target detection method, device, electronic equipment and storage medium.

Background

Target detection is an important task in the field of computer vision. Its purpose is to accurately locate a specific target object in an image through computer image processing. To achieve this, on the one hand the target must be determined, for example by obtaining its contour curve or parameters such as its shape and size; on the other hand, the specific position of the target in the image must be located. As the basis for subsequent segmentation, tracking, and recognition tasks, target detection is also an important part of image processing in computer vision, providing a better data basis for those subsequent stages.

In practical applications, a video stream is usually input through a camera. The changing picture presented by a video stream is produced by a sequence of digital image frames, each of which contains multiple pixels arranged in a matrix, and the resolution of an image is given by the number of pixels along two mutually perpendicular directions. The higher the resolution, the more data the image carries. An original video stream therefore usually contains a large amount of data, which is not conducive to processing, transmission, and storage, and encoding and decoding the video stream greatly limits algorithm performance. To cope with limited computing power, large-resolution images generally need to be shrunk. However, directly reducing the resolution of an image consumes computation and processing time, and it also easily loses the information of small objects in the image, resulting in poor detection performance for smaller objects in the reduced image.

Summary of the invention

Embodiments of the present disclosure provide a target detection method, device, electronic device and storage medium, which can reduce an image without encoding and decoding and without affecting the detection performance for targets in the reduced image.

An aspect of the embodiments of the present disclosure provides a target detection method, which may include: performing color coding on an input original color image to obtain multiple images in the YUV color space; dividing a target image among the multiple images into pixel areas to obtain multiple pixel areas corresponding to the target image; performing a discrete cosine transform on each pixel area to obtain the transformation features of the target image; selecting a target channel of a target area from the transformation features; and performing object detection according to the frequency-domain feature information of the target channel in the target image.

Optionally, selecting the target channel of the target area from the transformation features may include: performing detection with a channel selection network according to the transformation features to obtain the target channel, where the channel selection network is a network model trained in advance on the transformation features of sample images, and the sample images and the target image are images of the same encoding format.

Optionally, the channel selection network may include a pooling layer, a convolution processing layer, an activation function layer, and a sampling layer. Performing detection with the channel selection network according to the transformation features to obtain the target channel may include: using the pooling layer to perform global average pooling on the feature values of each channel in the transformation features to obtain a pooled feature; using the convolution processing layer to perform convolution processing on the pooled feature to obtain a convolution feature; using the activation function layer to process the convolution feature to obtain a probability feature; and using the sampling layer to sample the channels corresponding to the target image according to the probability feature to obtain the target channel.

Optionally, the activation function layer may be a sigmoid function layer.

Optionally, the sampling layer may be a Gumbel-Softmax sampling layer.

Optionally, dividing the target image among the multiple images into pixel areas may include: each divided pixel area includes N*N pixel units, where N is a positive integer.

Optionally, each pixel area may include 8*8 pixel units.

Optionally, performing color coding on the input original color image to obtain the multiple images in the YUV color space may further include: subtracting a preset value from the pixel values of the pixels of the multiple images in the YUV color space.

Optionally, performing object detection according to the frequency-domain feature information of the target channel in the target image may include: inputting the frequency-domain feature information of the target channel into a preset downsampling layer of a pre-trained frequency-domain detection network for processing, to obtain information about the target object.

Optionally, performing object detection according to the frequency-domain feature information of the target channel in the target image may include: splicing the frequency-domain feature information of the target channel into the original frequency-domain detection network at its four-times downsampling stage as input, so that detection on the frequency-domain feature information takes as input an original image enlarged by four times.

Optionally, the multiple images may include: a Y component image, a U component image, and a V component image; the target image may include: a Y component image.

Optionally, the target image may also include: a U component image, a V component image; selecting the target channel of the target area from the transformation feature may include: selecting a first preset number of low-frequency channels from the transformation feature of the Y component image as Y Component low-frequency channel; select the second preset number of low-frequency channels as the U component low-frequency channel from the transformation feature of the U component image; select the third preset number of low-frequency channels as the V component low-frequency channel from the transformation feature of the V component image; Wherein, the first preset number is larger than the second preset number and larger than the third preset number.

Yet another aspect of the embodiments of the present disclosure provides an object detection device, which may include: an encoding module configured to color-encode an input original color image to obtain multiple images in the YUV color space; an area division module configured to divide a target image among the multiple images into pixel areas to obtain multiple pixel areas corresponding to the target image; a transformation module configured to perform a discrete cosine transform on each pixel area to obtain the transformation features of the target image; a feature selection module configured to select a target channel of a target area from the transformation features; and a detection module configured to perform object detection according to the frequency-domain feature information of the target channel in the target image.

Optionally, the feature selection module is configured to: perform detection with a channel selection network according to the transformation features to obtain the target channel, where the channel selection network is a network model trained in advance on the transformation features of sample images, and the sample images and the target image are images of the same encoding format.

Another aspect of the embodiments of the present disclosure provides an electronic device, which may include: a memory and a processor, where the memory stores a computer program executable by the processor, and when the processor executes the computer program, any one of the target detection methods described above is implemented.

Another aspect of the embodiments of the present disclosure provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is read and executed, any one of the object detection methods described above is implemented.

An embodiment of the present disclosure provides a target detection method, device, electronic device, and storage medium. The target detection method may include: performing color coding on an input original color image to obtain multiple images in the YUV color space; dividing a target image among the multiple images into pixel areas to obtain multiple pixel areas corresponding to the target image; performing a discrete cosine transform on each pixel area to obtain the transformation features of the target image; selecting a target channel of a target area from the transformation features; and performing object detection according to the frequency-domain feature information of the target channel in the target image. By obtaining the transformation features of the target image and selectively retaining them, the more valuable and informative features of the target image can be kept. This effectively improves the accuracy of object detection in the image without increasing computing time or resource usage: the image does not need to be shrunk at the cost of losing the information of small objects, and no encoding and decoding, with its large amount of calculation, is required.

Description of drawings

In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the accompanying drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present disclosure and should therefore not be regarded as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without creative work.

FIG. 1 is a flowchart of a target detection method provided by some embodiments of the present disclosure;

FIG. 2 is a schematic diagram of a target channel selection path within a pixel area of a target image in a target detection method according to some embodiments of the present disclosure;

FIG. 3 is a flowchart of an implementation of step S104 in a target detection method provided by some embodiments of the present disclosure;

FIG. 4 is a flowchart of an implementation of step S1041 in a target detection method provided by some embodiments of the present disclosure;

FIG. 5 is a flowchart of a target detection method provided by other embodiments of the present disclosure;

FIG. 6 is a flowchart of another implementation of step S104 in a target detection method provided by some embodiments of the present disclosure;

FIG. 7 is a flowchart of an implementation of step S105 in a target detection method provided by some embodiments of the present disclosure;

FIG. 8 is a schematic diagram of an object detection device 100 provided by some embodiments of the present disclosure;

FIG. 9 is a schematic diagram of an electronic device 200 provided by some embodiments of the present disclosure.

Reference numerals: 100-target detection device; 110-encoding module; 120-area division module; 130-transformation module; 140-feature selection module; 150-detection module; 200-electronic device; 201-memory; 202-processor.

Detailed description

The following will clearly and completely describe the technical solutions in the embodiments of the present disclosure with reference to the drawings in the embodiments of the present disclosure.

In the description of the present disclosure, it should be noted that the orientation or positional relationship indicated by terms such as "upper", "lower", "inner", and "outer" is based on the orientation or positional relationship shown in the drawings, or on the orientation or positional relationship in which the product of the application is customarily placed when used. These terms are used only for the convenience of describing the present disclosure and simplifying the description, and do not indicate or imply that the referred device or element must have a specific orientation or be constructed and operated in a specific orientation; therefore, they should not be construed as limiting the present disclosure. In addition, the terms "first", "second", etc. are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

In recent years, artificial intelligence-based technologies such as computer vision, deep learning, machine learning, image processing, and image recognition have made important progress. Artificial intelligence (AI) is an emerging science and technology that studies and develops theories, methods, technologies, and application systems for simulating and extending human intelligence. It is a comprehensive discipline that involves many technologies such as chips, big data, cloud computing, the Internet of Things, distributed storage, deep learning, machine learning, and neural networks. Computer vision, an important branch of artificial intelligence, specifically aims to let machines recognize the world. Computer vision technology usually includes face recognition, liveness detection, fingerprint recognition and anti-counterfeiting verification, biometric recognition, face detection, pedestrian detection, target detection, pedestrian recognition, image processing, image recognition, image semantic understanding, image retrieval, text recognition, video processing, video content recognition, behavior recognition, 3D reconstruction, virtual reality, augmented reality, simultaneous localization and mapping (SLAM), computational photography, and robot navigation and positioning.
With the research and progress of artificial intelligence technology, it has been applied in many fields, such as security, urban management, traffic management, building management, park management, face access, face attendance, logistics management, warehouse management, robots, smart marketing, computational photography, mobile imaging, cloud services, smart home, wearable devices, unmanned driving, automatic driving, smart medical care, face payment, face unlock, fingerprint unlock, witness verification, smart screen, smart TV, cameras, mobile Internet, webcasting, beauty, cosmetics, medical beauty, and intelligent temperature measurement.

Target detection is an important task in the field of computer vision. Its goal is to locate a target object in an image. This requires two things: on the one hand, the object to be located must be identified; on the other hand, its exact position in the image must be determined. As a basic part of downstream tasks such as segmentation, tracking, and recognition, target detection has become a fundamental task and challenge in the field of computer vision.

In actual applications of target detection, the image input usually comes from a camera, so the input is usually a video stream. A video stream contains a large amount of image data, and decoding it consumes a lot of time and system computing capacity. Given the limited computing power, large-resolution images often need to be reduced. However, directly reducing an image wastes processing time and also causes loss of image data, so that the information of small objects in the image is lost and they become difficult to identify, which in turn leads to poor detection and recognition performance for small objects in the image.

Based on this, an embodiment of the present disclosure provides a target detection method. FIG. 1 is a flowchart of a target detection method provided in an embodiment of the present disclosure. As shown in FIG. 1, the method may include:

S101. Perform color coding on the input original color image to obtain multiple images in YUV color space.

First, the input original color image can be color-coded: the input RGB original image (for example, a 1080*1920*3 RGB original image) is converted to another color space and decomposed into images of the individual components of that color space, for example three (Y, Cr, Cb) images of the YUV color space. In the YUV color space, "Y" represents the brightness (luminance or luma), that is, the grayscale value, while "U" and "V" represent the chroma (chrominance or chroma), which describes the hue and saturation of the image and specifies the color of a pixel.
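
As an illustration of the color coding in S101, the following Python sketch converts RGB pixels into separate Y, Cb, and Cr planes. The disclosure does not specify conversion coefficients; the full-range BT.601 coefficients used by JPEG/JFIF are assumed here, and the function names `rgb_to_ycbcr` and `split_planes` are hypothetical.

```python
def rgb_to_ycbcr(pixel):
    """Convert one (R, G, B) pixel with 0-255 values to (Y, Cb, Cr).

    Assumption: full-range BT.601 coefficients (as in JPEG/JFIF).
    """
    r, g, b = pixel
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def split_planes(rgb_image):
    """Split an H x W image of RGB tuples into three H x W planes."""
    y_plane, cb_plane, cr_plane = [], [], []
    for row in rgb_image:
        converted = [rgb_to_ycbcr(p) for p in row]
        y_plane.append([c[0] for c in converted])
        cb_plane.append([c[1] for c in converted])
        cr_plane.append([c[2] for c in converted])
    return y_plane, cb_plane, cr_plane
```

With these coefficients a gray pixel (R = G = B) maps to Y equal to the gray level and to Cb = Cr = 128, so all of its information sits in the luminance plane.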

S102. Divide the target image in the plurality of images into pixel regions to obtain a plurality of pixel regions corresponding to the target image.

The target image among the multiple images may be divided into pixel areas to obtain multiple pixel areas corresponding to the target image. It should be noted that the division may be applied to a preselected target image among the multiple images, dividing that target image directly into multiple pixel areas, or each of the multiple images may be divided into multiple pixel areas as a whole; in the latter case, the target image can be understood as the entire image.

Usually, the pixel areas corresponding to the target image should all be the same size; that is, if the pixel area is set to an 8*8 block, then every pixel area is an 8*8 block.
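
The division in S102 can be sketched as follows, assuming a plane stored as a list of rows whose height and width are multiples of the block size (a 1080*1920 plane, for example, yields 135*240 blocks of 8*8). The function name `split_into_blocks` is hypothetical, and padding for other sizes is not covered.

```python
def split_into_blocks(plane, n=8):
    """Split an H x W plane into a grid of n x n blocks.

    Returns blocks[block_row][block_col], each an n x n list of lists.
    Assumes H and W are multiples of n; padding is outside this sketch.
    """
    height, width = len(plane), len(plane[0])
    if height % n or width % n:
        raise ValueError("plane size must be a multiple of the block size")
    return [
        [
            [row[c:c + n] for row in plane[r:r + n]]
            for c in range(0, width, n)
        ]
        for r in range(0, height, n)
    ]
```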

S103. Perform discrete cosine transformation on each pixel region to obtain transformation features of the target image.

A discrete cosine transform can be performed on each pixel area. Doing so yields the transformation feature corresponding to each pixel area, and together these form the transformation features of the target image.

In the transformation features of a pixel area, the coefficients closer to the upper left corner usually have larger amplitudes and lower frequencies, and the coefficients closer to the lower right corner have smaller amplitudes and higher frequencies. In this frequency coefficient matrix, therefore, the upper left is the low-frequency region and the lower right is the high-frequency region, and most of the feature information in the image is concentrated in the low-frequency region at the upper left.
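
To illustrate S103, the sketch below computes a two-dimensional DCT-II of one block in plain Python. The disclosure does not state which DCT normalization is used, so the orthonormal form is an assumption, and `dct_2d` is a hypothetical name.

```python
import math


def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square N x N block (list of lists).

    coeffs[0][0] is the lowest-frequency (upper-left) coefficient;
    frequency grows toward coeffs[N-1][N-1] at the lower right.
    """
    n = len(block)

    def alpha(k):
        # Normalization factor of the orthonormal DCT-II.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = alpha(u) * alpha(v) * total
    return coeffs
```

For a constant block, all of the energy lands in the upper-left coefficient and every other coefficient is zero up to rounding, which matches the low-frequency concentration described above. One reading consistent with the 1*1*64 pooled feature mentioned for S10411 is that each of the 64 coefficient positions of an 8*8 block, gathered across all blocks, forms one channel of the transformation features.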

S104. Select the target channel of the target area from the transformation features.

Therefore, a reconstruction operation can be performed on the transformation features, and the target channel of the target area can be selected from them. FIG. 2 is a schematic diagram of the target channel selection path within a pixel area of the target image in the target detection method of some embodiments of the present disclosure. The target area can be, for example, a low-frequency area, and the target channels in the low-frequency area can be selected as a predetermined number of channels in the order shown by the arrow in FIG. 2, or selected based on human prior knowledge. Moreover, still taking the decomposition into three (Y, Cr, Cb) images of the YUV color space as an example, the allocation of target channels among the three images also needs to be chosen. Because the human eye is much more sensitive to brightness (Y) than to chroma (Cr, Cb), Y contributes more to the image feature information than Cr and Cb do. When selecting target channels in the pixel areas of the three images, a larger number of target channels is therefore selected for Y and a smaller number for Cr and Cb. Selecting a target channel means that the frequency-domain feature information in that channel is retained.
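
A minimal sketch of a fixed low-frequency selection follows. It assumes that the path in FIG. 2 resembles the zigzag scan used in JPEG, which the text does not confirm, and the per-component channel counts in `CHANNEL_BUDGET` are hypothetical values chosen only so that Y receives more channels than Cr and Cb.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zigzag order.

    Assumption: JPEG-style zigzag, starting at the upper-left
    (lowest-frequency) coefficient.
    """
    def key(pos):
        r, c = pos
        d = r + c  # index of the anti-diagonal
        # Odd diagonals are walked with the row increasing,
        # even diagonals with the row decreasing.
        return (d, r if d % 2 else -r)

    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def select_low_frequency(coeffs, count):
    """Keep the first `count` coefficients of a block in zigzag order."""
    order = zigzag_order(len(coeffs))[:count]
    return [coeffs[r][c] for r, c in order]


# Hypothetical budgets: more low-frequency channels for Y than for Cb and Cr.
CHANNEL_BUDGET = {"Y": 16, "Cb": 4, "Cr": 4}
```

Applying `select_low_frequency` to every block of a component with that component's budget keeps only the retained channels, which is where the reduction in data volume comes from.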

S105. Perform object detection according to the frequency-domain feature information of the target channel in the target image.

Object detection can then be performed according to the frequency-domain feature information of the selected target channel in the target image. Because the target channels of the target image have been selected in an optimized way, the information that reflects the target features in the target image is selectively retained. Performing object detection on this target image can therefore improve the detection performance for small objects in the image. Moreover, since the target image is effectively reduced, the amount of image-processing computation decreases, which shortens the time needed to detect target objects in the image and reduces the use of computing resources.

A target detection method provided by an embodiment of the present disclosure may include: performing color coding on an input original color image to obtain multiple images in the YUV color space; dividing a target image among the multiple images into pixel areas to obtain multiple pixel areas corresponding to the target image; performing a discrete cosine transform on each pixel area to obtain the transformation features of the target image; selecting a target channel of a target area, such as a low-frequency area, from the transformation features; and performing object detection according to the frequency-domain feature information of the target channel in the target image. By obtaining the transformation features of the target image and selectively retaining them, the more valuable and informative features of the target image can be kept. This effectively improves the accuracy of object detection in the image without increasing computing time or resource usage: the image does not need to be shrunk at the cost of losing the information of small objects, and no encoding and decoding, with its large amount of calculation, is required.

In some optional embodiments of the present disclosure, dividing the target image among the multiple images into pixel areas includes: each divided pixel area may include N*N pixel units, where N is a positive integer.

In some optional embodiments of the present disclosure, each pixel area may include 8*8 pixel units.

With this method of area division, each divided pixel area has the same number of pixel units horizontally and vertically, and serves as the basic unit that simplifies the calculation and processing in subsequent steps. Using pixel areas of 8*8 pixel units as the basic unit further helps reduce the complexity of the calculation.

It should be noted that, in the following description, each pixel area is taken to include 8*8 pixel units as an example.

FIG. 3 is a flowchart of an implementation of step S104 in a target detection method provided by an embodiment of the present disclosure. As shown in FIG. 3, S104, selecting a target channel of a target area from the transformation features, may include:

S1041. According to the transformation feature, use a channel selection network for detection to obtain the target channel. The channel selection network is a network model trained in advance according to the transformation feature of the sample image, and the sample image and the target image are images of the same encoding format.

In some optional embodiments of the present disclosure, when step S104 selects the target channel of the target area from the transformation features, adaptive training can be performed in advance so that the selection can be corrected in real time. Optionally, preset images in the same format as the preselected target image can be used as sample images, and the network model is obtained by training on the transformation features of the sample images. When selecting the target channel of the target area from the transformation features of the target image, the pre-trained channel selection network can then be applied to the transformation features of the target image, so that the obtained target channel retains more valuable and informative feature information.

In some optional embodiments of the present disclosure, the channel selection network may include: a pooling layer, a convolution processing layer, an activation function layer, and a sampling layer.

FIG. 4 is a flowchart of an implementation of step S1041 in a target detection method provided by an embodiment of the present disclosure. As shown in FIG. 4, S1041, performing detection with a channel selection network according to the transformation features to obtain the target channel, where the channel selection network is a network model trained in advance on the transformation features of sample images and the sample images and the target image are images of the same encoding format, may include:

S10411. Using a pooling layer, perform global average pooling on the feature values of each channel in the transformed feature to obtain a pooled feature.

The pooling layer can be used to perform global average pooling processing on the transformed features, including performing global average pooling on the feature values of each channel in the transformed features to obtain pooled features, such as 1*1*64 features.

S10412. Use a convolution processing layer to perform convolution processing on the pooled features to obtain convolution features.

The convolution processing layer can be used to perform convolution processing on the pooled features to obtain convolution features.

S10413. Using an activation function layer to process the convolution features to obtain probability features.

The convolution feature can be processed by the activation function layer, and the probability vector of the pooled feature can be obtained as the probability feature. The activation function layer may be a sigmoid function layer.

S10414. Using the sampling layer, sampling the channel corresponding to the target image according to the probability feature to obtain the target channel.

Each number in the probability feature is a probability value from 0 to 1 that indicates the probability that the corresponding channel is retained. The sampling layer can then set the values of some channels in the probability feature to 1, indicating that those channels are retained, and set the values of the remaining channels to 0, indicating that those channels are discarded; the channels set to 1 are determined to be the target channels. The sampling layer can be a Gumbel-Softmax sampling layer.

Since the probability value of the target channel in the sampled probability feature is 1, and the probability value of other channels is 0, the frequency domain feature information of the target channel can be obtained by multiplying the sampled probability feature and the transformation feature.
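
The four steps S10411 to S10414 and the final multiplication can be sketched in plain Python as follows. This is an illustrative sketch, not the trained network of the disclosure: the convolution is assumed to be a 1*1 convolution (equivalent to a linear map on the 1*1*C pooled feature), the sampling draws a hard keep/drop decision per channel by adding Gumbel noise to the two class logits (the discrete forward pass of Gumbel-Softmax, with the relaxation used for training omitted), and all function names, weights, and biases are hypothetical.

```python
import math
import random


def global_average_pool(features):
    """features: list of C channels, each an H x W list of lists -> C means."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in features]


def pointwise_conv(vector, weights, bias):
    """1x1 convolution on a 1 x 1 x C feature, i.e. a C x C linear map."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gumbel_noise(rng):
    # Sample from Gumbel(0, 1); clamp the uniform draw to avoid log(0).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def sample_gate(prob, rng, eps=1e-12):
    """Hard two-class Gumbel sample: 1 keeps the channel, 0 drops it."""
    keep = math.log(prob + eps) + gumbel_noise(rng)
    drop = math.log(1.0 - prob + eps) + gumbel_noise(rng)
    return 1 if keep > drop else 0


def select_channels(features, weights, bias, rng):
    pooled = global_average_pool(features)                  # S10411
    conv = pointwise_conv(pooled, weights, bias)            # S10412
    probs = [sigmoid(z) for z in conv]                      # S10413
    gates = [sample_gate(p, rng) for p in probs]            # S10414
    # Multiplying by the 0/1 gate zeroes out the discarded channels.
    gated = [[[g * v for v in row] for row in ch]
             for g, ch in zip(gates, features)]
    return gates, gated
```

Because the gate is sampled, a channel with a retention probability near 1 is almost always kept and one near 0 is almost always dropped, while channels in between are kept only some of the time.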

The attention mechanism originated from the study of human vision. In cognitive science, because of the bottleneck of information processing, humans selectively focus on a part of all available information while ignoring the rest. This is often called the attention mechanism. Different parts of the human retina have different degrees of information processing ability, that is, acuity, and only the fovea of the retina has the strongest acuity. In order to make rational use of limited visual information processing resources, humans need to select a specific part of the visual area and then focus on it. The attention mechanism has two main aspects: deciding which part of the input needs attention, and allocating limited information processing resources to the important parts.

Based on this, concentrating resources on the points of interest allows them to be used more efficiently. The target channels can therefore be retained selectively, which improves the accuracy of object detection in the target image after it has been reduced.

FIG. 5 is a flowchart of a target detection method provided by other embodiments of the present disclosure. As shown in FIG. 5, step S101, performing color coding on the input original color image to obtain multiple images in the YUV color space, may also include:

S1011. Subtract a preset value from the pixel values of the pixels of the multiple images in the YUV color space.

In some optional embodiments of the present disclosure, step S101, performing color coding on the input original color image to obtain multiple images in YUV color space may also include:

After the input original color image has been color-coded and the input RGB original image decomposed into three (Y, Cr, Cb) images of the YUV color space, 127 can also be subtracted from the pixel value of every pixel of the multiple images in the YUV color space. Subtracting 127 from each pixel value (127 being the preset value in this example) shifts the value range to the left so that it is roughly symmetric about zero, which ensures the symmetry of each 8*8 block.
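
A small sketch of this shift, assuming 8-bit pixel values in the range 0 to 255 and a plane stored as a list of rows; the function name `level_shift` is hypothetical.

```python
def level_shift(plane, offset=127):
    """Subtract a preset offset from every pixel value of a plane."""
    return [[value - offset for value in row] for row in plane]
```

With an offset of 127, the range 0 to 255 becomes -127 to 128, so the values fed to the discrete cosine transform are approximately centered on zero.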

In some optional embodiments of the present disclosure, the multiple images may include: a Y component image, a U component image, and a V component image; the target image may include: a Y component image.

After the input original color image is color-coded, that is, after the input RGB original image is decomposed into three images (Y, Cr, Cb) of the YUV color space, the Y component image represents the gray scale of the color space image, while the U component image and the V component image express its color and saturation. Among the component images, the Y component image contributes more to the visual quality of the color space image than the U component image and the V component image. The target image may therefore include the Y component image, which can satisfy the accuracy required for object detection in the target image after the target image has been reduced.

In some optional embodiments of the present disclosure, the target image may further include: a U component image and a V component image. That is, the target image includes a Y component image, a U component image, and a V component image, which together cover all three components of the color space image. Therefore, after the target image is reduced, even the U component image and the V component image, which contribute relatively little to the visual quality of the image, can be processed and selectively retained, thereby improving the accuracy of object detection in the target image after the reduction processing.

On the premise that the target image includes a Y component image, a U component image, and a V component image, in some optional embodiments of the present disclosure, FIG. 6 is a flowchart of another implementation of step S104 in a target detection method provided by an embodiment of the present disclosure. As shown in FIG. 6, S104, selecting the target channel of the target area from the transformation features, may include:

S1042. Select a first preset number of low-frequency channels from the transformation features of the Y component image as the Y component low-frequency channels.

S1043. Select a second preset number of low-frequency channels from the transformation features of the U component image as the U component low-frequency channels.

S1044. Select a third preset number of low-frequency channels from the transformation features of the V component image as the V component low-frequency channels; wherein the first preset number is greater than the second preset number and greater than the third preset number.

When the target image includes a Y component image, a U component image, and a V component image, step S104, selecting the target channel of the target area from the transformation features, may include selecting low-frequency channels from the transformation features of the Y component image, the U component image, and the V component image as the Y component low-frequency channels, the U component low-frequency channels, and the V component low-frequency channels, respectively. In step S1042, the number of low-frequency channels selected from the transformation features of the Y component image is the first preset number; in step S1043, the number selected from the transformation features of the U component image is the second preset number; and in step S1044, the number selected from the transformation features of the V component image is the third preset number. The first preset number is greater than the second preset number and greater than the third preset number. That is, the selection of low-frequency channels still follows the aforementioned principle that the Y component image contributes more to the visual quality of the color space image: more low-frequency channels are selected from the transformation features of the Y component image than from those of the U component image or the V component image.
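Steps S1042 to S1044 can be illustrated with a short sketch. It assumes that the 64 DCT channels of an 8*8 region are ranked from low to high frequency by the JPEG-style zigzag order, which is one common choice and is not stated in the disclosure. The preset numbers 16, 4, and 4 and the function names are hypothetical; the only constraint taken from the text is that the Y count exceeds the U and V counts.

```python
def zigzag_order(n=8):
    """Return (row, col) DCT coefficient positions from low to high frequency."""
    # Sort by anti-diagonal index u + v; alternate the direction on each diagonal.
    return sorted(
        ((u, v) for u in range(n) for v in range(n)),
        key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else p[1]),
    )

def select_low_frequency(counts, n=8):
    """Map each component name to its first `count` channel indices in zigzag order."""
    order = [u * n + v for u, v in zigzag_order(n)]
    return {name: order[:count] for name, count in counts.items()}

# Hypothetical preset numbers: first (Y) > second (U) and first (Y) > third (V).
counts = {"Y": 16, "U": 4, "V": 4}
assert counts["Y"] > counts["U"] and counts["Y"] > counts["V"]
selected = select_low_frequency(counts)
```

Here channel index `u * 8 + v` denotes the coefficient at row `u`, column `v` of each 8*8 region, so index 0 is the DC term.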

Fig. 7 is a flowchart of an implementation of step S105 in a target detection method provided by some embodiments of the present disclosure. As shown in Fig. 7, S105, performing object detection according to the frequency-domain feature information of the target channel in the target image, may include:

S1051. Input the frequency-domain characteristic information of the target channel into a preset downsampling layer in the pre-trained frequency-domain detection network for processing, and obtain information of the target object.

In some optional embodiments of the present disclosure, an optional way to perform object detection according to the frequency-domain feature information of the target channel in the target image is to input the frequency-domain feature information of the target channel into a preset downsampling layer in the pre-trained frequency-domain detection network for processing, to obtain the information of the target object.

In some optional embodiments of the present disclosure, related mainstream detection networks, such as Faster RCNN and Retinanet, can be used to perform detection on frequency-domain features. One optional approach is to splice the obtained frequency-domain feature information into the original detection network at its fourfold-downsampling stage as input. Detection on frequency-domain feature information can thus directly take an image four times the size of the original input as input, so that more small-object information in the image is retained and detection performance for small objects is better. Images with a large amount of information, such as those from 4K cameras, can be used directly as input without being shrunk in advance.
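The size relationship above can be checked with simple shape arithmetic. The sketch below rests on one possible reading of the text, namely that "four times the size" means four times the pixel count (twice the height and twice the width), and that the regions are 8*8. The input size 512 and the channel count 24 are hypothetical.

```python
def dct_feature_shape(height, width, channels, block=8):
    """Shape of block-DCT features: one spatial position per block."""
    return height // block, width // block, channels

def stride4_shape(height, width):
    """Spatial shape after a detection backbone's fourfold downsampling."""
    return height // 4, width // 4

base_h, base_w = 512, 512                # hypothetical pixel-domain input size
big_h, big_w = 2 * base_h, 2 * base_w    # four times the pixel count
feat = dct_feature_shape(big_h, big_w, channels=24)
```

Under these assumptions the DCT features of the larger image have the same spatial size as the fourfold-downsampled feature map of the smaller pixel-domain input, which is why they can be fed in at that stage.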

FIG. 8 is a schematic diagram of a target detection device provided by an embodiment of the present disclosure. As shown in FIG. 8 , in another aspect of the embodiments of the present disclosure, a target detection device 100 is provided. The target detection device 100 may include:

The encoding module 110 is configured to perform color encoding on the input original color image to obtain multiple images in the YUV color space.

The region division module 120 is configured to divide the pixel region of the target image in the multiple images to obtain multiple pixel regions corresponding to the target image.

The transformation module 130 is configured to perform discrete cosine transformation on each pixel region to obtain transformation features of the target image.

The feature selection module 140 is configured to select a target channel of the target region from the transformed features.

The detection module 150 is configured to perform object detection according to the frequency-domain feature information of the target channel in the target image.
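The work of the region division module 120 and the transformation module 130 can be sketched as follows. This is an illustrative sketch rather than the disclosed implementation: it assumes an orthonormal DCT-II and assumes that coefficients of the same frequency from all regions are gathered into one channel. The function names are hypothetical.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)
    return mat

def block_dct_features(plane, n=8):
    """Divide an H x W plane into n x n regions and apply a 2-D DCT to each.

    Returns an (H/n) x (W/n) x (n*n) array in which coefficient (u, v) of
    every region is stored in channel u * n + v.
    """
    h, w = plane.shape
    if h % n or w % n:
        raise ValueError("plane size must be a multiple of the block size")
    # Region division: shape (H/n, W/n, n, n), one n x n block per position.
    blocks = plane.reshape(h // n, n, w // n, n).transpose(0, 2, 1, 3)
    d = dct_matrix(n)
    coeffs = d @ blocks @ d.T  # 2-D DCT of every block via broadcast matmul
    return coeffs.reshape(h // n, w // n, n * n)

plane = np.full((16, 24), 3.0)  # constant stand-in for a level-shifted Y plane
features = block_dct_features(plane)
```

For a constant plane, only channel 0 (the DC term) is non-zero, which is a quick sanity check on the transform.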

In some optional implementations of the present disclosure, by acquiring the transformation features of the target image and then selecting and retaining some of them, the target detection device can keep the more valuable and informative features of the target image. As a result, without increasing computing time or program footprint, it can effectively improve the accuracy of object detection in the image, avoiding both the loss of small-object information caused by reducing the image size and the large amount of computation caused by encoding and decoding.

The above modules may be one or more integrated circuits configured to implement the above method, for example: one or more application-specific integrated circuits (ASIC), one or more microprocessors (digital signal processors, DSP), or one or more field-programmable gate arrays (FPGA). As another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor that can call program code. As a further example, these modules can be integrated together and implemented in the form of a system-on-a-chip (SOC).

In some optional implementations of the present disclosure, the feature selection module 140 is optionally configured to perform detection on the transformation features using a pre-trained channel selection network corresponding to the target image, to obtain the target channel. The channel selection network may be a network model obtained by pre-training on the transformation features of sample images, and the sample images and the target image may be images of the same encoding format.

In some optional implementation manners of the present disclosure, the channel selection network may include: a pooling layer, a convolution processing layer, an activation function layer, and a sampling layer. The feature selection module 140 is optionally configured to: use the pooling layer to perform global average pooling on the feature values of each channel in the transformation features to obtain pooling features; use the convolution processing layer to perform convolution processing on the pooling features to obtain convolution features; use the activation function layer to process the convolution features to obtain probability features; and use the sampling layer to sample the channels corresponding to the target image according to the probability features to obtain the target channel.
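The forward pass of such a channel selection network can be sketched as follows. This is a sketch under stated assumptions, not the trained network of the disclosure: the weights are random and untrained, the convolution is modeled as a 1x1 convolution on the pooled 1x1 map (a matrix product), the activation is a sigmoid, and the sampling layer draws a two-way (keep or drop) Gumbel-softmax sample per channel. The function name, the temperature `tau`, and the tensor sizes are hypothetical.

```python
import numpy as np

def select_channels(features, weight, bias, rng, tau=1.0):
    """Return (keep probability, 0/1 keep mask) per channel of H x W x C features."""
    pooled = features.mean(axis=(0, 1))      # pooling layer: global average, shape (C,)
    conv = weight @ pooled + bias            # convolution layer: 1x1 conv on a 1x1 map
    prob = 1.0 / (1.0 + np.exp(-conv))       # activation layer: sigmoid
    eps = 1e-9
    # Two-way logits per channel: index 0 means keep, index 1 means drop.
    logits = np.stack([np.log(prob + eps), np.log(1.0 - prob + eps)], axis=-1)
    gumbel = -np.log(-np.log(rng.uniform(eps, 1.0 - eps, size=logits.shape)))
    z = (logits + gumbel) / tau
    z -= z.max(axis=-1, keepdims=True)       # subtract the max for numerical stability
    soft = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    mask = (soft[:, 0] > soft[:, 1]).astype(np.int64)
    return prob, mask

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 64))          # stand-in transformation features
weight = rng.normal(scale=0.1, size=(64, 64))   # untrained, random parameters
bias = np.zeros(64)
prob, mask = select_channels(features, weight, bias, rng)
target_channels = features[..., mask == 1]
```

In training, the soft Gumbel-softmax output would typically be used so that gradients can flow through the sampling step; the hard mask shown here corresponds to inference-time selection.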

In some optional implementations of the present disclosure, the region division module 120 is optionally configured to perform pixel region division on the target image in the plurality of images, and each pixel region obtained by the division may include N*N pixel units, where N is a positive integer greater than 0.

In some optional implementation manners of the present disclosure, each divided pixel area may include 8*8 pixel units.

In some optional implementations of the present disclosure, the encoding module 110 is optionally configured to perform color coding on the input original color image to obtain multiple images in the YUV color space, and is further configured to subtract 127 from the pixel value of each pixel of the multiple images in the YUV color space.

In some optional implementations of the present disclosure, the detection module 150 is optionally configured to input the frequency-domain feature information of the target channel into a preset downsampling layer in a pre-trained frequency-domain detection network for processing, to obtain information about the target object.

In some optional implementations of the present disclosure, the plurality of images may include a Y component image, a U component image, and a V component image. The target image includes a Y component image, and in some optional implementation manners, the target image may also include a U component image and a V component image.

In the case where the target image includes a Y component image, a U component image, and a V component image, the feature selection module 140 is optionally configured to: select a first preset number of low-frequency channels from the transformation features of the Y component image as the Y component low-frequency channels; select a second preset number of low-frequency channels from the transformation features of the U component image as the U component low-frequency channels; and select a third preset number of low-frequency channels from the transformation features of the V component image as the V component low-frequency channels; wherein the first preset number is greater than the second preset number and greater than the third preset number.

The above-mentioned apparatus is configured to execute the methods provided in the foregoing embodiments, and the implementation principles and technical effects thereof are similar, and details are not repeated here.

FIG. 9 is a schematic diagram of an electronic device 200 provided by an embodiment of the present disclosure. As shown in FIG. 9, another aspect of the embodiments of the present disclosure provides an electronic device 200, which may include a memory 201 and a processor 202. The memory 201 stores a computer program executable by the processor 202, and the processor 202 invokes the program stored in the memory 201 to execute any one of the embodiments of the target detection method described above. The specific implementation manner and technical effects are similar and are not repeated here.

In the embodiments provided in the present disclosure, it should be understood that the disclosed devices and methods may be implemented in other ways. For example, the device embodiments described above are only illustrative. The division of units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

A unit described as a separate component may or may not be physically separated, and a component displayed as a unit may or may not be a physical unit, that is, it may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware, or in the form of hardware plus software functional units.

Integrated units implemented in the form of software functional units may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute some of the steps of the methods described in the various embodiments of the present disclosure. The aforementioned storage medium can include a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.

The above descriptions are only examples of the present disclosure, and are not intended to limit the protection scope of the present disclosure. For those skilled in the art, the present disclosure may have various modifications and changes. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Industrial Applicability

The present disclosure provides a target detection method, device, electronic equipment, and storage medium. The target detection method may include: performing color coding on an input original color image to obtain multiple images in a YUV color space; dividing a target image among the multiple images into pixel regions to obtain multiple pixel regions corresponding to the target image; performing a discrete cosine transform on each pixel region to obtain the transformation features of the target image; selecting the target channel of the target region from the transformation features; and performing object detection according to the frequency-domain feature information of the target channel in the target image. The image can be reduced without encoding and decoding, and without affecting the detection performance for the target object in the reduced image.

In addition, it can be understood that the object detection method, device, electronic device and storage medium of the present disclosure are reproducible and can be used in various industrial applications, for example, tumor detection in medical images.

Claims

A target detection method, characterized by comprising:

performing color coding on an input original color image to obtain a plurality of images in the YUV color space;

dividing a target image in the plurality of images into pixel regions to obtain a plurality of pixel regions corresponding to the target image;

performing a discrete cosine transform on each of the pixel regions to obtain transformation features of the target image;

selecting a target channel of a target region from the transformation features; and

performing object detection according to frequency-domain feature information of the target channel in the target image.
The method according to claim 1, wherein the selecting the target channel of the target region from the transformation features comprises:

performing detection on the transformation features using a channel selection network to obtain the target channel, wherein the channel selection network is a network model obtained by training in advance on the transformation features of sample images, and the sample images and the target image are images in the same encoding format.
The method according to claim 2, wherein the channel selection network comprises: a pooling layer, a convolution processing layer, an activation function layer, and a sampling layer;

the performing detection on the transformation features using the channel selection network to obtain the target channel comprises:

using the pooling layer to perform global average pooling on the feature values of each channel in the transformation features to obtain pooling features;

using the convolution processing layer to perform convolution processing on the pooling features to obtain convolution features;

using the activation function layer to process the convolution features to obtain probability features; and

using the sampling layer to perform sampling processing on the channels corresponding to the target image according to the probability features to obtain the target channel.
The method according to claim 3, wherein the activation function layer is a sigmoid function layer.
The method according to claim 3, wherein the sampling layer is a Gumbel-softmax sampling layer.
The method according to any one of claims 1 to 5, wherein said dividing the pixel area of the target image in the plurality of images comprises:

Each of the divided pixel regions includes N*N pixel units, where N is a positive integer greater than 0.
The method according to claim 6, wherein each of the pixel regions includes 8*8 pixel units.
The method according to any one of claims 1 to 7, wherein said color coding the input original color image to obtain a plurality of images in YUV color space also includes:

A predetermined value is respectively subtracted from the pixel values of the plurality of image pixels in the YUV color space.
The method according to any one of claims 1 to 8, wherein the object detection according to the frequency domain feature information of the target channel in the target image comprises:

The frequency-domain feature information of the target channel is input to the preset downsampling layer in the pre-trained frequency-domain detection network for processing to obtain the information of the target object.
The method according to any one of claims 1 to 8, wherein the object detection according to the frequency domain feature information of the target channel in the target image comprises:

The frequency-domain feature information of the target channel is spliced to the quadruple downsampled part of the original frequency-domain detection network as input, and the frequency-domain feature information detection uses the original image enlarged by four times as input.
The method according to any one of claims 1 to 10, wherein the plurality of images comprises: a Y component image, a U component image, and a V component image;

The target image includes: a Y component image.
The method according to claim 11, wherein the target image further comprises: a U component image, a V component image;

the selecting the target channel of the target region from the transformation features comprises:

selecting a first preset number of low-frequency channels from the transformation features of the Y component image as Y component low-frequency channels;

selecting a second preset number of low-frequency channels from the transformation features of the U component image as U component low-frequency channels; and

selecting a third preset number of low-frequency channels from the transformation features of the V component image as V component low-frequency channels; wherein the first preset number is greater than the second preset number and greater than the third preset number.
A target detection device, characterized in that it comprises:

An encoding module configured to color-encode the input original color image to obtain a plurality of images in the YUV color space;

A region division module configured to divide the pixel regions of the target image in the plurality of images to obtain a plurality of pixel regions corresponding to the target image;

A transform module configured to perform discrete cosine transform on each of the pixel regions to obtain transform features of the target image;

a feature selection module configured to select a target channel of a target region from the transformed features;

The detection module is configured to perform object detection according to the frequency-domain feature information of the target channel in the target image.
The target detection device according to claim 13, wherein the feature selection module is configured to: perform detection on the transformation features using a channel selection network to obtain the target channel, wherein the channel selection network is a network model obtained by training on the transformation features of sample images, and the sample images and the target image are images in the same encoding format.
An electronic device, characterized by comprising: a memory and a processor, wherein the memory stores a computer program executable by the processor, and when the processor executes the computer program, the target detection method according to any one of claims 1 to 12 is implemented.
A computer-readable storage medium, characterized in that a computer program is stored on the storage medium, and when the computer program is read and executed, the target detection method according to any one of claims 1 to 12 is implemented.