WO2021217858A1

WO2021217858A1 - Target identification method and apparatus based on picture, and electronic device and readable storage medium

Info

Publication number: WO2021217858A1
Application number: PCT/CN2020/098990
Authority: WO
Inventors: 童新宇; 刘莉红; 刘玉宇
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-04-30
Filing date: 2020-06-29
Publication date: 2021-11-04
Also published as: CN111652226B; CN111652226A

Abstract

A target identification method and apparatus based on a picture, and an electronic device and a computer-readable storage medium. The method comprises: by using a scene segmentation network, executing a convolution operation, an activation operation and a pooling operation on an original picture, so as to obtain a first feature set (S1); in the scene segmentation network, executing an up-sampling operation, the convolution operation and the activation operation on the first feature set, so as to obtain a second feature set, and performing a classification operation on the second feature set according to a pre-constructed classification function, so as to obtain a scene picture set (S2); and inputting the scene picture set into a target identification network to perform target identification, so as obtain a target picture (S3). The method further relates to blockchain technology, and the original picture and the target picture can be stored in a blockchain node. By means of the method, the problem of excessive calculation resources being occupied due to the large amount of calculation required during a target identification process is solved.

Description

Image-based target recognition method, device, electronic equipment and readable storage medium

This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on April 30, 2020, the application number is CN202010360752.4, and the invention title is "Image-based target recognition method, device and readable storage medium". All of them The content is incorporated in this application by reference.

Technical field

This application relates to the field of artificial intelligence technology, and in particular to a method, device, electronic device, and readable storage medium for image-based target recognition.

Background technique

Picture-based target recognition refers to the process of distinguishing one type of target from other targets in a picture. The current picture-based target recognition is mainly divided into traditional machine learning algorithms and deep learning algorithms. The traditional machine learning algorithms first perform digital image processing on the pictures, and then recognize the targets in the pictures based on machine learning such as support vector machines and decision trees. The deep learning algorithm is mainly based on the convolutional neural network, which directly recognizes the target in the picture.

The inventor realized that both methods can identify the target in the picture, but the traditional machine learning algorithm has cumbersome processing steps and low recognition accuracy. Although the deep learning algorithm has high recognition accuracy, because the convolutional neural network directly performs the image Target recognition does not split the steps of target recognition, so the recognition process requires a lot of calculations and takes up too much computing resources.

Summary of the invention

This application provides a picture-based target recognition method, device, electronic equipment, and computer-readable storage medium, the main purpose of which is to split the target recognition steps and solve the problem that the recognition process requires a lot of calculations and takes up too much computing resources.

In order to achieve the above objective, a picture-based target recognition method provided by this application includes:

Use the scene segmentation network to perform convolution, activation, and pooling operations on the original image to obtain the first feature set;

In the scene segmentation network, perform an upsampling operation, the convolution operation, and the activation operation on the first feature set to obtain a second feature set, and according to a pre-built classification function, perform an upsampling operation, the convolution operation and the activation operation on the second feature set. Perform a classification operation on the feature set to obtain a scene picture set;

The scene picture set is input into a target recognition network for target recognition to obtain a target picture.

In order to solve the above-mentioned problems, this application also provides a picture-based target recognition device, which includes:

The first feature acquisition module is configured to use the scene segmentation network to perform convolution, activation, and pooling operations on the original image to obtain the first feature set;

The scene picture extraction module is configured to perform an up-sampling operation, the convolution operation, and the activation operation on the first feature set in the scene segmentation network to obtain a second feature set, which is based on a pre-built classification Function to perform a classification operation on the second feature set to obtain a scene picture set;

The target picture recognition module is used to input the scene picture set into the target recognition network for target recognition to obtain the target picture.

In order to solve the above-mentioned problems, the present application also provides an electronic device, which includes:

Memory, storing at least one instruction; and

The processor implements the following steps when executing instructions stored in the memory:

In order to solve the above problems, the present application also provides a computer-readable storage medium storing a computer program, and when the computer program is executed by a processor, the following steps are implemented:

The embodiment of this application first uses the scene segmentation network to perform convolution, activation, and pooling operations on the original picture to achieve the purpose of extracting picture features from the original picture and reducing the size of the picture pixel, and at the same time according to the original picture included The scene of the scene, combined with the upsampling operation and the classification function to separate the picture features to obtain the scene picture set. Since the original picture is split into several scene pictures, the picture size is further reduced, and the target recognition network is used to directly identify from the scene picture set. Picture. Since this application uses a deep learning network that includes convolution operations, activation operations, and pooling operations, the accuracy of target recognition is high. At the same time, the original image is processed in a cycle of feature extraction, scene segmentation, and target recognition. Each process is There is a function of reducing the size of the picture, so this application can solve the problem that the recognition process requires a lot of calculations and takes up too much computing resources.

Description of the drawings

FIG. 1 is a schematic flowchart of a method for image-based target recognition provided by an embodiment of this application;

2 is a schematic diagram of modules of a picture-based target recognition device provided by an embodiment of the application;

3 is a schematic diagram of the internal structure of an electronic device that implements a method for image-based target recognition provided by an embodiment of the application;

The realization, functional characteristics, and advantages of the purpose of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed ways

It should be understood that the specific embodiments described here are only used to explain the present application, and are not used to limit the present application.

The execution subject of the image-based target recognition method provided in the embodiments of the present application includes but is not limited to at least one of the electronic devices that can be configured to execute the method provided in the embodiments of the present application, such as a server and a terminal. In other words, the image-based target recognition method can be executed by software or hardware installed on a terminal device or a server device, and the software can be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, etc.

Referring to FIG. 1, it is a schematic flowchart of a picture-based target recognition method provided by an embodiment of this application. In this embodiment, the image-based target recognition method includes:

S1. Obtain an original picture, and perform a convolution operation, an activation operation, and a pooling operation on the original picture according to the scene segmentation network to obtain a first feature set.

In the embodiment of the present application, the original picture is a picture used to perform target recognition, that is, a preset type of target is recognized from the original picture. There are various ways to obtain the original picture, including obtaining images taken by a user through a mobile phone, pictures crawled by using crawler technology on the Internet, and the like.

In one of the application scenarios of this application, Xiao Zhang is a truck driver. When driving a truck, he was hit by a high-altitude projectile on the front engine hood of the truck. Therefore, in one of the embodiments of this application, Xiao Zhang uses a mobile phone to take a picture of the truck driver. The picture of the front engine hood of the truck after being hit by a high-altitude projectile is the original picture described in the embodiment of the application. The embodiment of the application recognizes that the engine hood in the picture of the front engine hood of the truck is smashed through the original picture. In the area.

Preferably, in order to identify the area where the engine hood is hit from the front engine hood picture of the truck, the embodiment of the present application needs to construct a scene segmentation network for dividing the original picture into several scene pictures. For example, a picture of the front engine hood of a truck may include pictures of the front engine hood of the truck, truck tires, and roads where the truck tires are located. Therefore, a scene segmentation network is constructed to divide the picture of the front engine hood of the truck into only the front engine engine of the truck. Cover pictures, pictures of truck tires, pictures of highways where truck tires are located.

Preferably, the constructing the scene segmentation network includes: constructing a segmentation layer that performs a convolution operation, an activation operation, and a pooling operation, and constructing an extraction layer that performs an upsampling operation, the convolution operation, and the activation operation; And construct an output layer that performs the convolution operation, the activation operation, and the classification operation, and construct the scene segmentation network according to the segmentation layer, the extraction layer, and the output layer.

After the scene segmentation network is constructed, the scene segmentation network needs to be trained to adjust the internal parameters of the scene segmentation network. Preferably, the training includes:

Step A: Obtain a scene picture training set, and use the segmentation layer to perform first feature extraction on the scene picture training set to obtain a first scene feature set;

Step B: Use the extraction layer to perform second feature extraction in the first scene feature set to obtain a second scene feature set;

Step C: Use the output layer to perform a third feature extraction and the classification operation in the second scene feature set to obtain a first training value;

Step D: When the first training value is greater than the preset first training threshold, return to step A;

Step E: When the first training value is less than or equal to the first training threshold, a trained scene segmentation network is obtained.

In detail, the embodiment of the present application first constructs 5 segmentation layers, each segmentation layer includes convolution operation, activation operation, and pooling operation, and further constructs 4 extraction layers, each segmentation layer includes upsampling operation, convolution Operation and activation operation, and then construct an output layer, the output layer includes convolution operation, activation operation and classification operation.

Wherein, the convolution operation and the pooling operation are the convolution operation and the pooling operation in the currently disclosed convolutional neural network. The activation operation may use a linear rectification function, a Sigmoid function, or the like. The classification operation can use the Softmax function.

In detail, the embodiment of the present application first obtains a scene picture training set from the Internet or public data sets, and inputs the obtained scene picture training set to the scene segmentation network for training, where the first training value can be constructed according to pre-built The loss function of, such as perceptual loss function, square loss function, etc. is calculated.

Further, when the training is completed to obtain the scene segmentation network, the embodiment of the present application inputs the original picture to the scene segmentation network to sequentially perform the convolution operation, the activation operation, and the pooling operation to obtain the first feature set, As the above-mentioned picture of the front engine hood of a truck hit by a high-altitude projectile, convolution, activation, and pooling are performed in the first segmentation layer, and then convolution, activation, and pooling are performed in the second segmentation layer. The first feature set is obtained after convolution operation, activation operation, and pooling operation are performed in the fifth segmentation layer.

S2. In the scene segmentation network, perform an upsampling operation, a convolution operation, and an activation operation on the first feature set to obtain a second feature set, and according to a pre-built classification function, perform an upsampling operation, a convolution operation, and an activation operation on the second feature set Perform a classification operation to obtain a collection of scene pictures.

As described in S1, in this embodiment of the application, the scene segmentation network includes 5 segmentation layers, 4 extraction layers, and 1 output layer. After the original image is processed by the 5 segmentation layers, the first feature set can be obtained. , Further, use 4 extraction layers to operate on the first feature set to obtain the second feature set.

In the embodiment of this application, the four extraction layers respectively perform upsampling, convolution, and activation operations on the first feature set. The upsampling operation includes resampling and interpolation operations, such as pre-setting a desired image size, using For example, in a method such as bilinear interpolation, the first feature set is interpolated to complete the up-sampling operation.

After 4 extraction layers, the second feature set can be obtained. According to the construction process of the output layer, the second feature set is first subjected to convolution and activation operations, and then pre-built classification functions such as Softmax functions are used for classification Operate to obtain the scene picture collection.

S3. Input the scene segmentation set into a target recognition network for target recognition to obtain a target picture.

The target recognition network is mainly to identify the targets that appear in the scene segmentation set, such as the above-mentioned scene picture set including the front engine hood picture of the truck, the picture of the truck tire, and the picture of the road where the truck tire is located. If the target recognition network is to identify the front of the truck The engine hood is identified in each scene picture for the purpose of identifying the front engine hood of the truck.

The embodiment of the present application first constructs the target recognition network, and the construction includes:

Step I: Based on the convolution operation in the scene segmentation network, construct a first target recognition layer that includes an expanded convolution operation;

In detail, the step I includes: extracting the size of the convolution kernel of the convolution operation and setting the expansion rate, using the size of the convolution kernel and the expansion rate as input parameters of the pre-built expansion convolution calculation formula, and calculating The expanded convolution kernel size of the expanded convolution operation is obtained, and the first target recognition layer is constructed by combining the convolution kernel size and the expanded convolution kernel size.

For example, the convolution kernel size (kernel_size) in the convolution operation is 3*3, and the dilation rate (dilation_rate) is 2, according to the dilation_rate*(kernel_size-1)+1 calculation formula for dilation convolution, it is calculated as: 2*(3-1 )+1=5, so the size of the expanded convolution kernel is 5*5.

When the convolution kernel 3*3 and the expanded convolution kernel 5*5 are obtained, the first target recognition layer can be constructed according to the actual application scenario, such as the first target recognition layer with 5 convolution operations and 5 dilated convolution operations Wait.

Step II: construct a similarity measure classification function, and construct a second target recognition layer according to the expanded convolution operation and the similarity measure classification function;

The similarity measure classification function is:

Where y ^* is the label value of the target image training set,

Train the training value of the target image training set for the target recognition network, c is the label value category of the target image training set, if the target image training set has a total of 172 label values, the number of c is 172 .

The construction of the second target recognition layer also needs to be based on actual application scenarios. In the embodiment of this application, the operations of the second target recognition layer mainly include first convolution operation, then multiple expansion convolution operations, and finally classification using the similarity measure The function outputs the target result.

Step III: Combine the first target recognition layer and the second target recognition layer to obtain the target recognition network.

Corresponding to the scene segmentation network, after the construction of the target recognition network is completed, the target recognition network needs to be trained to adjust the internal parameters of the target recognition network. Preferably, the training includes:

Step a: Obtain a target picture training set, and use the first target recognition layer to perform a first dilated convolution operation on the target picture training set to obtain a first target feature set;

Step b: Use the second target recognition layer to perform a second dilated convolution operation on the first target feature set, and calculate a similarity measure to obtain a second training value;

Step c: If the second training value is greater than the second training threshold, return to step I;

Step d: If the second training value is less than or equal to the second training threshold, the target recognition network is obtained.

After combining the above steps of constructing the target recognition network and training the target recognition network, a trained target recognition network is obtained.

Further, in the embodiment of the present application, the scene picture set is input into the target recognition network for target recognition, and the target picture can be obtained.

For example, in one of the application scenarios, the embodiment of the present application uses the scene segmentation network and the target recognition network to input a picture of the front engine hood of a truck into the scene segmentation network to obtain a scene picture including only the front engine hood of the truck, and the truck The scene picture of the background where the front engine hood is located, and the scene picture of the front engine hood of the truck is input to the target recognition network to obtain a picture of the area where the front engine hood of the truck is hit by a high-altitude projectile. This area picture is the target picture.

In a preferred embodiment of the present application, the original picture and the target picture may be stored in a blockchain node.

As shown in Fig. 2, it is a functional block diagram of the image-based target recognition device of the present application.

The image-based target recognition apparatus 100 described in this application can be installed in an electronic device. According to the realized functions, the picture-based target recognition device may include a first feature acquisition module 101, a scene picture extraction module 102, and a target picture recognition module 103. The module described in the present invention can also be called a unit, which refers to a series of computer program segments that can be executed by the processor of an electronic device and can complete fixed functions, and are stored in the memory of the electronic device.

In this embodiment, the functions of each module/unit are as follows:

The first feature acquisition module 101 is configured to use a scene segmentation network to perform a convolution operation, an activation operation, and a pooling operation on the original picture to obtain a first feature set;

The scene picture extraction module 102 is configured to perform an up-sampling operation, the convolution operation, and the activation operation on the first feature set in the scene segmentation network to obtain a second feature set, and according to the previous The constructed classification function performs a classification operation on the second feature set to obtain a scene picture set;

The target picture recognition module 103 is configured to input the scene picture set into a target recognition network for target recognition to obtain a target picture.

In detail, the specific implementation steps of each module of the image-based target recognition device are as follows:

The first feature acquisition module 101 uses the scene segmentation network to perform convolution, activation, and pooling operations on the original picture to obtain a first feature set.

Preferably, in order to identify the area where the engine hood has been hit from the picture of the front engine hood of the truck, the embodiment of the application needs to construct a scene segmentation network to divide the original picture into several scene pictures, such as the picture of the front engine hood of the truck. Including the truck front engine hood picture, truck tires, the road where the truck tires are located, etc., so build a scene segmentation network to divide the truck front engine hood picture into only the truck front engine hood picture, the truck tire picture, and the truck tire. Picture of the highway where you are.

Preferably, this application also includes a scene segmentation network construction module 104. The scene segmentation network construction module 104 is configured to construct a segmentation layer that performs the convolution operation, the activation operation, and the pooling operation, and construct and perform the upsampling operation, the convolution operation, and the activation Operating an extraction layer; and constructing an output layer that performs the convolution operation, the activation operation, and the classification operation, and constructs the scene segmentation network according to the segmentation layer, the extraction layer, and the output layer.

Further, the embodiment of the present application may also include a scene segmentation network training module 105, which is used to adjust the internal parameters of the scene segmentation network. Preferably, the scene segmentation network training module 105 performs the following operations when adjusting the internal parameters of the scene segmentation network:

The scene picture extraction module 102 performs an up-sampling operation, the convolution operation, and the activation operation on the first feature set in the scene segmentation network to obtain a second feature set, which is based on a pre-built classification Function to perform a classification operation on the second feature set to obtain a scene picture set.

As mentioned above, in this embodiment of the application, the scene segmentation network includes 5 segmentation layers, 4 extraction layers, and 1 output layer. After the original image is processed by the 5 segmentation layers, the first feature set can be obtained. Further, four extraction layers are used to operate on the first feature set to obtain the second feature set.

The target picture recognition module 103 inputs the scene picture set into a target recognition network for target recognition to obtain a target picture.

Further, the embodiment of the present application further includes a target recognition network construction module 106, and the target recognition network construction module 106 is configured to execute:

For example, the size of the convolution kernel (kernel_size) in the convolution operation is 3*3, and the dilation rate (dilation_rate) is 2, according to the dilation_rate*(kernel_size-1)+1 calculation formula for dilation convolution, it is calculated as: 2*(3-1 )+1=5, so the size of the expanded convolution kernel is 5*5.

The similarity measure classification function is:

Where y ^* is the label value of the target image training set,

Train the predicted value of the target image training set for the target recognition network, c is the label value category of the target image training set, if the target image training set has a total of 172 label values, the number of c is 172 .

Corresponding to the scene segmentation network, after the construction of the target recognition network is completed, the target recognition network needs to be trained to adjust the internal parameters of the target recognition network. Therefore, preferably, the embodiment of the present application further includes a target recognition network training module 107.

The target recognition network training module is used to execute:

As shown in FIG. 3, it is a schematic diagram of the structure of an electronic device that implements the image-based target recognition method according to the present application.

The electronic device 1 may include a processor 10, a memory 11, and a bus, and may also include a computer program stored in the memory 11 and running on the processor 10, such as a picture-based target recognition program 12.

The memory 11 includes at least one type of readable storage medium. The readable storage medium may be non-volatile or volatile. The readable storage medium includes flash memory, mobile hard disk, and multimedia card. , Card-type memory (for example: SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, for example, a mobile hard disk of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a smart media card (SMC), and a secure digital (Secure Digital) equipped on the electronic device 1. , SD) card, flash card (Flash Card), etc. Further, the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device. The memory 11 can be used not only to store application software and various data installed in the electronic device 1, such as a code for target recognition based on pictures, etc., but also to temporarily store data that has been output or will be output.

Further, the readable storage medium may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, an application program required by at least one function, etc.; the storage data area may store a block chain node Use the created data, etc.

The processor 10 may be composed of integrated circuits in some embodiments, for example, may be composed of a single packaged integrated circuit, or may be composed of multiple integrated circuits with the same function or different functions, including one or more Combinations of central processing unit (CPU), microprocessor, digital processing chip, graphics processor, and various control chips, etc. The processor 10 is the control unit of the electronic device, which uses various interfaces and lines to connect the various components of the entire electronic device, and runs or executes programs or modules stored in the memory 11 (for example, executing Target recognition based on pictures, etc.), and call data stored in the memory 11 to execute various functions of the electronic device 1 and process data.

The bus may be a peripheral component interconnect standard (PCI) bus or an extended industry standard architecture (EISA) bus, etc. The bus can be divided into address bus, data bus, control bus and so on. The bus is configured to implement connection and communication between the memory 11 and at least one processor 10 and the like.

FIG. 3 only shows an electronic device with components. Those skilled in the art can understand that the structure shown in FIG. 3 does not constitute a limitation on the electronic device 1, and may include fewer or more components than shown in the figure. Components, or a combination of certain components, or different component arrangements.

For example, although not shown, the electronic device 1 may also include a power source (such as a battery) for supplying power to various components. Preferably, the power source may be logically connected to the at least one processor 10 through a power management device, thereby controlling power The device implements functions such as charge management, discharge management, and power consumption management. The power supply may also include any components such as one or more DC or AC power supplies, recharging devices, power failure detection circuits, power converters or inverters, and power status indicators. The electronic device 1 may also include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.

Further, the electronic device 1 may also include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a Bluetooth interface, etc.), which is usually used in the electronic device 1 Establish a communication connection with other electronic devices.

Optionally, the electronic device 1 may also include a user interface. The user interface may be a display (Display) and an input unit (such as a keyboard (Keyboard)). Optionally, the user interface may also be a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, etc. Among them, the display can also be appropriately called a display screen or a display unit, which is used to display the information processed in the electronic device 1 and to display a visualized user interface.

It should be understood that the embodiments are only for illustrative purposes, and are not limited by this structure in the scope of the patent application.

The picture-based target recognition 12 stored in the memory 11 in the electronic device 1 is a combination of multiple instructions. When running in the processor 10, it can realize:

In the scene segmentation network, perform an upsampling operation, the convolution operation, and the activation operation on the first feature set to obtain a second feature set, and according to the pre-built classification function, the second feature set is Perform a classification operation on the feature set to obtain a scene picture set;

Specifically, for the specific implementation method of the above-mentioned instructions by the processor 10, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which will not be repeated here.

Further, if the integrated module/unit of the electronic device 1 is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a non-volatile or volatile computer readable storage medium . The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, U disk, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM, Read-Only Memory) .

In the several embodiments provided in this application, it should be understood that the disclosed equipment, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. For example, the division of the modules is only a logical function division, and there may be other division methods in actual implementation.

The modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional modules in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional modules.

For those skilled in the art, it is obvious that the present application is not limited to the details of the foregoing exemplary embodiments, and the present application can be implemented in other specific forms without departing from the spirit or basic characteristics of the application.

Therefore, no matter from which point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than the above description, and therefore it is intended to fall into the claims. All changes in the meaning and scope of the equivalent elements of are included in this application. Any associated diagram marks in the claims should not be regarded as limiting the claims involved.

In addition, it is obvious that the word "including" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices stated in the system claims can also be implemented by one unit or device through software or hardware. The second class words are used to indicate names, and do not indicate any specific order.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the application and not to limit them. Although the application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the application can be Make modifications or equivalent replacements without departing from the spirit and scope of the technical solution of the present application.

Claims

A picture-based target recognition method, wherein the method includes:

Use the scene segmentation network to perform convolution, activation, and pooling operations on the original image to obtain the first feature set;

In the scene segmentation network, perform an upsampling operation, the convolution operation, and the activation operation on the first feature set to obtain a second feature set, and according to a pre-built classification function, perform an upsampling operation, the convolution operation and the activation operation on the second feature set. Perform a classification operation on the feature set to obtain a scene picture set;

The scene picture set is input into a target recognition network for target recognition to obtain a target picture.
The image-based target recognition method of claim 1, wherein the method further comprises constructing the scene segmentation network, and the constructing comprises:

Constructing a segmentation layer that performs the convolution operation, the activation operation, and the pooling operation;

Construct an extraction layer that performs the upsampling operation, the convolution operation, and the activation operation; and

Constructing an output layer that performs the convolution operation, the activation operation, and the classification operation;

The scene segmentation network is constructed according to the segmentation layer, the extraction layer, and the output layer.
3. The image-based target recognition method of claim 2, wherein the method further comprises: training the scene segmentation network, wherein the training comprises:

Step A: Obtain a scene picture training set, and use the segmentation layer to perform first feature extraction on the scene picture training set to obtain a first scene feature set;

Step B: Use the extraction layer to perform second feature extraction in the first scene feature set to obtain a second scene feature set;

Step C: Use the output layer to perform the third feature extraction and the classification operation in the second scene feature set to obtain and output the first training value;

Step D: When the first training value is greater than the preset first training threshold, return to step A;

Step E: When the first training value is less than or equal to the first training threshold, a trained scene segmentation network is obtained.
The image-based target recognition method according to claim 1, wherein the method further comprises constructing the target recognition network, and the constructing comprises:

Extracting the size of the convolution kernel of the convolution operation in the scene segmentation network and setting the expansion rate;

Calculate the expanded convolution kernel size of the expanded convolution operation according to the size of the convolution kernel, the expansion ratio, and a pre-built expanded convolution calculation formula;

Constructing and obtaining the first target recognition layer according to the size of the convolution kernel and the size of the expanded convolution kernel;

Construct a similarity measure classification function, and construct a second target recognition layer according to the dilated convolution operation and the similarity measure classification function;

According to the first target recognition layer and the second target recognition layer, the target recognition network is constructed.
5. The image-based target recognition method of claim 4, wherein the method further comprises: training the target recognition network, wherein the training comprises:

Step a: Obtain a target picture training set, and use the first target recognition layer to perform a first dilated convolution operation on the target picture training set to obtain a first target feature set;

Step b: Use the second target recognition layer to perform a second dilated convolution operation and similarity metric calculation on the first target feature set to obtain and output a second training value;

Step c: If the second training value is greater than the second training threshold, return to step a;

Step d: If the second training value is less than or equal to the second training threshold, the target recognition network is obtained.
8. The image-based target recognition method of claim 5, wherein the similarity metric classification function adopts the following construction method:

Where y * is the label value of the target image training set,
The training value of the target picture training set is trained for the target recognition network, and c is the category of the label value of the target picture training set.
A picture-based target recognition device, wherein the device includes:

The first feature acquisition module is configured to use the scene segmentation network to perform convolution, activation, and pooling operations on the original image to obtain the first feature set;

The scene picture extraction module is configured to perform an up-sampling operation, the convolution operation, and the activation operation on the first feature set in the scene segmentation network to obtain a second feature set, which is based on a pre-built classification Function to perform a classification operation on the second feature set to obtain a scene picture set;

The target picture recognition module is used to input the scene picture set into the target recognition network for target recognition to obtain the target picture.
8. The picture-based target recognition device according to claim 7, wherein the device further comprises a scene segmentation network construction module for:

Constructing a segmentation layer that performs the convolution operation, the activation operation, and the pooling operation;

Construct an extraction layer that performs the upsampling operation, the convolution operation, and the activation operation; and

Constructing an output layer that performs the convolution operation, the activation operation, and the classification operation;

The scene segmentation network is constructed according to the segmentation layer, the extraction layer, and the output layer.
An electronic device, wherein the electronic device includes:

At least one processor; and,

A memory communicatively connected with the at least one processor; wherein,

The memory stores instructions executable by the at least one processor, and when the instructions are executed by the at least one processor, the following steps are implemented:

Use the scene segmentation network to perform convolution, activation, and pooling operations on the original image to obtain the first feature set;

In the scene segmentation network, perform an upsampling operation, the convolution operation, and the activation operation on the first feature set to obtain a second feature set, and according to a pre-built classification function, perform an upsampling operation, the convolution operation and the activation operation on the second feature set. Perform a classification operation on the feature set to obtain a scene picture set;

The scene picture set is input into a target recognition network for target recognition to obtain a target picture.
9. The electronic device according to claim 9, wherein, when the instructions are executed by the at least one processor, the construction of the scene segmentation network is further implemented, and the construction comprises:

Constructing a segmentation layer that performs the convolution operation, the activation operation, and the pooling operation;

Construct an extraction layer that performs the upsampling operation, the convolution operation, and the activation operation; and

Constructing an output layer that performs the convolution operation, the activation operation, and the classification operation;

The scene segmentation network is constructed according to the segmentation layer, the extraction layer, and the output layer.
11. The electronic device according to claim 10, wherein, when the instructions are executed by the at least one processor, training the scene segmentation network is further implemented, wherein the training comprises:

Step A: Obtain a scene picture training set, and use the segmentation layer to perform first feature extraction on the scene picture training set to obtain a first scene feature set;

Step B: Use the extraction layer to perform second feature extraction in the first scene feature set to obtain a second scene feature set;

Step C: Use the output layer to perform the third feature extraction and the classification operation in the second scene feature set to obtain and output the first training value;

Step D: When the first training value is greater than the preset first training threshold, return to step A;

Step E: When the first training value is less than or equal to the first training threshold, a trained scene segmentation network is obtained.
9. The electronic device according to claim 9, wherein the instruction is executed by the at least one processor to further realize the construction of the target recognition network, and the construction comprises:

Extracting the size of the convolution kernel of the convolution operation in the scene segmentation network and setting the expansion rate;

Calculate the expanded convolution kernel size of the expanded convolution operation according to the size of the convolution kernel, the expansion ratio, and a pre-built expanded convolution calculation formula;

Constructing and obtaining the first target recognition layer according to the size of the convolution kernel and the size of the expanded convolution kernel;

Construct a similarity measure classification function, and construct a second target recognition layer according to the dilated convolution operation and the similarity measure classification function;

According to the first target recognition layer and the second target recognition layer, the target recognition network is constructed.
The electronic device according to claim 12, wherein, when the instructions are executed by the at least one processor, training the target recognition network is also implemented, wherein the training comprises:

Step a: Obtain a target picture training set, and use the first target recognition layer to perform a first dilated convolution operation on the target picture training set to obtain a first target feature set;

Step b: Use the second target recognition layer to perform a second dilated convolution operation and similarity metric calculation on the first target feature set to obtain and output a second training value;

Step c: If the second training value is greater than the second training threshold, return to step a;

Step d: If the second training value is less than or equal to the second training threshold, the target recognition network is obtained.
The electronic device according to claim 13, wherein the similarity measure classification function adopts the following construction method:

Where y * is the label value of the target image training set,
The training value of the target picture training set is trained for the target recognition network, and c is the category of the label value of the target picture training set.
A computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to implement the following steps:

Use the scene segmentation network to perform convolution, activation, and pooling operations on the original image to obtain the first feature set;

In the scene segmentation network, perform an upsampling operation, the convolution operation, and the activation operation on the first feature set to obtain a second feature set, and according to a pre-built classification function, perform an upsampling operation, the convolution operation and the activation operation on the second feature set. Perform a classification operation on the feature set to obtain a scene picture set;

The scene picture set is input into a target recognition network for target recognition to obtain a target picture.
15. The computer-readable storage medium according to claim 15, wherein when the computer program is executed by the processor, the scene segmentation network is constructed, and the construction includes:

Constructing a segmentation layer that performs the convolution operation, the activation operation, and the pooling operation;

Construct an extraction layer that performs the upsampling operation, the convolution operation, and the activation operation; and

Constructing an output layer that performs the convolution operation, the activation operation, and the classification operation;

The scene segmentation network is constructed according to the segmentation layer, the extraction layer, and the output layer.
15. The computer-readable storage medium according to claim 16, wherein the computer program further realizes training of the scene segmentation network when the computer program is executed by the processor, wherein the training comprises:

Step A: Obtain a scene picture training set, and use the segmentation layer to perform first feature extraction on the scene picture training set to obtain a first scene feature set;

Step B: Use the extraction layer to perform second feature extraction in the first scene feature set to obtain a second scene feature set;

Step C: Use the output layer to perform the third feature extraction and the classification operation in the second scene feature set to obtain and output the first training value;

Step D: When the first training value is greater than the preset first training threshold, return to step A;

Step E: When the first training value is less than or equal to the first training threshold, a trained scene segmentation network is obtained.
15. The computer-readable storage medium according to claim 15, wherein when the computer program is executed by the processor, the target recognition network is constructed, and the construction includes:

Extracting the size of the convolution kernel of the convolution operation in the scene segmentation network and setting the expansion rate;

Calculate the expanded convolution kernel size of the expanded convolution operation according to the size of the convolution kernel, the expansion ratio, and a pre-built expanded convolution calculation formula;

Constructing and obtaining the first target recognition layer according to the size of the convolution kernel and the size of the expanded convolution kernel;

Construct a similarity measure classification function, and construct a second target recognition layer according to the dilated convolution operation and the similarity measure classification function;

According to the first target recognition layer and the second target recognition layer, the target recognition network is constructed.
18. The computer-readable storage medium of claim 18, wherein the computer program, when executed by the processor, further implements training of the target recognition network, wherein the training includes:

Step a: Obtain a target picture training set, and use the first target recognition layer to perform a first dilated convolution operation on the target picture training set to obtain a first target feature set;

Step b: Use the second target recognition layer to perform a second dilated convolution operation and similarity measure calculation on the first target feature set to obtain and output a second training value;

Step c: If the second training value is greater than the second training threshold, return to step a;

Step d: If the second training value is less than or equal to the second training threshold, the target recognition network is obtained.
19. The computer-readable storage medium of claim 19, wherein the similarity metric classification function adopts the following construction method:

Where y * is the label value of the target image training set,
The training value of the target picture training set is trained for the target recognition network, and c is the category of the label value of the target picture training set.