
CN116992200A - Parallax calculation method, binocular vision system and agricultural unmanned aerial vehicle - Google Patents

Parallax calculation method, binocular vision system and agricultural unmanned aerial vehicle

Info

Publication number
CN116992200A
CN116992200A
Authority
CN
China
Prior art keywords
vector
image
feature
attention
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311136488.6A
Other languages
Chinese (zh)
Inventor
常志中
陈启东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Heilongjiang Huida Technology Co ltd
Original Assignee
Heilongjiang Huida Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Heilongjiang Huida Technology Co ltd filed Critical Heilongjiang Huida Technology Co ltd
Priority to CN202311136488.6A priority Critical patent/CN116992200A/en
Publication of CN116992200A publication Critical patent/CN116992200A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Image Processing (AREA)

Abstract

The application provides a parallax calculation method, a binocular vision system and an agricultural unmanned aerial vehicle, wherein the method comprises the following steps: extracting features of a first image acquired by the left eye of the binocular camera to obtain a first feature vector, and extracting features of a second image acquired by the right eye of the binocular camera to obtain a second feature vector; determining a query vector, a key value vector and a weight vector according to the first feature vector and/or the second feature vector; inputting the result of matrix multiplication of the query vector and the key value vector into a softmax function constructed from the sigmoid function to obtain the attention weight among pixels; multiplying the attention weight by the weight vector matrix to obtain the attention among pixels; and determining the parallax of each pixel on the first image and the second image based on the attention between the pixels. In the embodiment of the application, the softmax function constructed from the sigmoid function facilitates deployment of the transformer in an embedded NPU.

Description

Parallax calculation method, binocular vision system and agricultural unmanned aerial vehicle
Technical Field
The application relates to the technical field of agricultural unmanned aerial vehicles, in particular to a parallax calculation method, a binocular vision system and an agricultural unmanned aerial vehicle.
Background
In the field of computer vision, binocular depth estimation has wide application prospects and research significance because it can reconstruct three-dimensional (3D) information. The basic principle of binocular vision is to take pictures with two parallel cameras, calculate depth information from the difference (parallax) in corresponding pixel positions between the left and right camera images, and use the depth information to reconstruct the three-dimensional scene. A key component in parallax calculation is the transformer, and the conventional softmax function used in current transformers prevents the deployment of the transformer in embedded NPUs.
Disclosure of Invention
The application provides a parallax calculation method, a binocular vision system and an agricultural unmanned aerial vehicle; a softmax function constructed from the sigmoid function facilitates deployment of the transformer in an embedded NPU.
In a first aspect, there is provided a parallax calculation method, the method comprising: acquiring a first image acquired by a left eye and a second image acquired by a right eye in a binocular camera; extracting features of the first image to obtain a first feature vector, and extracting features of the second image to obtain a second feature vector; determining a query vector Q, a key value vector K and a weight vector V according to the first feature vector and/or the second feature vector; inputting the result QK^T of matrix multiplication of the query vector Q and the key value vector K into the following softmax function to obtain the attention weight between pixels:

softmax(x_i) = (sigmoid(x_i) / (1 - sigmoid(x_i))) / Σ_{j=1}^{c} (sigmoid(x_j) / (1 - sigmoid(x_j)))
wherein i represents the position of the pixel, the value range of j is [1, c ], and c is the channel number; multiplying the attention weight by the weight vector V matrix to obtain the attention among the pixels; a disparity for each pixel on the first image and the second image is determined based on the attention between the pixels.
Based on the above technical solution, since the neural network processing unit (NPU) in most embedded computing platforms can accelerate the sigmoid function, a softmax function constructed from it can also be accelerated by the NPU, completing the deployment of the transformer in the embedded NPU.
The above attention weights between pixels may include an attention weight between pixels on the first image, an attention weight between pixels on the second image, or an attention weight between pixels on the first image and the second image.
With reference to the first aspect, in certain implementation manners of the first aspect, determining the query vector Q, the key value vector K, and the weight vector V according to the first feature vector and the second feature vector includes: determining the query vector Q, the key value vector K and the weight vector V according to the feature vectors in the width W and/or height H directions of the first feature vector; or determining the query vector Q, the key value vector K and the weight vector V according to the feature vectors positioned in the width W and/or height H directions in the second feature vectors; or determining the query vector Q, the key value vector K and the weight vector V according to the feature vector located in the width W direction in the first feature vector and the feature vector located in the width W direction in the second feature vector; or determining the query vector Q, the key value vector K and the weight vector V according to the feature vector located in the height H direction in the first feature vector and the feature vector located in the height H direction in the second feature vector.
Based on the technical scheme, by alternately carrying out the attention mechanism calculation in the H and W directions, the calculation complexity is reduced while the precision is not lost, and the application at the embedded end is facilitated.
With reference to the first aspect, in some implementations of the first aspect, the determining a parallax of each pixel on the first image and the second image according to the attention between the pixels includes: and according to the attention among the pixel values, performing matching cost calculation, cost aggregation, parallax calculation and parallax optimization to obtain the parallax of each pixel.
In a second aspect, there is provided a binocular vision system comprising: a binocular camera for acquiring a first image through a left eye and a second image through a right eye; the binocular camera is used for sending the first image and the second image to the processor; the processor is used for carrying out feature extraction on the first image to obtain a first feature vector, and carrying out feature extraction on the second image to obtain a second feature vector; the processor is further configured to determine a query vector Q, a key value vector K, and a weight vector V according to the first feature vector and/or the second feature vector; the processor is further configured to input the result QK^T of matrix multiplication of the query vector Q and the key value vector K into the following softmax function to obtain the attention weight among pixels:

softmax(x_i) = (sigmoid(x_i) / (1 - sigmoid(x_i))) / Σ_{j=1}^{c} (sigmoid(x_j) / (1 - sigmoid(x_j)))
wherein i represents the position of the pixel, the value range of j is [1, c ], and c is the channel number; the processor is further configured to multiply the attention weight by the weight vector V matrix to obtain the attention between the pixels; the processor is further configured to determine a disparity for each pixel on the first image and the second image based on the attention between the pixels.
With reference to the second aspect, in certain implementations of the second aspect, the processor is configured to: determining the query vector Q, the key value vector K and the weight vector V according to the feature vectors in the width W and/or height H directions of the first feature vector; or determining the query vector Q, the key value vector K and the weight vector V according to the feature vectors positioned in the width W and/or height H directions in the second feature vectors; or determining the query vector Q, the key value vector K and the weight vector V according to the feature vector located in the width W direction in the first feature vector and the feature vector located in the width W direction in the second feature vector; or determining the query vector Q, the key value vector K and the weight vector V according to the feature vector located in the height H direction in the first feature vector and the feature vector located in the height H direction in the second feature vector.
With reference to the second aspect, in certain implementations of the second aspect, the processor is configured to: and according to the attention among the pixel values, performing matching cost calculation, cost aggregation, parallax calculation and parallax optimization to obtain the parallax of each pixel.
In a third aspect, there is provided an agricultural unmanned aerial vehicle comprising: a memory for storing computer instructions; a processor for executing computer instructions stored in the memory to cause the apparatus to perform the method of any one of the first aspects above.
In a fourth aspect, there is provided an agricultural unmanned aerial vehicle comprising the binocular vision system of any one of the second aspects above.
In a fifth aspect, there is provided a computer program product comprising: computer program code which, when run on a computer, causes the computer to perform the method of any of the above first aspects.
The computer program code may be stored in whole or in part on a first storage medium, where the first storage medium may be packaged together with the processor or separately from the processor, and embodiments of the present application are not limited in this regard.
In a sixth aspect, there is provided a computer readable medium storing program code which, when run on a computer, causes the computer to perform the method of any one of the first aspects above.
In a seventh aspect, a chip is provided, the chip comprising circuitry for performing the method of any of the first aspects described above.
Drawings
Fig. 1 is a schematic block diagram of a binocular vision system provided by an embodiment of the present application.
Fig. 2 is a schematic flowchart of a parallax calculation method provided by an embodiment of the present application.
FIG. 3 is a schematic diagram of an attention mechanism provided by an embodiment of the present application.
Fig. 4 is a schematic diagram of the H-direction and W-direction attention mechanisms provided by an embodiment of the present application.
Fig. 5 is a schematic diagram of a network structure according to an embodiment of the present application.
Fig. 6 is a schematic diagram of a parallax prediction result after parallax optimization according to an embodiment of the present application.
Detailed Description
In the description of the embodiments of the present application, unless otherwise indicated, "/" means "or"; for example, A/B may represent A or B. "And/or" herein describes an association relationship between associated objects and covers three cases; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In the present application, "at least one" means one or more, and "a plurality" means two or more. "At least one of" the following items means any combination of these items, including any combination of single items or plural items. For example, at least one of a, b or c may represent: a; b; c; a and b; a and c; b and c; or a, b and c, where a, b and c may each be single or plural.
In the embodiments of the present application, prefix words such as "first" and "second" are adopted only to distinguish different description objects, and impose no limitation on the position, sequence, priority, quantity or content of the described objects. The use of such ordinal prefix words does not limit the described object; statements about the described object are to be read in the claims or in the context of the embodiments, and the prefix words should not constitute unnecessary limitations.
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.
As described above, in the field of computer vision, binocular depth estimation has wide application prospects and research significance because it can reconstruct three-dimensional (3D) information. The basic principle of binocular vision is to take pictures with two parallel cameras, calculate depth information from the difference (parallax) in corresponding pixel positions between the left and right camera images, and use the depth information to reconstruct the three-dimensional scene. A key component in parallax calculation is the transformer, which uses a dot-product attention mechanism to compute the similarity between a query vector and a key vector to extract relevant information. The current dot-product attention mechanism causes a large calculation amount and low data-processing efficiency.
The embodiment of the application provides a parallax calculation method, a binocular vision system and an agricultural unmanned aerial vehicle; the softmax function constructed from the sigmoid function facilitates deployment of the transformer in the embedded NPU.
Fig. 1 shows a schematic block diagram of a binocular vision system 100 provided by an embodiment of the present application. As shown in fig. 1, the binocular vision system 100 may include a binocular camera 110 and a processor 120. The binocular camera 110 may send the captured pictures to the processor 120. The processor 120 may perform feature vector extraction according to the pictures acquired by the binocular camera, and determine a query vector, a key value vector, and a weight vector according to the extracted feature vectors. In the calculation process of the attention mechanism, a softmax function is constructed through a sigmoid function to calculate the attention among pixels, so that the attention among pixels on the left-eye image output by the left-eye camera or the attention among pixels on the image output by the right-eye camera or the attention among pixels on the image output by the left-eye camera and the image output by the right-eye camera is obtained.
The processor 120 in embodiments of the present application may be a central processing unit (CPU), or another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor.
Fig. 2 shows a schematic flowchart of a parallax calculation method 200 provided by an embodiment of the present application. The method 200 may be performed by the binocular vision system 100 described above, or may be performed by the processor 120 in the binocular vision system 100.
The method 200 includes:
s210, acquiring a first image acquired by a left eye and a second image acquired by a right eye in the binocular camera.
S220, performing feature extraction on the first image to obtain a first feature vector, and performing feature extraction on the second image to obtain a second feature vector.
In one embodiment, feature extraction may be performed on the first image and the second image via a shared backbone network.
For example, feature vectors with different resolution levels are output by feature encoding using residual connection and spatial pyramid pooling modules.
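The multi-scale feature encoding described above can be sketched as follows; a minimal numpy illustration of spatial-pyramid-style average pooling over an (H, W, C) feature map. The pyramid levels and the pooling operator are assumptions for illustration — the patent does not specify the exact module:

```python
import numpy as np

def spatial_pyramid_pool(feat, levels=(1, 2, 4)):
    """Average-pool an (H, W, C) map on several grid resolutions and
    concatenate the results into one fixed-length vector of length
    C * sum(n*n for n in levels)."""
    H, W, C = feat.shape
    pooled = []
    for n in levels:
        for rows in np.array_split(np.arange(H), n):
            for cols in np.array_split(np.arange(W), n):
                pooled.append(feat[np.ix_(rows, cols)].mean(axis=(0, 1)))
    return np.concatenate(pooled)

vec = spatial_pyramid_pool(np.ones((8, 8, 3)))
print(vec.shape)  # (63,) = 3 * (1 + 4 + 16)
```

The different grid resolutions play the role of the "different resolution levels" mentioned above: coarse grids capture global context, fine grids preserve local detail.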
S230, determining a query vector Q, a key value vector K and a weight value vector V according to the first feature vector and/or the second feature vector.
S240, inputting the result QK^T of matrix multiplication of the query vector Q and the key value vector K into the following softmax function (1) to obtain the attention weight among pixels:

softmax(x_i) = (sigmoid(x_i) / (1 - sigmoid(x_i))) / Σ_{j=1}^{c} (sigmoid(x_j) / (1 - sigmoid(x_j)))    (1)
wherein i represents the position of the pixel, the value range of j is [1, c ], and c is the channel number.
The above softmax function (1) can be obtained from the following sigmoid function (2) and standard softmax function (3):

sigmoid(x) = 1 / (1 + e^(-x))    (2)

softmax(x_i) = e^(x_i) / Σ_{j=1}^{c} e^(x_j)    (3)

Rearranging the sigmoid function (2) gives function (4):

e^(x_i) = sigmoid(x_i) / (1 - sigmoid(x_i))    (4)

Substituting function (4) into the softmax function (3) yields the softmax function (1).
S250, multiplying the attention weight and the weight vector V matrix to obtain the attention among the pixels.
The above S240 and S250 may be performed in a transformer.
The transformer architecture used in the embodiments of the present application employs a self-attention mechanism (self-attention) and a cross-attention mechanism (cross-attention): the self-attention mechanism calculates the attention between pixels in the same image, while the cross-attention mechanism calculates the attention of pixels between the left and right images.
The attention calculation adopted in the embodiments of the present application may be multi-head attention: the feature vectors are grouped in the channel dimension and attention is calculated per group, enhancing the expressive capability of the features.
Because computing power and memory are limited in embedded deployments, dot-product attention is computationally intensive, and the softmax operation cannot be accelerated on some embedded platforms. Embodiments of the present application therefore propose an improved attention computation. Fig. 3 shows a schematic diagram of the attention mechanism provided by an embodiment of the present application. As shown in Fig. 3, the query vector Q and the key value vector K are first matrix-multiplied to obtain QK^T, which is input into the softmax function (1) to obtain the attention weight (or similarity score between pixels). The attention weight is then multiplied by the weight vector V matrix to obtain the attention between pixels.
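The full attention computation just described can be sketched in a few lines of numpy. The 1/sqrt(c) score scaling is a common convention assumed here for illustration; the patent text does not state it:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, c = 6, 4                      # n pixels, c channels
Q = rng.normal(size=(n, c))      # query vectors
K = rng.normal(size=(n, c))      # key value vectors
V = rng.normal(size=(n, c))      # weight vectors

scores = Q @ K.T / np.sqrt(c)                        # QK^T
ratio = sigmoid(scores) / (1.0 - sigmoid(scores))    # equals e^scores
weights = ratio / ratio.sum(axis=-1, keepdims=True)  # softmax built from sigmoid
attention = weights @ V                              # attention between pixels
print(attention.shape)  # (6, 4)
```

Every nonlinear step here is a sigmoid evaluation; the remaining operations are matrix multiplications and sums, all of which embedded NPUs typically accelerate.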
Deployment of the transformer on an embedded NPU can be achieved by using the softmax function constructed from the sigmoid function.
Exemplarily, the first image includes a pixel point x1 and the second image includes a pixel point x2. When calculating the cross-attention between x1 and x2, the query vector Q may be obtained by a linear transformation of x1, and the key value vector K and the weight vector V may be obtained by linear transformations of x2. Alternatively, when calculating the cross-attention between x2 and x1, the query vector Q may be obtained by a linear transformation of x2, and the key value vector K and the weight vector V may be obtained by linear transformations of x1.
Exemplarily, when calculating the self-attention of x1, the query vector Q, the key value vector K and the weight vector V may all be obtained by linear transformations of x1.
In one embodiment, the self-attention may be performed on the pixels on the first image or the second image, and then the cross-attention may be performed on the pixels on the first image and the second image.
Optionally, determining the query vector Q, the key value vector K, and the weight vector V according to the first feature vector and the second feature vector includes: determining the query vector Q, the key value vector K and the weight vector V according to the feature vectors in the width W and/or height H directions of the first feature vector; or determining the query vector Q, the key value vector K and the weight vector V according to the feature vectors positioned in the width W and/or height H directions in the second feature vectors; or determining the query vector Q, the key value vector K and the weight vector V according to the feature vector located in the width W direction in the first feature vector and the feature vector located in the width W direction in the second feature vector; or determining the query vector Q, the key value vector K and the weight vector V according to the feature vector located in the height H direction in the first feature vector and the feature vector located in the height H direction in the second feature vector.
In the conventional transformer, for image-related tasks the information in the H and W directions is flattened together and sent to the attention layer to extract related information; the matrix multiplication of the query vector Q and the key value vector K then costs H×W×H×W. In the embodiment of the application, to meet the requirement of embedded deployment, the computational cost must be reduced so that a computation-heavy network, i.e. the transformer, can be deployed on an embedded platform with limited computing power. The embodiment of the application therefore provides a method of extracting information by alternately applying the attention mechanism in the H direction and the W direction. The calculation amount of one pass in the H direction is H^2, that of one pass in the W direction is W^2, and the sum is H^2 + W^2. Fig. 4 shows a schematic diagram of the H-direction and W-direction attention mechanisms provided by an embodiment of the present application.
Since a single H-direction pass plus a single W-direction pass cannot fully replace joint attention over the H and W directions, the embodiment of the present application further proposes alternating the H-direction and W-direction attention calculations several times. This reduces computational complexity without losing precision, enabling application at the embedded end.
The above-described transformer with alternating H-direction and W-direction attention contributes to a reduction in the calculation amount. Testing shows that good results can be obtained with attention in the W direction alone.
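The axis-wise attention described above can be sketched as follows; a minimal numpy illustration of a single W-direction pass over an (H, W, C) feature map, using Q = K = V = feat for brevity (the learned projections and the 1/sqrt(c) scaling are simplifying assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def w_direction_attention(feat):
    """One W-direction attention pass over an (H, W, C) feature map.

    Each row attends only within itself, so the score tensor has W*W
    entries per row rather than the (H*W)^2 of joint attention."""
    c = feat.shape[-1]
    scores = np.einsum('hwc,hvc->hwv', feat, feat) / np.sqrt(c)
    ratio = sigmoid(scores) / (1.0 - sigmoid(scores))
    weights = ratio / ratio.sum(axis=-1, keepdims=True)
    return np.einsum('hwv,hvc->hwc', weights, feat)

feat = np.random.default_rng(1).normal(size=(4, 5, 3))
out = w_direction_attention(feat)
print(out.shape)  # (4, 5, 3)

# Cost comparison from the text, for illustrative sizes H=60, W=80:
H, W = 60, 80
print((H * W) ** 2, H ** 2 + W ** 2)  # joint vs. one H pass + one W pass
```

The H-direction analog is obtained by transposing the first two axes; alternating the two passes approximates joint attention at a fraction of its cost.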
And S260, determining the parallax of each pixel on the first image and the second image according to the attention among the pixels.
The above process of determining the parallax of each pixel on the first image and the second image according to the attention between the pixels may refer to prior-art implementations. For example, fig. 5 shows a schematic diagram of a network structure provided by an embodiment of the present application. After the attention between pixels is obtained, the parallax of each pixel can be obtained through matching cost calculation, cost aggregation, parallax calculation and parallax optimization.
Matching cost calculation
This module describes the correlation of two pixels on the two pictures: the more correlated the pixels, the smaller the cost, and vice versa. Through this module, the pixel point corresponding to the minimum cost can be found, which reflects the correlation.
Cost aggregation
The above module usually considers only the local correlation of a single pixel, which is very sensitive to noise. To make the model more robust, the pixel costs of the neighborhood need to be aggregated through cost aggregation to obtain the optimal parallax.
Parallax computation
The parallax calculation is to select a point with the smallest accumulated cost in the parallax searching range as a corresponding matching point, and the parallax corresponding to the point is the predicted parallax.
Parallax optimization
After obtaining the predicted disparities, it is often necessary to optimize the disparity map as well. Mainly to eliminate false disparities due to noise and to smooth the disparity map.
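The parallax-calculation step of the pipeline above reduces to a winner-takes-all selection over the aggregated costs; a minimal numpy sketch over a toy cost volume (cost aggregation and sub-pixel refinement are omitted here):

```python
import numpy as np

def winner_takes_all(cost_volume):
    """Pick, per pixel, the disparity candidate with minimum cost.

    cost_volume: (H, W, D) array of matching costs for D disparity
    candidates in the disparity search range."""
    return np.argmin(cost_volume, axis=-1)

# Toy cost volume in which disparity candidate 2 is cheapest everywhere.
cv = np.ones((3, 4, 5))
cv[:, :, 2] = 0.1
disp = winner_takes_all(cv)
print(np.unique(disp))  # [2]
```

Parallax optimization would then post-process this disparity map, e.g. filtering out isolated outliers caused by noise and smoothing the result.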
Fig. 6 is a schematic diagram of a parallax prediction result after parallax optimization according to an embodiment of the present application. Through testing, the end-point error (epe) of the calculated disparity is approximately 0.8 pixels, with a run time of about 10 ms on a GPU server and about 60 ms on the embedded platform. Compared with the prior art, the operation speed of the system is significantly improved at similar accuracy, and at the same running speed the accuracy of the system is significantly improved.
The embodiment of the application also provides a binocular vision system, which comprises: a binocular camera 110 for capturing a first image through the left eye and a second image through the right eye; the binocular camera 110 is used for sending the first image and the second image to the processor 120; the processor 120 is configured to perform feature extraction on the first image to obtain a first feature vector and perform feature extraction on the second image to obtain a second feature vector; the processor 120 is further configured to determine a query vector Q, a key value vector K, and a weight vector V according to the first feature vector and/or the second feature vector; the processor 120 is further configured to input the result QK^T of matrix multiplication of the query vector Q and the key value vector K into the following softmax function to obtain the attention weight among pixels:

softmax(x_i) = (sigmoid(x_i) / (1 - sigmoid(x_i))) / Σ_{j=1}^{c} (sigmoid(x_j) / (1 - sigmoid(x_j)))
wherein i represents the position of the pixel, the value range of j is [1, c ], and c is the channel number; the processor 120 is further configured to multiply the attention weight by the weight vector V matrix to obtain the attention between the pixels; the processor 120 is further configured to determine a parallax for each pixel on the first image and the second image based on the attention between the pixels.
Optionally, the processor 120 is configured to: determining the query vector Q, the key value vector K and the weight vector V according to the feature vectors in the width W and/or height H directions of the first feature vector; or determining the query vector Q, the key value vector K and the weight vector V according to the feature vectors positioned in the width W and/or height H directions in the second feature vectors; or determining the query vector Q, the key value vector K and the weight vector V according to the feature vector located in the width W direction in the first feature vector and the feature vector located in the width W direction in the second feature vector; or determining the query vector Q, the key value vector K and the weight vector V according to the feature vector located in the height H direction in the first feature vector and the feature vector located in the height H direction in the second feature vector.
Optionally, the processor 120 is configured to: and according to the attention among the pixel values, performing matching cost calculation, cost aggregation, parallax calculation and parallax optimization to obtain the parallax of each pixel.
It should also be appreciated that the memory in embodiments of the present application may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM) and direct rambus RAM (DR RAM).
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application.
The foregoing is merely illustrative of the present application and is not intended to limit it; any variation or substitution readily conceivable by a person skilled in the art within the scope disclosed herein shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (7)

1. A parallax calculation method, characterized by comprising:
acquiring a first image acquired by a left eye and a second image acquired by a right eye in a binocular camera;
extracting features of the first image to obtain a first feature vector, and extracting features of the second image to obtain a second feature vector;
determining a query vector Q, a key value vector K and a weight vector V according to the first feature vector and/or the second feature vector;
inputting the result of the matrix multiplication of the query vector Q and the key value vector K into the following softmax function to obtain the attention weight between pixels: softmax(x_i) = exp(x_i) / Σ_{j=1}^{c} exp(x_j),
wherein i represents the position of the pixel, j ranges over [1, c], and c is the number of channels;
multiplying the attention weight by the weight vector V to obtain the attention between the pixels; and
and determining the parallax of each pixel on the first image and the second image according to the attention among the pixels.
2. The method according to claim 1, wherein determining a query vector Q, a key value vector K and a weight vector V from the first feature vector and/or the second feature vector comprises:
determining the query vector Q, the key value vector K and the weight vector V according to the feature vectors in the width W and/or height H directions of the first feature vector; or,
determining the query vector Q, the key value vector K and the weight vector V according to the feature vectors in the width W and/or height H directions of the second feature vector; or,
determining the query vector Q, the key value vector K and the weight vector V according to the feature vector in the width W direction of the first feature vector and the feature vector in the width W direction of the second feature vector; or,
and determining the query vector Q, the key value vector K and the weight vector V according to the feature vector in the height H direction of the first feature vector and the feature vector in the height H direction of the second feature vector.
3. The method according to claim 1 or 2, wherein determining the parallax of each pixel on the first image and the second image based on the attention between the pixels comprises:
performing matching cost calculation, cost aggregation, parallax calculation and parallax optimization according to the attention between the pixels, to obtain the parallax of each pixel.
4. A binocular vision system, comprising:
a binocular camera for acquiring a first image through a left eye and a second image through a right eye;
the binocular camera is used for sending the first image and the second image to a processor;
the processor is used for carrying out feature extraction on the first image to obtain a first feature vector, and carrying out feature extraction on the second image to obtain a second feature vector;
the processor is further configured to determine a query vector Q, a key value vector K, and a weight vector V according to the first feature vector and/or the second feature vector;
the processor is further configured to input the result of the matrix multiplication of the query vector Q and the key value vector K into the following softmax function to obtain the attention weight between pixels: softmax(x_i) = exp(x_i) / Σ_{j=1}^{c} exp(x_j),
wherein i represents the position of the pixel, j ranges over [1, c], and c is the number of channels;
the processor is further configured to multiply the attention weight by the weight vector V to obtain the attention between the pixels;
the processor is further configured to determine a parallax for each pixel on the first image and the second image based on the attention between the pixels.
5. The binocular vision system of claim 4, wherein the processor is configured to:
determining the query vector Q, the key value vector K and the weight vector V according to the feature vectors in the width W and/or height H directions of the first feature vector; or,
determining the query vector Q, the key value vector K and the weight vector V according to the feature vectors in the width W and/or height H directions of the second feature vector; or,
determining the query vector Q, the key value vector K and the weight vector V according to the feature vector in the width W direction of the first feature vector and the feature vector in the width W direction of the second feature vector; or,
and determining the query vector Q, the key value vector K and the weight vector V according to the feature vector in the height H direction of the first feature vector and the feature vector in the height H direction of the second feature vector.
6. The binocular vision system of claim 4 or 5, wherein the processor is configured to:
performing matching cost calculation, cost aggregation, parallax calculation and parallax optimization according to the attention between the pixels, to obtain the parallax of each pixel.
7. An agricultural drone comprising a binocular vision system as claimed in any one of claims 4 to 6.
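The attention computation recited in claims 1 and 4 — a softmax over the product of Q and K to obtain attention weights, then matrix multiplication with V — can be sketched as below. This is a hedged illustration, not the claimed implementation: the per-pixel (N, c) layout, the normalization axis, and all names are assumptions; the claims define the softmax sum over j = 1..c but do not fix the tensor shapes.

```python
# Hedged sketch of softmax attention between pixel feature vectors:
# attention_weight = softmax(Q @ K^T), attention = attention_weight @ V.
# Shapes and the normalization axis are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: softmax(x)_i = exp(x_i) / sum_j exp(x_j)
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pixel_attention(q, k, v):
    """q, k, v: (N, c) arrays of N pixel feature vectors with c channels."""
    weights = softmax(q @ k.T, axis=-1)  # (N, N) attention weights, rows sum to 1
    return weights @ v                   # attention between pixels, shape (N, c)
```

Because each row of the weight matrix sums to 1, every output row is a convex combination of the value vectors, which keeps the attended features in the range of the inputs.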
CN202311136488.6A 2023-09-04 2023-09-04 Parallax calculation method, binocular vision system and agricultural unmanned aerial vehicle Pending CN116992200A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311136488.6A CN116992200A (en) 2023-09-04 2023-09-04 Parallax calculation method, binocular vision system and agricultural unmanned aerial vehicle


Publications (1)

Publication Number Publication Date
CN116992200A (en) 2023-11-03

Family

ID=88530323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311136488.6A Pending CN116992200A (en) 2023-09-04 2023-09-04 Parallax calculation method, binocular vision system and agricultural unmanned aerial vehicle

Country Status (1)

Country Link
CN (1) CN116992200A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113592940A (en) * 2021-07-28 2021-11-02 北京地平线信息技术有限公司 Method and device for determining position of target object based on image
CN114758190A (en) * 2022-04-08 2022-07-15 黑龙江惠达科技发展有限公司 Method for training image recognition model, image recognition method and device and agricultural machinery
CN115546515A (en) * 2022-08-31 2022-12-30 北京鉴智科技有限公司 Depth information acquisition method and device
CN116402876A (en) * 2023-03-30 2023-07-07 深圳市优必选科技股份有限公司 Binocular depth estimation method, binocular depth estimation device, embedded equipment and readable storage medium
CN116674751A (en) * 2023-08-03 2023-09-01 黑龙江惠达科技股份有限公司 Method for controlling agricultural unmanned aerial vehicle to spread materials and agricultural unmanned aerial vehicle


Similar Documents

Publication Publication Date Title
US11348270B2 (en) Method for stereo matching using end-to-end convolutional neural network
KR102459853B1 (en) Method and device to estimate disparity
US10621446B2 (en) Handling perspective magnification in optical flow processing
Perri et al. Adaptive Census Transform: A novel hardware-oriented stereovision algorithm
WO2009096912A1 (en) Method and system for converting 2d image data to stereoscopic image data
CN113537254B (en) Image feature extraction method and device, electronic equipment and readable storage medium
WO2023240764A1 (en) Hybrid cost body binocular stereo matching method, device and storage medium
Li et al. MANET: Multi-scale aggregated network for light field depth estimation
US9495611B2 (en) Image processing apparatus, image processing method, and image processing program
Cantrell et al. Practical Depth Estimation with Image Segmentation and Serial U-Nets.
CN214587004U (en) Stereo matching acceleration circuit, image processor and three-dimensional imaging electronic equipment
CN111762155B (en) Vehicle distance measuring system and method
CN116992200A (en) Parallax calculation method, binocular vision system and agricultural unmanned aerial vehicle
CN108062765A (en) Binocular image processing method, imaging device and electronic equipment
CN117475182B (en) Stereo matching method based on multi-feature aggregation
Vala et al. High-speed low-complexity guided image filtering-based disparity estimation
CN116630388A (en) Thermal imaging image binocular parallax estimation method and system based on deep learning
CN116957999A (en) Depth map optimization method, device, equipment and storage medium
CN117314990A (en) Non-supervision binocular depth estimation method and system based on shielding decoupling network
KR20150102011A (en) Target image generation utilizing a functional based on functions of information from other images
CN115311168A (en) Depth estimation method for multi-view visual system, electronic device and medium
CN114973410A (en) Method and device for extracting motion characteristics of video frame
CN108062741B (en) Binocular image processing method, imaging device and electronic equipment
US12131493B2 (en) Apparatus and method for generating depth map using monocular image
CN115035545B (en) Target detection method and device based on improved self-attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination