
CN111932584B - Method and device for determining moving object in image

Info

Publication number
CN111932584B
Authority
CN
China
Prior art keywords
moment
parallax
network model
image
training data
Prior art date
Legal status
Active
Application number
CN202010671470.6A
Other languages
Chinese (zh)
Other versions
CN111932584A (en)
Inventor
王晓鲁
卢维
任宇鹏
殷俊
伊进延
Current Assignee
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202010671470.6A
Publication of CN111932584A
Application granted
Publication of CN111932584B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/11 Region-based segmentation
    • G06T 7/194 Segmentation involving foreground-background segmentation
    • G06T 7/215 Motion-based segmentation
    • G06T 7/269 Analysis of motion using gradient-based methods
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]


Abstract

The invention provides a method and a device for determining a moving object in an image. The method includes: acquiring a group of images shot by a binocular camera at two continuous moments; inputting the group of images into a scene flow network model to obtain the scene flow of the first moment monocular image output by the scene flow network model; and determining the moving object in the first moment monocular image according to that scene flow. The invention solves the problem in the related art of low detection precision when detecting a specific moving object in a video image sequence, thereby achieving the effect of improving detection precision.

Description

Method and device for determining moving object in image
Technical Field
The present invention relates to the field of communications, and in particular, to a method and apparatus for determining a moving object in an image.
Background
Moving object detection is to detect a specific moving object in a video image sequence, separate it from the background, and calculate information such as its position. It is the basis of tasks such as tracking of the moving object and behavior analysis of the object, and has wide application in fields such as intelligent security, intelligent traffic and environment sensing for robots.
Common motion detection methods at present include the inter-frame difference method, the optical flow method and the scene flow method. The inter-frame difference method has inherent limitations: it easily produces holes and cannot recover a complete moving object. Traditional optical flow and scene flow estimation algorithms suffer from high computation time and low precision caused by repeated iterative optimization.
For the problem of low detection precision when detecting a specific moving object in a video image sequence in the related art, no effective solution has yet been proposed.
Disclosure of Invention
The embodiment of the invention provides a method and a device for determining a moving object in an image, which at least solve the problem of low detection precision of detecting a specific moving object in a video image sequence in the related technology.
According to an embodiment of the present invention, there is provided a method of determining a moving object in an image, including: acquiring a group of images shot by a binocular camera at two continuous moments, wherein the two continuous moments comprise a first moment and a second moment, the binocular camera comprises a left-eye camera and a right-eye camera, the images shot by the left-eye camera at the first moment are left-eye images at the first moment, the images shot by the right-eye camera at the first moment are right-eye images at the first moment, the images shot by the left-eye camera at the second moment are left-eye images at the second moment, and the images shot by the right-eye camera at the second moment are right-eye images at the second moment;
Inputting the set of images into a scene flow network model to obtain a scene flow of a first moment monocular image output by the scene flow network model, wherein the first moment monocular image is the first moment left eye image or the first moment right eye image, the scene flow network model is obtained by using a plurality of sets of first training data through machine learning training, and each set of first training data in the plurality of sets of first training data comprises: the binocular camera shoots obtained images and scene flow true values at two continuous moments; and determining the moving object in the monocular image at the first moment according to the scene flow of the monocular image at the first moment.
Optionally, determining the moving object in the first time monocular image according to the scene flow of the first time monocular image includes: analyzing the scene flow through a clustering algorithm to obtain a motion area in the monocular image at the first moment; dividing an instance object in the first time monocular image by using a first instance division model to obtain a first instance object, wherein the first instance division model is obtained by using a plurality of sets of second training data through machine learning training, and each set of second training data in the plurality of sets of second training data comprises: an image and a first segmentation truth value; and matching the motion area with the first instance object to obtain the motion object.
Optionally, the scene flow network model includes a parallax network model, a first optical flow network model and a second instance segmentation model, where inputting the set of images to the scene flow network model, obtaining a scene flow of a first moment monocular image output by the scene flow network model includes: inputting the first moment left eye image and the first moment right eye image into the parallax network model to obtain first parallax output by the parallax network model, and inputting the second moment left eye image and the second moment right eye image into the parallax network model to obtain second parallax output by the parallax network model, wherein the parallax network model is obtained by machine learning training by using multiple groups of third training data, and each group of third training data in the multiple groups of third training data comprises: the binocular camera shoots two frames of images and a parallax true value at the same moment; inputting the first moment left eye image and the second moment left eye image into the first optical flow network model to obtain a left view optical flow output by the first optical flow network model, wherein the first optical flow network model is obtained by machine learning training by using multiple groups of fourth training data, and each group of fourth training data in the multiple groups of fourth training data comprises: the left-eye camera in the binocular cameras shoots images and optical flow truth values at two continuous moments; dividing an example object in the first moment left eye image by using a second example division model to obtain a second example object, wherein the second example division model is obtained by using a plurality of groups of fifth training data through machine learning training, and each group of fifth training data in the plurality of groups of fifth training data comprises: an image and a second segmentation truth value; a scene flow of the first moment left-eye image is determined from the first disparity, the second disparity, the left-view optical flow, and the second instance object.
Optionally, determining a scene flow of the first moment left eye image according to the first disparity, the second disparity, the left view optical flow, and the second instance object includes: performing motion vector translation on the second parallax according to the left view optical flow to obtain a first mapping parallax for mapping the second parallax to the first moment; optimizing the first mapping parallax by using the first moment left eye image and the second instance object to obtain a first optimized mapping parallax; and obtaining a scene flow of the left-eye image at the first moment through the first optimized mapping parallax, the first parallax and the left-view optical flow.
Optionally, a first loss function between the scene flow of the first moment left eye image output by the scene flow network model and a predetermined known scene flow of the first moment left eye image meets a first target convergence condition, and the first target convergence condition is used for indicating that an output value of the first loss function is within a first predetermined range.
Optionally, the scene flow network model includes a parallax network model, a second optical flow network model and a second instance segmentation model, where inputting the set of images to the scene flow network model, obtaining a scene flow of a first moment monocular image output by the scene flow network model includes: inputting the first moment left eye image and the first moment right eye image into the parallax network model to obtain first parallax output by the parallax network model, and inputting the second moment left eye image and the second moment right eye image into the parallax network model to obtain second parallax output by the parallax network model, wherein the parallax network model is obtained by machine learning training by using multiple groups of third training data, and each group of third training data in the multiple groups of third training data comprises: the binocular camera shoots two frames of images and a parallax true value at the same moment; inputting the first moment right eye image and the second moment right eye image into the second optical flow network model to obtain a right view optical flow output by the second optical flow network model, wherein the second optical flow network model is obtained by machine learning training by using a plurality of groups of sixth training data, and each group of sixth training data in the plurality of groups of sixth training data comprises: the right-eye camera in the binocular cameras shoots images and optical flow truth values at two continuous moments; dividing an example object in the right eye image at the first moment by using the second example division model to obtain a third example object, wherein the second example division model is obtained by using a plurality of groups of fifth training data through machine learning training, and each group of fifth training data in the plurality of groups of fifth training data comprises: an image and a second segmentation truth value; a scene flow of the right eye image at the first moment is determined from the first disparity, the second disparity, the right view optical flow, and the third instance object.
Optionally, determining a scene flow of the right eye image at the first moment according to the first disparity, the second disparity, the right view optical flow, and the third instance object includes: performing motion vector translation on the second parallax according to the right view optical flow to obtain a second mapping parallax for mapping the second parallax to the first moment; optimizing the second mapping parallax by using the right eye image at the first moment and the third example object to obtain a second optimized mapping parallax; and obtaining a scene flow of the right eye image at the first moment through the second optimized mapping parallax, the first parallax and the right view optical flow.
Optionally, a second loss function between the scene flow of the first moment right eye image output by the scene flow network model and a predetermined known scene flow of the first moment right eye image meets a second target convergence condition, and the second target convergence condition is used for indicating that an output value of the second loss function is within a second predetermined range.
According to another embodiment of the present invention, there is provided a determination apparatus of a moving object in an image, including an acquisition module, an output module and a determining module. The acquisition module is used for acquiring a group of images shot by a binocular camera at two continuous moments, wherein the two continuous moments comprise a first moment and a second moment, the binocular camera comprises a left-eye camera and a right-eye camera, the images shot by the left-eye camera at the first moment are left-eye images at the first moment, the images shot by the right-eye camera at the first moment are right-eye images at the first moment, the images shot by the left-eye camera at the second moment are left-eye images at the second moment, and the images shot by the right-eye camera at the second moment are right-eye images at the second moment. The output module is configured to input the set of images into a scene flow network model to obtain a scene flow of a first time monocular image output by the scene flow network model, where the first time monocular image is the first time left eye image or the first time right eye image, the scene flow network model is obtained by using multiple sets of first training data through machine learning training, and each set of first training data in the multiple sets of first training data includes: the binocular camera shoots obtained images and scene flow true values at two continuous moments. The determining module is used for determining the moving object in the monocular image at the first moment according to the scene flow of the monocular image at the first moment.
According to a further embodiment of the application, there is also provided a storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to a further embodiment of the application, there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to the application, a group of images shot by the binocular camera at two continuous moments is acquired, the group of images is input into the scene flow network model to obtain the scene flow of the first moment monocular image output by the model, and the moving object in the first moment monocular image is determined according to that scene flow. Because the detection uses the scene flow, which is sensitive to motion, the moving target region in the scene can be detected accurately and completely: the moving target region is extracted from the scene flow, and the moving objects within it are segmented by combining instance segmentation. Computing the scene flow with a neural network preserves the computation accuracy of the scene flow while achieving a higher computation speed than traditional methods. The problem of low detection precision when detecting a specific moving object in a video image sequence can therefore be solved, achieving the effect of improving detection precision.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
fig. 1 is a block diagram of a hardware configuration of a mobile terminal of a method for determining a moving object in an image according to an embodiment of the present application;
fig. 2 is a flowchart of a method of determining a moving object in an image according to an embodiment of the present application;
FIG. 3 is a schematic illustration of segmentation of a moving target region in accordance with an alternative embodiment of the present application;
FIG. 4 is a schematic diagram of a scene flow network model according to an alternative embodiment of the application;
fig. 5 is a schematic diagram of a network architecture of a FlowNetCorr according to an alternative embodiment of the present application;
FIG. 6 is an optimization schematic according to an alternative embodiment of the application;
FIG. 7 is a schematic diagram of a network architecture of an Encoder-Decoder according to an alternative embodiment of the present application;
fig. 8 is a block diagram of a configuration of a determination apparatus of a moving object in an image according to an embodiment of the present application.
Detailed Description
The application will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
The method according to the first embodiment of the present application may be implemented in a mobile terminal, a computer terminal or a similar computing device. Taking the mobile terminal as an example, fig. 1 is a block diagram of a hardware structure of the mobile terminal according to a method for determining a moving object in an image according to an embodiment of the present application. As shown in fig. 1, the mobile terminal 10 may include one or more (only one is shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA) and a memory 104 for storing data, and optionally a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and not limiting of the structure of the mobile terminal described above. For example, the mobile terminal 10 may also include more or fewer components than shown in FIG. 1 or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a method for determining a moving object in an image in an embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104, thereby performing various functional applications and data processing, that is, implementing the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission means 106 is arranged to receive or transmit data via a network. The specific examples of networks described above may include wireless networks provided by the communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
In this embodiment, there is provided a method for determining a moving object in an image running on the mobile terminal, and fig. 2 is a flowchart of a method for determining a moving object in an image according to an embodiment of the present invention, as shown in fig. 2, where the flowchart includes the following steps:
step S202, a group of images shot by a binocular camera at two continuous moments is obtained, wherein the two continuous moments comprise a first moment and a second moment, the binocular camera comprises a left-eye camera and a right-eye camera, the images shot by the left-eye camera at the first moment are left-eye images at the first moment, the images shot by the right-eye camera at the first moment are right-eye images at the first moment, the images shot by the left-eye camera at the second moment are left-eye images at the second moment, and the images shot by the right-eye camera at the second moment are right-eye images at the second moment;
step S204, inputting the set of images into a scene flow network model to obtain a scene flow of a first time monocular image output by the scene flow network model, where the first time monocular image is the first time left eye image or the first time right eye image, the scene flow network model is obtained by using multiple sets of first training data through machine learning training, and each set of first training data in the multiple sets of first training data includes: the binocular camera shoots obtained images and scene flow true values at two continuous moments;
Step S206, determining the moving object in the monocular image at the first moment according to the scene flow of the monocular image at the first moment.
As an alternative embodiment, for the binocular camera at two continuous moments, a first moment t and a second moment t+1, the group of images captured by the left and right cameras comprises four images: the first moment left eye image left_image_t0, the first moment right eye image right_image_t0, the second moment left eye image left_image_t1, and the second moment right eye image right_image_t1. In this embodiment, either the left image or the right image may be used as the reference image. When the left image is used as the reference image, the first moment monocular image is the first moment left eye image left_image_t0; when the right image is used as the reference image, it is the first moment right eye image right_image_t0.
As an alternative implementation manner, the scene flow network model is obtained through machine learning training using multiple sets of training data. Inputting one group of images into the scene flow network model yields the scene flow S_left(u, v, z) of the first moment left eye image left_image_t0, or equally the scene flow S_right(u, v, z) of the right eye image right_image_t0. The scene flow can then be used to determine the objects moving in the first moment left eye image left_image_t0, or the objects moving in the first moment right eye image right_image_t0.
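The following sketch illustrates how one such group of four images could be assembled and handed to a trained scene flow network. It is a minimal illustration only: the loader, the file names, and the scene_flow_net callable are hypothetical placeholders assumed for the example, not part of the embodiment.

```python
# Minimal sketch of preparing one image group for the scene flow network.
# `load_image`, the file names, and `scene_flow_net` are hypothetical
# placeholders assumed for illustration.
import numpy as np
from PIL import Image

def load_image(path):
    # Returns an HxWx3 float32 array scaled to [0, 1].
    return np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0

left_t0  = load_image("left_image_t0.png")   # first moment left eye image
right_t0 = load_image("right_image_t0.png")  # first moment right eye image
left_t1  = load_image("left_image_t1.png")   # second moment left eye image
right_t1 = load_image("right_image_t1.png")  # second moment right eye image

# With the left image as reference, a trained model would map the group to a
# dense per-pixel scene flow S_left(u, v, z): channels 0-1 hold the
# image-plane motion (u, v), channel 2 the depth-direction change z.
# S_left = scene_flow_net(left_t0, right_t0, left_t1, right_t1)
```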
Through the above steps, a group of images shot by the binocular camera at two continuous moments is acquired, the group of images is input into the scene flow network model to obtain the scene flow of the first moment monocular image output by the model, and the moving object in the first moment monocular image is determined according to that scene flow. Because the detection uses the scene flow, which is sensitive to motion, the moving target region in the scene can be detected accurately and completely: the moving target region is extracted from the scene flow, and the moving objects within it are segmented by combining instance segmentation. Computing the scene flow with a neural network preserves the computation accuracy of the scene flow while achieving a higher computation speed than traditional methods. The problem of low detection precision when detecting a specific moving object in a video image sequence can therefore be solved, achieving the effect of improving detection precision.
Alternatively, the execution subject of the above steps may be a terminal or the like, but is not limited thereto.
Optionally, determining the moving object in the first time monocular image according to the scene flow of the first time monocular image includes: analyzing the scene flow through a clustering algorithm to obtain a motion area in the monocular image at the first moment; dividing an instance object in the first time monocular image by using a first instance division model to obtain a first instance object, wherein the first instance division model is obtained by using a plurality of sets of second training data through machine learning training, and each set of second training data in the plurality of sets of second training data comprises: an image and a first segmentation truth value; and matching the motion area with the first instance object to obtain the motion object.
As an alternative embodiment, a clustering method may be used to segment the moving target region in the first moment left eye image left_image_t0 or the first moment right eye image right_image_t0. As shown in fig. 3, a schematic diagram of moving target region segmentation, the procedure mainly includes the following 3 steps:
Step S1, smoothing the scene flow by means of mean filtering;
Step S2, clustering the scene flow using the ISODATA clustering algorithm;
Step S3, separating the motion region from the background region and extracting the moving target region.
Due to the complexity of the scene, the scene flow computed by the network may contain a certain degree of noise. Before the moving target region is segmented, the scene flow may be preprocessed with a mean filter, for example a 5×5 mean filter; the specific filter can be selected according to the actual situation and is not limited here.
The clustering algorithm uses iterative self-organizing data analysis (the ISODATA clustering algorithm). ISODATA can remove a class when too few samples belong to it, and split a class into two sub-classes when its sample count and degree of dispersion are too large; that is, it automatically performs 'merging' and 'splitting' of classes, yielding clusters with a more reasonable number of categories. The preprocessed scene flow is clustered with the ISODATA algorithm, and the resulting cluster categories comprise motion regions and background regions. Specifically, the mean scene flow of each of the N cluster categories may be computed and denoted M_i(u, v, z), i = 1, 2, ..., N; a threshold T(u, v, z) is set, and the mean scene flow M_i(u, v, z) of each cluster region is compared with T(u, v, z). When the mean scene flow of a cluster region is larger than the threshold, the region is a motion region; otherwise it is a background region. The motion region extracted in this way may contain a single moving object or several.
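A minimal sketch of steps S1 to S3 follows. Note the assumptions: scikit-learn's KMeans stands in for ISODATA (which additionally merges and splits clusters automatically), and the cluster count, the threshold T, and the per-component comparison rule are illustrative choices that this embodiment does not fix.

```python
# Sketch of motion-region extraction from a dense scene flow (HxWx3 array).
# KMeans is used here in place of ISODATA; the threshold T(u, v, z), the
# cluster count, and the per-component comparison are assumptions.
import numpy as np
from scipy.ndimage import uniform_filter
from sklearn.cluster import KMeans

def segment_motion_region(scene_flow, n_clusters=4, T=(0.5, 0.5, 0.1)):
    h, w, _ = scene_flow.shape
    # Step S1: smooth with a 5x5 mean filter, applied per channel.
    smoothed = uniform_filter(scene_flow, size=(5, 5, 1))
    # Step S2: cluster the per-pixel flow vectors.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        smoothed.reshape(-1, 3)).reshape(h, w)
    # Step S3: clusters whose mean flow M_i(u, v, z) exceeds T(u, v, z)
    # are motion regions; the rest are background.
    motion_mask = np.zeros((h, w), dtype=bool)
    for i in range(n_clusters):
        sel = labels == i
        if not sel.any():
            continue
        M_i = np.abs(smoothed[sel].mean(axis=0))
        if np.all(M_i > np.asarray(T)):
            motion_mask |= sel
    return motion_mask
```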
The segmented moving target region may contain multiple moving objects, so each object instance within it is separated by an instance segmentation method based on the color image. In this embodiment, the first instance segmentation model may be a deep-learning-based Mask-RCNN network, and performing instance segmentation on the left view image left_image_t0 or the right view image right_image_t0 with this network yields the mask of each instance. Taking the intersection of the instance segmentation mask instance_mask and the motion region mask object_mask then gives the final individual moving targets. Performing moving object detection through scene flow motion region segmentation combined with instance segmentation exploits both the scene flow's sensitivity to object motion and instance segmentation's ability to identify potential object instances in the scene.
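The matching step can be sketched as follows. The boolean-mask representation and the min_overlap criterion are assumptions added for illustration; the final intersection instance_mask ∩ object_mask follows the description above.

```python
# Sketch of matching instance masks against the motion-region mask.
# `instance_masks` is assumed to be a list of boolean HxW masks, e.g. from a
# Mask-RCNN detector; min_overlap is an illustrative assumption.
import numpy as np

def extract_moving_objects(motion_mask, instance_masks, min_overlap=0.5):
    moving = []
    for instance_mask in instance_masks:
        inter = np.logical_and(instance_mask, motion_mask)
        if inter.sum() / max(instance_mask.sum(), 1) >= min_overlap:
            moving.append(inter)  # the intersection is the final moving target
    return moving
```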
Optionally, the scene flow network model includes a parallax network model, a first optical flow network model and a second instance segmentation model, where inputting the set of images to the scene flow network model, obtaining a scene flow of a first moment monocular image output by the scene flow network model includes: inputting the first moment left eye image and the first moment right eye image into the parallax network model to obtain first parallax output by the parallax network model, and inputting the second moment left eye image and the second moment right eye image into the parallax network model to obtain second parallax output by the parallax network model, wherein the parallax network model is obtained by machine learning training by using multiple groups of third training data, and each group of third training data in the multiple groups of third training data comprises: the binocular camera shoots two frames of images and a parallax true value at the same moment; inputting the first moment left eye image and the second moment left eye image into the first optical flow network model to obtain a left view optical flow output by the first optical flow network model, wherein the first optical flow network model is obtained by machine learning training by using multiple groups of fourth training data, and each group of fourth training data in the multiple groups of fourth training data comprises: the left-eye camera in the binocular cameras shoots images and optical flow truth values at two continuous moments; dividing an example object in the first moment left eye image by using a second example division model to obtain a second example object, wherein the second example division model is obtained by using a plurality of groups of fifth training data through machine learning training, and each group of fifth training data in the plurality of groups of fifth training data comprises: an image and a second segmentation truth value; a scene flow of the first moment left-eye image is determined from the first disparity, the second disparity, the left-view optical flow, and the second instance object.
As an alternative implementation manner, in this embodiment the left view image left_image_t0 is taken as the reference image. As shown in fig. 4, a schematic of the scene flow network model, the model comprises a parallax network model, a first optical flow network model and a second instance segmentation model. left_image_t0 and right_image_t0 are input into the parallax network model to compute the first parallax disp_t0, and left_image_t1 and right_image_t1 are input into the parallax network model to compute the second parallax disp_t1; the parallax network model may be a DispNet network. left_image_t0 and left_image_t1 are input into the optical flow network model to compute the left view optical flow; the optical flow network model may be a FlowNet network. Instance segmentation is performed on left_image_t0 using the second instance segmentation model, a Mask-RCNN network, to obtain an instance mask (corresponding to the second instance object). In this embodiment, DispNet and FlowNet may share the same network structure; both have a simple version and a Corr version, and the DispNetCorr and FlowNetCorr versions are used here. Fig. 5 shows the network structure of FlowNetCorr (the structure of DispNetCorr is identical), and fig. 6 shows the optimization module indicated in fig. 5.
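How these sub-networks feed one another can be summarized in the sketch below. The call signatures are assumptions: disp_net, flow_net, mask_rcnn and refine stand for trained DispNetCorr, FlowNetCorr, Mask-RCNN and Encoder-Decoder models, and warp_disparity / compose_scene_flow are defined in the later sketch following the warping discussion.

```python
# Sketch of the scene flow forward pass of fig. 4 for the left reference view.
# All model interfaces are assumptions; warp_disparity and compose_scene_flow
# are sketched after the warping discussion below.
def scene_flow_forward(left_t0, right_t0, left_t1, right_t1,
                       disp_net, flow_net, mask_rcnn, refine):
    disp_t0 = disp_net(left_t0, right_t0)       # first parallax
    disp_t1 = disp_net(left_t1, right_t1)       # second parallax
    flow = flow_net(left_t0, left_t1)           # left view optical flow
    instance_mask = mask_rcnn(left_t0)          # second instance object
    disp_warp = warp_disparity(disp_t1, flow)   # map disp_t1 to time t0
    disp_1 = refine(left_t0, instance_mask, disp_warp)  # optimized parallax
    return compose_scene_flow(disp_t0, disp_1, flow)    # S_left(u, v, z)
```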
Optionally, determining a scene flow of the first moment left eye image according to the first disparity, the second disparity, the left view optical flow, and the second instance object includes: performing motion vector translation on the second parallax according to the left view optical flow to obtain a first mapping parallax for mapping the second parallax to the first moment; optimizing the first mapping parallax by using the first moment left eye image and the second instance object to obtain a first optimized mapping parallax; and obtaining a scene flow of the left-eye image at the first moment through the first optimized mapping parallax, the first parallax and the left-view optical flow.
As an alternative embodiment, disp_t1 is translated by the motion vectors given by the optical flow, so that the pixel coordinates of each point become its coordinates at time t0, yielding the mapped parallax disp_warp. Consider a point P in the scene projected onto the image plane at time t0 at position (x0, y0); after P moves, its position on the image becomes (x1, y1). Then (u, v) = (x1 - x0, y1 - y0) is the motion vector of this point on the image, i.e. the value of the optical flow at (x0, y0). The parallax of this point at time t1 is disp_t1(x1, y1), and disp_warp(x1 - u, y1 - v) = disp_t1(x1, y1), that is, disp_warp(x0, y0) = disp_t1(x1, y1). Applying this operation to every point yields disp_warp. The instance mask (corresponding to the second instance object) is computed from the left view left_image_t0 of frame t0 using the Mask-RCNN instance segmentation network. Using left_image_t0, the instance mask and disp_warp, the optimization module produces the optimized mapped parallax disp_1. The optimization module adopts an Encoder-Decoder network structure, shown schematically in fig. 7: the input data is convolved and then deconvolved to obtain the optimized disp_1. A label such as [H/2, W/2, 32] means that the feature map output by that convolution layer has half the original height and width and 32 channels; the other layers are analogous. The arrows indicate skip links, whose feature maps are concatenated along the channel dimension. Combining the optimized mapped parallax disp_1, the first parallax disp_t0 and the left view optical flow yields the scene flow result S_left(u, v, z). Specifically, since disp_t0(x, y) and disp_1(x, y) represent the depth of the point (x, y) at the two times t0 and t1 respectively, the change in depth (the z direction) is known; flow(x, y) gives the motion vector of point (x, y) in the x and y directions from t0 to t1. Combining disp_t0, disp_1 and flow therefore gives the motion of every point in three directions, from which the scene flow result is computed. In this embodiment, the scene flow computation uses a deep-learning-based method, so the dense scene flow S(u, v, z) can be computed rapidly.
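The warping and composition just described can be sketched in NumPy as follows. Rounding the flow to integer target coordinates is a simplifying assumption (the embodiment does not specify an interpolation scheme), and the depth change is expressed directly as the per-pixel disparity difference, following the description above.

```python
# Sketch of disparity warping and scene flow composition, assuming HxW
# disparity maps and an HxWx2 optical flow; nearest-pixel sampling is a
# simplifying assumption.
import numpy as np

def warp_disparity(disp_t1, flow):
    # disp_warp(x0, y0) = disp_t1(x0 + u, y0 + v), with (u, v) = flow(x0, y0).
    h, w = disp_t1.shape
    y0, x0 = np.mgrid[0:h, 0:w]
    x1 = np.clip(np.rint(x0 + flow[..., 0]).astype(int), 0, w - 1)
    y1 = np.clip(np.rint(y0 + flow[..., 1]).astype(int), 0, h - 1)
    return disp_t1[y1, x1]

def compose_scene_flow(disp_t0, disp_1, flow):
    # (u, v) comes from the optical flow; z from the disparity change.
    z = disp_1 - disp_t0
    return np.dstack([flow[..., 0], flow[..., 1], z])
```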
Optionally, a first loss function between the scene flow of the first moment left eye image output by the scene flow network model and a predetermined known scene flow of the first moment left eye image meets a first target convergence condition, and the first target convergence condition is used for indicating that an output value of the first loss function is within a first predetermined range.
As an alternative embodiment, the scene flow network sets two loss functions to guide the network convergence process. The first loss function, Loss1, is computed from the parallax network output disp_t0 and the parallax disp_warp obtained by mapping disp_t1 through the left view optical flow. The second loss function, Loss2, is computed from the scene flow and the sample ground truth. Specifically, for Loss1: the depth change can be computed from disp_warp and disp_t0 and compared with the ground-truth change, giving Loss1 = sum(abs(Δdisp - Δdisp_gt)), where Δdisp = disp_warp - disp_t0, sum denotes summation over all pixels, and abs denotes the absolute value. If the ground truth is given in the form of two disparity maps, the difference between disp_warp and disp_1_gt can be computed directly: Loss1 = sum(abs(disp_warp - disp_1_gt)). Loss2 is computed in the same way as Loss1 but using the optimized disp_1: Loss2 = sum(abs(disp_1 - disp_1_gt)).
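Written out, the two losses take the following form. The array layout mirrors the formulas above; the ground-truth inputs are assumed to be supplied with the training data.

```python
# Sketch of the two training losses, assuming HxW NumPy disparity maps.
import numpy as np

def loss1(disp_warp, disp_t0, disp_warp_gt, disp_t0_gt):
    # Loss1 = sum(abs(d_disp - d_disp_gt)), d_disp = disp_warp - disp_t0.
    d_disp = disp_warp - disp_t0
    d_disp_gt = disp_warp_gt - disp_t0_gt
    return np.abs(d_disp - d_disp_gt).sum()

def loss1_from_maps(disp_warp, disp_1_gt):
    # Variant when the truth is given directly as a mapped disparity map.
    return np.abs(disp_warp - disp_1_gt).sum()

def loss2(disp_1, disp_1_gt):
    # Same form as Loss1, but computed on the optimized disparity disp_1.
    return np.abs(disp_1 - disp_1_gt).sum()
```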
Optionally, the scene flow network model includes a parallax network model, a second optical flow network model and a second instance segmentation model, where inputting the set of images to the scene flow network model, obtaining a scene flow of a first moment monocular image output by the scene flow network model includes: inputting the first moment left eye image and the first moment right eye image into the parallax network model to obtain first parallax output by the parallax network model, and inputting the second moment left eye image and the second moment right eye image into the parallax network model to obtain second parallax output by the parallax network model, wherein the parallax network model is obtained by machine learning training by using multiple groups of third training data, and each group of third training data in the multiple groups of third training data comprises: the binocular camera shoots two frames of images and a parallax true value at the same moment; inputting the first moment right eye image and the second moment right eye image into the second optical flow network model to obtain a right view optical flow output by the second optical flow network model, wherein the second optical flow network model is obtained by machine learning training by using a plurality of groups of sixth training data, and each group of sixth training data in the plurality of groups of sixth training data comprises: the right-eye camera in the binocular cameras shoots images and optical flow truth values at two continuous moments; dividing an example object in the right eye image at the first moment by using the second example division model to obtain a third example object, wherein the second example division model is obtained by using a plurality of groups of fifth training data through machine learning training, and each group of fifth training data in the plurality of groups of fifth training data comprises: an image and a second segmentation truth value; a scene flow of the right eye image at the first moment is determined from the first disparity, the second disparity, the right view optical flow, and the third instance object.
As an optional implementation manner, in this embodiment the right view image right_image_t0 is taken as the reference image. The scene flow model comprises a parallax network model, a second optical flow network model and a second instance segmentation model. left_image_t0 and right_image_t0 are input into the parallax network model to compute the first parallax disp_t0, and left_image_t1 and right_image_t1 are input into the parallax network model to compute the second parallax disp_t1; the parallax network model may be a DispNet network. right_image_t0 and right_image_t1 are input into the optical flow network model to compute the right view optical flow; the optical flow network model may be a FlowNet network. Instance segmentation is performed on right_image_t0 using the second instance segmentation model, a Mask-RCNN network, to obtain an instance mask. In this embodiment, DispNet and FlowNet may share the same network structure; both networks have a simple version and a Corr version, and the DispNetCorr and FlowNetCorr versions are used.
Optionally, determining a scene flow of the right eye image at the first moment according to the first disparity, the second disparity, the right view optical flow, and the third instance object includes: performing motion vector translation on the second parallax according to the right view optical flow to obtain a second mapping parallax for mapping the second parallax to the first moment; optimizing the second mapping parallax by using the right eye image at the first moment and the third example object to obtain a second optimized mapping parallax; and obtaining a scene flow of the right eye image at the first moment through the second optimized mapping parallax, the first parallax and the right view optical flow.
As an alternative embodiment, disp_t1 is translated by the motion vectors given by the optical flow, so that the pixel coordinates of each point become its coordinates at time t0, yielding the mapped parallax disp_warp. A point P in the scene is projected onto the image plane at time t0 at position (x0, y0); after P moves, its position on the image becomes (x1, y1). Then (u, v) = (x1 - x0, y1 - y0) is the motion vector of this point on the image, i.e. the value of the optical flow at (x0, y0). The parallax of this point at time t1 is disp_t1(x1, y1), and disp_warp(x1 - u, y1 - v) = disp_t1(x1, y1), that is, disp_warp(x0, y0) = disp_t1(x1, y1). Applying this operation to every point yields disp_warp. The instance mask is computed from the right view right_image_t0 of frame t0 using the Mask-RCNN instance segmentation network. Using right_image_t0, the instance mask and disp_warp, the optimization module produces the optimized mapped parallax disp_1. The optimization module adopts the Encoder-Decoder network structure: the input data is convolved and then deconvolved to obtain the optimized disp_1, a label such as [H/2, W/2, 32] meaning that the layer outputs a feature map of half the original height and width with 32 channels, and the arrows indicating skip links concatenated along the channel dimension. Combining the optimized mapped parallax disp_1, the first parallax disp_t0 and the right view optical flow yields the scene flow result S_right(u, v, z). Since disp_t0(x, y) and disp_1(x, y) represent the depth of the point (x, y) at the two times t0 and t1 respectively, the change in depth (the z direction) is known; flow(x, y) gives the motion vector of point (x, y) in the x and y directions from t0 to t1. Combining disp_t0, disp_1 and flow therefore gives the motion of every point in three directions, from which the scene flow result is computed. In this embodiment, the scene flow computation uses a deep-learning-based method, so the dense scene flow S(u, v, z) can be computed rapidly.
Optionally, a second loss function between the scene flow of the first moment right eye image output by the scene flow network model and a predetermined known scene flow of the first moment right eye image meets a second target convergence condition, and the second target convergence condition is used for indicating that an output value of the second loss function is within a second predetermined range.
As an alternative embodiment, the scene flow network sets two loss functions to guide the network convergence process. The first loss function, Loss1, is computed from the parallax network output disp_t0 and the parallax disp_warp obtained by mapping disp_t1 through the right view optical flow. The second loss function, Loss2, is computed from the scene flow and the sample ground truth. Specifically, for Loss1: the depth change can be computed from disp_warp and disp_t0 and compared with the ground-truth change, giving Loss1 = sum(abs(Δdisp - Δdisp_gt)), where Δdisp = disp_warp - disp_t0, sum denotes summation over all pixels, and abs denotes the absolute value. If the ground truth is given in the form of two disparity maps, the difference between disp_warp and disp_1_gt can be computed directly: Loss1 = sum(abs(disp_warp - disp_1_gt)). Loss2 is computed in the same way as Loss1 but using the optimized disp_1: Loss2 = sum(abs(disp_1 - disp_1_gt)).
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The embodiment also provides a device for determining a moving object in an image, which is used for implementing the foregoing embodiments and preferred embodiments, and is not described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 8 is a block diagram of a configuration of an apparatus for determining a moving object in an image according to an embodiment of the present invention, as shown in fig. 8, the apparatus including: an acquisition module 82, configured to acquire a set of images captured by a binocular camera at two consecutive moments, where the two consecutive moments include a first moment and a second moment, the binocular camera includes a left-eye camera and a right-eye camera, an image captured by the left-eye camera at the first moment is a left-eye image at the first moment, an image captured by the right-eye camera at the first moment is a right-eye image at the first moment, an image captured by the left-eye camera at the second moment is a left-eye image at the second moment, and an image captured by the right-eye camera at the second moment is a right-eye image at the second moment; the output module 84 is configured to input the set of images into a scene flow network model to obtain a scene flow of a first time monocular image output by the scene flow network model, where the first time monocular image is the first time left eye image or the first time right eye image, the scene flow network model is obtained by machine learning training using multiple sets of first training data, and each set of first training data in the multiple sets of first training data includes: the binocular camera shoots obtained images and scene flow true values at two continuous moments; a determining module 86, configured to determine a moving object in the first time monocular image according to the scene flow of the first time monocular image.
Optionally, the above apparatus implements determining the moving object in the first time monocular image according to the scene flow of the first time monocular image by: analyzing the scene flow through a clustering algorithm to obtain a motion area in the monocular image at the first moment; dividing an instance object in the first time monocular image by using a first instance division model to obtain a first instance object, wherein the first instance division model is obtained by using a plurality of sets of second training data through machine learning training, and each set of second training data in the plurality of sets of second training data comprises: an image and a first segmentation truth value; and matching the motion area with the first instance object to obtain the motion object.
Optionally, the scene flow network model includes a parallax network model, a first optical flow network model and a second instance segmentation model, and the above device implements inputting the set of images into the scene flow network model to obtain the scene flow of the first moment monocular image output by the scene flow network model as follows: inputting the first moment left eye image and the first moment right eye image into the parallax network model to obtain first parallax output by the parallax network model, and inputting the second moment left eye image and the second moment right eye image into the parallax network model to obtain second parallax output by the parallax network model, wherein the parallax network model is obtained by machine learning training by using multiple groups of third training data, and each group of third training data in the multiple groups of third training data comprises: the binocular camera shoots two frames of images and a parallax true value at the same moment; inputting the first moment left eye image and the second moment left eye image into the first optical flow network model to obtain a left view optical flow output by the first optical flow network model, wherein the first optical flow network model is obtained by machine learning training by using multiple groups of fourth training data, and each group of fourth training data in the multiple groups of fourth training data comprises: the left-eye camera in the binocular cameras shoots images and optical flow truth values at two continuous moments; dividing an example object in the first moment left eye image by using a second example division model to obtain a second example object, wherein the second example division model is obtained by using a plurality of groups of fifth training data through machine learning training, and each group of fifth training data in the plurality of groups of fifth training data comprises: an image and a second segmentation truth value; a scene flow of the first moment left-eye image is determined from the first disparity, the second disparity, the left-view optical flow, and the second instance object.
Optionally, the above apparatus is configured to implement the determining, according to the first parallax, the second parallax, the left view optical flow, and the second example object, the scene flow of the left view image at the first moment in time by: performing motion vector translation on the second parallax according to the left view optical flow to obtain a first mapping parallax for mapping the second parallax to the first moment; optimizing the first mapping parallax by using the first moment left eye image and the second instance object to obtain a first optimized mapping parallax; and obtaining a scene flow of the left-eye image at the first moment through the first optimized mapping parallax, the first parallax and the left-view optical flow.
Optionally, a first loss function between the scene flow of the first moment left eye image output by the scene flow network model and a predetermined known scene flow of the first moment left eye image meets a first target convergence condition, and the first target convergence condition is used for indicating that an output value of the first loss function is within a first predetermined range.
Optionally, the scene flow network model includes a parallax network model, a second optical flow network model and a second instance segmentation model, and the above device is configured to implement the inputting the set of images into the scene flow network model by the following manner, so as to obtain a scene flow of the monocular image at the first moment output by the scene flow network model: inputting the first moment left eye image and the first moment right eye image into the parallax network model to obtain first parallax output by the parallax network model, and inputting the second moment left eye image and the second moment right eye image into the parallax network model to obtain second parallax output by the parallax network model, wherein the parallax network model is obtained by machine learning training by using multiple groups of third training data, and each group of third training data in the multiple groups of third training data comprises: the binocular camera shoots two frames of images and a parallax true value at the same moment; inputting the first moment right eye image and the second moment right eye image into the second optical flow network model to obtain a right view optical flow output by the second optical flow network model, wherein the second optical flow network model is obtained by machine learning training by using a plurality of groups of sixth training data, and each group of sixth training data in the plurality of groups of sixth training data comprises: the right-eye camera in the binocular cameras shoots images and optical flow truth values at two continuous moments; dividing an example object in the right eye image at the first moment by using the second example division model to obtain a third example object, wherein the second example division model is obtained by using a plurality of groups of fifth training data through machine learning training, and each group of fifth training data in the plurality of groups of fifth training data comprises: an image and a second segmentation truth value; a scene flow of the right eye image at the first moment is determined from the first disparity, the second disparity, the right view optical flow, and the third instance object.
Optionally, the above apparatus further determines the scene flow of the first moment right eye image according to the first parallax, the second parallax, the right view optical flow and the third instance object in the following manner: performing motion vector translation on the second parallax according to the right view optical flow to obtain a second mapping parallax that maps the second parallax to the first moment; optimizing the second mapping parallax by using the first moment right eye image and the third instance object to obtain a second optimized mapping parallax; and obtaining the scene flow of the first moment right eye image from the second optimized mapping parallax, the first parallax and the right view optical flow.
Optionally, a second loss function between the scene flow of the first moment right eye image output by the scene flow network model and a predetermined known scene flow of the first moment right eye image satisfies a second target convergence condition, where the second target convergence condition indicates that the output value of the second loss function is within a second predetermined range.
It should be noted that each of the above modules may be implemented by software or hardware; in the latter case, the implementation may be, but is not limited to, the following: the above modules are all located in the same processor, or the above modules are located in different processors in any combination.
An embodiment of the invention also provides a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the following steps:
S1, acquiring a group of images shot by a binocular camera at two continuous moments, wherein the two continuous moments comprise a first moment and a second moment, the binocular camera comprises a left-eye camera and a right-eye camera, the images shot by the left-eye camera at the first moment are left-eye images at the first moment, the images shot by the right-eye camera at the first moment are right-eye images at the first moment, the images shot by the left-eye camera at the second moment are left-eye images at the second moment, and the images shot by the right-eye camera at the second moment are right-eye images at the second moment;
S2, inputting the group of images into a scene flow network model to obtain a scene flow of a first moment monocular image output by the scene flow network model, wherein the first moment monocular image is the first moment left-eye image or the first moment right-eye image, the scene flow network model is obtained by machine learning training using multiple groups of first training data, and each group of first training data in the multiple groups of first training data comprises: images shot by the binocular camera at two continuous moments and a scene flow truth value;
S3, determining the moving object in the monocular image at the first moment according to the scene flow of the monocular image at the first moment.
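For step S3, the optional clustering-and-matching realization (analyzing the scene flow with a clustering algorithm to obtain motion regions and matching them against segmented instance objects, as also recited in claim 2 below) might look as follows; DBSCAN, all thresholds and the NumPy mask format are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def moving_objects(scene_flow, instance_masks, min_motion=1.0, overlap_thr=0.5):
    """Sketch of S3: cluster large scene-flow vectors into motion regions,
    then match each region against instance masks by overlap.
    scene_flow: (H, W, C) per-pixel scene-flow array;
    instance_masks: list of boolean (H, W) masks from instance segmentation."""
    h, w = scene_flow.shape[:2]
    magnitude = np.linalg.norm(scene_flow, axis=2)   # per-pixel motion magnitude
    ys, xs = np.nonzero(magnitude > min_motion)      # candidate moving pixels
    if len(ys) == 0:
        return []
    labels = DBSCAN(eps=3.0, min_samples=20).fit_predict(np.stack((ys, xs), axis=1))
    moving = []
    for lbl in set(labels) - {-1}:                   # -1 marks DBSCAN noise
        region = np.zeros((h, w), dtype=bool)
        region[ys[labels == lbl], xs[labels == lbl]] = True
        for mask in instance_masks:                  # match motion region to instances
            inter = np.logical_and(region, mask).sum()
            if inter / max(mask.sum(), 1) > overlap_thr:
                moving.append(mask)
    return moving
```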
Optionally, the storage medium is further arranged to store a computer program for performing the steps of any of the above optional method embodiments.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other various media capable of storing a computer program.
An embodiment of the invention also provides an electronic device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic device may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to perform the following steps by means of a computer program:
S1, acquiring a group of images shot by a binocular camera at two continuous moments, wherein the two continuous moments comprise a first moment and a second moment, the binocular camera comprises a left-eye camera and a right-eye camera, the images shot by the left-eye camera at the first moment are left-eye images at the first moment, the images shot by the right-eye camera at the first moment are right-eye images at the first moment, the images shot by the left-eye camera at the second moment are left-eye images at the second moment, and the images shot by the right-eye camera at the second moment are right-eye images at the second moment;
S2, inputting the group of images into a scene flow network model to obtain a scene flow of a first moment monocular image output by the scene flow network model, wherein the first moment monocular image is the first moment left-eye image or the first moment right-eye image, the scene flow network model is obtained by machine learning training using multiple groups of first training data, and each group of first training data in the multiple groups of first training data comprises: images shot by the binocular camera at two continuous moments and a scene flow truth value;
S3, determining the moving object in the monocular image at the first moment according to the scene flow of the monocular image at the first moment.
Alternatively, specific examples in this embodiment may refer to the examples described in the foregoing embodiments and optional implementations, and details are not repeated here.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented by a general purpose computing device; they may be concentrated on a single computing device or distributed across a network formed by multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device; in some cases, the steps shown or described may be performed in an order different from that described here. Alternatively, they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc., made within the principle of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A method for determining a moving object in an image, comprising:
acquiring a group of images shot by a binocular camera at two continuous moments, wherein the two continuous moments comprise a first moment and a second moment, the binocular camera comprises a left-eye camera and a right-eye camera, the images shot by the left-eye camera at the first moment are left-eye images at the first moment, the images shot by the right-eye camera at the first moment are right-eye images at the first moment, the images shot by the left-eye camera at the second moment are left-eye images at the second moment, and the images shot by the right-eye camera at the second moment are right-eye images at the second moment;
inputting the group of images into a scene flow network model to obtain a scene flow of a first moment monocular image output by the scene flow network model, wherein the first moment monocular image is the first moment left eye image or the first moment right eye image, the scene flow network model is obtained by using a plurality of sets of first training data through machine learning training, and each set of first training data in the plurality of sets of first training data comprises: images shot by the binocular camera at two continuous moments and a scene flow truth value;
and determining the moving object in the monocular image at the first moment according to the scene flow of the monocular image at the first moment.
2. The method of claim 1, wherein determining the moving object in the first moment monocular image according to the scene flow of the first moment monocular image comprises:
analyzing the scene flow through a clustering algorithm to obtain a motion area in the monocular image at the first moment;
dividing an instance object in the first moment monocular image by using a first instance segmentation model to obtain a first instance object, wherein the first instance segmentation model is obtained by using a plurality of sets of second training data through machine learning training, and each set of second training data in the plurality of sets of second training data comprises: an image and a first segmentation truth value;
and matching the motion area with the first instance object to obtain the moving object.
3. The method of claim 1, wherein the scene flow network model includes a parallax network model, a first optical flow network model and a second instance segmentation model, and wherein inputting the group of images into the scene flow network model to obtain the scene flow of the first moment monocular image output by the scene flow network model comprises:
inputting the first moment left eye image and the first moment right eye image into the parallax network model to obtain a first parallax output by the parallax network model, and inputting the second moment left eye image and the second moment right eye image into the parallax network model to obtain a second parallax output by the parallax network model, wherein the parallax network model is obtained by machine learning training using multiple groups of third training data, and each group of third training data in the multiple groups of third training data comprises: two frames of images shot by the binocular camera at the same moment and a parallax truth value;
inputting the first moment left eye image and the second moment left eye image into the first optical flow network model to obtain a left view optical flow output by the first optical flow network model, wherein the first optical flow network model is obtained by machine learning training using multiple groups of fourth training data, and each group of fourth training data in the multiple groups of fourth training data comprises: images shot by the left-eye camera of the binocular camera at two continuous moments and an optical flow truth value;
dividing an instance object in the first moment left eye image by using a second instance segmentation model to obtain a second instance object, wherein the second instance segmentation model is obtained by using a plurality of groups of fifth training data through machine learning training, and each group of fifth training data in the plurality of groups of fifth training data comprises: an image and a second segmentation truth value;
and determining a scene flow of the first moment left eye image according to the first parallax, the second parallax, the left view optical flow, and the second instance object.
4. The method of claim 3, wherein determining the scene flow of the first moment left eye image according to the first parallax, the second parallax, the left view optical flow, and the second instance object comprises:
performing motion vector translation on the second parallax according to the left view optical flow to obtain a first mapping parallax for mapping the second parallax to the first moment;
optimizing the first mapping parallax by using the first moment left eye image and the second instance object to obtain a first optimized mapping parallax;
and obtaining a scene flow of the first moment left eye image through the first optimized mapping parallax, the first parallax, and the left view optical flow.
5. The method according to claim 3 or 4, wherein a first loss function between the scene flow of the first moment left eye image output by the scene flow network model and a predetermined known scene flow of the first moment left eye image satisfies a first target convergence condition, the first target convergence condition being used to indicate that an output value of the first loss function is within a first predetermined range.
6. The method of claim 1, wherein the scene flow network model includes a parallax network model, a second optical flow network model and a second instance segmentation model, and wherein inputting the group of images into the scene flow network model to obtain the scene flow of the first moment monocular image output by the scene flow network model comprises:
inputting the first moment left eye image and the first moment right eye image into the parallax network model to obtain a first parallax output by the parallax network model, and inputting the second moment left eye image and the second moment right eye image into the parallax network model to obtain a second parallax output by the parallax network model, wherein the parallax network model is obtained by machine learning training using multiple groups of third training data, and each group of third training data in the multiple groups of third training data comprises: two frames of images shot by the binocular camera at the same moment and a parallax truth value;
inputting the first moment right eye image and the second moment right eye image into the second optical flow network model to obtain a right view optical flow output by the second optical flow network model, wherein the second optical flow network model is obtained by machine learning training using a plurality of groups of sixth training data, and each group of sixth training data in the plurality of groups of sixth training data comprises: images shot by the right-eye camera of the binocular camera at two continuous moments and an optical flow truth value;
dividing an instance object in the first moment right eye image by using the second instance segmentation model to obtain a third instance object, wherein the second instance segmentation model is obtained by using a plurality of groups of fifth training data through machine learning training, and each group of fifth training data in the plurality of groups of fifth training data comprises: an image and a second segmentation truth value;
and determining a scene flow of the first moment right eye image according to the first parallax, the second parallax, the right view optical flow, and the third instance object.
7. The method of claim 6, wherein determining the scene flow of the first moment right eye image according to the first parallax, the second parallax, the right view optical flow, and the third instance object comprises:
performing motion vector translation on the second parallax according to the right view optical flow to obtain a second mapping parallax for mapping the second parallax to the first moment;
optimizing the second mapping parallax by using the first moment right eye image and the third instance object to obtain a second optimized mapping parallax;
and obtaining a scene flow of the first moment right eye image through the second optimized mapping parallax, the first parallax, and the right view optical flow.
8. The method according to claim 6 or 7, wherein a second loss function between the scene flow of the first moment right eye image output by the scene flow network model and a predetermined known scene flow of the first moment right eye image satisfies a second target convergence condition, the second target convergence condition being used to indicate that an output value of the second loss function is within a second predetermined range.
9. A device for determining a moving object in an image, comprising:
an acquisition module, configured to acquire a group of images shot by a binocular camera at two continuous moments, wherein the two continuous moments comprise a first moment and a second moment, the binocular camera comprises a left-eye camera and a right-eye camera, the images shot by the left-eye camera at the first moment are left-eye images at the first moment, the images shot by the right-eye camera at the first moment are right-eye images at the first moment, the images shot by the left-eye camera at the second moment are left-eye images at the second moment, and the images shot by the right-eye camera at the second moment are right-eye images at the second moment;
an output module, configured to input the group of images into a scene flow network model to obtain a scene flow of a first moment monocular image output by the scene flow network model, wherein the first moment monocular image is the first moment left eye image or the first moment right eye image, the scene flow network model is obtained by using multiple sets of first training data through machine learning training, and each set of first training data in the multiple sets of first training data includes: images shot by the binocular camera at two continuous moments and a scene flow truth value;
and a determining module, configured to determine the moving object in the first moment monocular image according to the scene flow of the first moment monocular image.
10. A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the method of any of claims 1 to 8 when run.