CN118196135A - Image processing method, apparatus, storage medium, device, and program product
- Publication number
- CN118196135A (Application No. CN202410316474.0A)
- Authority
- CN
- China
- Prior art keywords
- image
- frame
- binocular
- motion detection
- detection result
- Prior art date
- Legal status
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/207—Analysis of motion for motion estimation over a hierarchy of resolutions
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20228—Disparity calculation for image-based rendering
Abstract
The application discloses an image processing method, an image processing apparatus, a storage medium, a device, and a program product. The method includes: acquiring a binocular image sequence, wherein each frame of binocular image in the binocular image sequence includes a left image and a right image; performing image synthesis processing on the left image and the right image in each frame of binocular image to obtain a synthesized image corresponding to each frame of binocular image, wherein each frame of synthesized image includes an overlapping area between the left image and the right image; and performing local motion detection according to the overlapping area to obtain a motion detection result. By acquiring the binocular image sequence, performing the image synthesis processing, and performing local motion detection based on the overlapping area in the synthesized image corresponding to the binocular image, the method reduces the computational load of the device and improves detection efficiency.
Description
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image processing method, an image processing apparatus, a storage medium, a device, and a program product.
Background
With the continuous development of technology, Extended Reality (XR) has become a popular field, and its application scenarios are increasingly wide. XR technology encompasses multiple forms, including Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). Among these technologies, the See-through function is attracting increasing attention and application. By capturing a real-time view of the surrounding environment and displaying it on the screen, the See-through function gives the user the sensation of seeing the surrounding real world directly through the head-mounted display device. When See-through is used, motion can be divided into two categories: global motion and local motion.
Global motion refers to motion of the entire head-mounted display device. Global motion can be described as translational and rotational states in six degrees of freedom (6DoF) using data from an Inertial Measurement Unit (IMU).
Local motion refers to the motion of a moving object in a See-through scene while the device itself is stationary.
For XR devices, detection of global motion is critical, as it is a central element in creating a high-quality virtual reality experience. Global motion not only affects Mesh reconstruction and spatial scene reconstruction for See-through, but is also closely related to the Field of View (FoV), image noise reduction effects (e.g., multi-frame temporal noise reduction, MCTF), and image anti-shake effects. These global motions are captured by the IMU and 6DoF sensors on the VR device.
However, in contrast to global motion detection, local motion detection is still absent in current XR devices. This is mainly due to limitations in power consumption and computing power: no local motion detection module exists in the current camera software architecture. In addition, the See-through scene places extremely high demands on image processing speed. If the Pixel-to-Pixel (PTP) time (the time from when the image sensor captures pixel information until that pixel information is displayed on the screen) exceeds 30 ms, the user may perceive distortion; if it exceeds 50 ms, the user may experience dizziness. Thus, if local motion detection for color (RGB) images were added serially in the Image Signal Processor (ISP) in the conventional manner, the Module-to-Pixel (MTP) time (the time from the input of the image processing module to the final pixel display on the screen) would be greatly increased, and the user experience could be severely affected.
In summary, although global motion detection has found widespread use in XR devices, local motion detection remains challenging in current XR devices due to technical limitations and user experience requirements.
Disclosure of Invention
The embodiments of the application provide an image processing method, an image processing apparatus, a storage medium, a device, and a program product, which can perform local motion detection based on an overlapping area in a composite image corresponding to a binocular image, reducing the computational load of the device and improving detection efficiency.
In one aspect, an embodiment of the present application provides an image processing method, including:
Acquiring a binocular image sequence, wherein each frame of binocular image in the binocular image sequence includes a left image and a right image; performing image synthesis processing on the left image and the right image in each frame of the binocular image to obtain a synthesized image corresponding to each frame of the binocular image, wherein each frame of the synthesized image includes an overlapping area between the left image and the right image; and performing local motion detection according to the overlapping area to obtain a motion detection result.
In another aspect, an embodiment of the present application provides an image processing apparatus, including:
A first acquisition unit, configured to acquire a binocular image sequence, where each frame of binocular image in the binocular image sequence includes a left image and a right image;
a first processing unit, configured to perform image synthesis processing on the left image and the right image in each frame of the binocular image to obtain a synthesized image corresponding to each frame of the binocular image, wherein each frame of the synthesized image includes an overlapping area between the left image and the right image;
and a detection unit, configured to perform local motion detection according to the overlapping area to obtain a motion detection result.
In another aspect, embodiments of the present application provide a computer readable storage medium storing a computer program adapted to be loaded by a processor to perform the image processing method according to any of the embodiments above.
In another aspect, an embodiment of the present application provides a terminal device, where the terminal device includes a processor and a memory, where the memory stores a computer program, and the processor is configured to execute the image processing method according to any one of the embodiments above by calling the computer program stored in the memory.
In another aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the image processing method according to any of the embodiments above.
According to the embodiments of the application, a binocular image sequence is acquired in which each frame of binocular image includes a left image and a right image. This makes full use of the advantages of binocular vision: the two images capture the same scene from different angles and provide rich information for the subsequent image synthesis processing. Image synthesis processing is performed on the left image and the right image in each frame of binocular image to obtain a synthesized image corresponding to each frame of binocular image, and each frame of synthesized image includes an overlapping area between the left image and the right image. This reduces the complexity of subsequent processing, and the overlapping area produced by the synthesis contains key visual information, providing an important basis for the subsequent local motion detection. Local motion detection is then performed according to the overlapping area to obtain a motion detection result. Compared with a traditional global motion detection approach, this local motion detection is more efficient and accurate: focusing on the overlapping area reduces unnecessary computation, lowers the computational burden on the device, and improves detection efficiency, while the rich visual information contained in the overlapping area makes the motion detection result more accurate and reliable. In summary, by acquiring the binocular image sequence, performing the image synthesis processing, and performing local motion detection based on the overlapping area in the synthesized image corresponding to the binocular image, the embodiments of the application reduce the computational load of the device and improve detection efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a flowchart of an image processing method according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a first application scenario of an image processing method according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a second application scenario of the image processing method according to the embodiment of the present application.
Fig. 4 is a schematic diagram of a third application scenario of the image processing method according to the embodiment of the present application.
Fig. 5 is a schematic diagram of a fourth application scenario of an image processing method according to an embodiment of the present application.
Fig. 6 is a schematic diagram of a fifth application scenario of an image processing method according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application.
Fig. 8 is a first schematic structural diagram of a terminal device according to an embodiment of the present application.
Fig. 9 is a second schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort fall within the scope of the present application.
The embodiments of the application can be applied to various application scenarios such as Extended Reality (XR), Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR).
First, some of the nouns or terms appearing in the description of the embodiments are explained as follows:
A virtual scene is a scene that an application program displays (or provides) when running on a terminal or a server. Optionally, the virtual scene is a simulation of the real world, a semi-simulated and semi-fictional virtual environment, or a purely fictional virtual environment. The virtual scene may be a two-dimensional virtual scene or a three-dimensional virtual scene, and the virtual environment may be sky, land, ocean, and the like, where the land includes environmental elements such as deserts and cities. The virtual scene is a scene in which complete game logic, such as user control of virtual objects, takes place.
A virtual object refers to a dynamic object that can be controlled in a virtual scene. Optionally, the dynamic object may be a virtual character, a virtual animal, a cartoon character, or the like. The virtual object may be a character controlled by a player through an input device, an Artificial Intelligence (AI) trained for battles in the virtual environment, or a Non-Player Character (NPC) set in a virtual scene battle. Optionally, the virtual object is a virtual character competing in the virtual scene. Optionally, the number of virtual objects in a virtual scene battle is preset or dynamically determined according to the number of clients joining the battle, which is not limited by the embodiments of the present application. In one possible implementation, a user can control a virtual object to move in the virtual scene, for example, to run, jump, or crawl, and can also control the virtual object to fight other virtual objects using skills, virtual props, and the like provided by the application. Alternatively, a virtual object may also refer to a static object that can be interacted with in a virtual scene, such as a virtual item, a virtual control, an interface element, or a virtual prop.
Extended Reality (XR) is a technology that encompasses the concepts of Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR), and represents an environment in which the virtual world is connected to the real world and with which the user can interact in real time.
Virtual Reality (VR) is a technology for creating and experiencing a virtual world. It generates a virtual environment through computation and is a simulation based on multi-source information (the virtual reality mentioned herein includes at least visual perception, and may also include auditory perception, tactile perception, motion perception, and even gustatory perception, olfactory perception, and the like). It realizes the simulation of a fused, interactive three-dimensional dynamic view and of entity behavior in the virtual environment, immerses the user in the simulated three-dimensional environment, and is applied in various virtual environments such as maps, games, videos, education, medical treatment, simulation, collaborative training, sales, manufacturing assistance, and maintenance and repair.
Augmented Reality (AR) is a technology for calculating, in real time, the camera pose parameters of a camera in the real world (or three-dimensional world, physical world) while the camera captures images, and adding virtual elements to the images captured by the camera according to those camera pose parameters. Virtual elements include, but are not limited to, images, videos, and three-dimensional models. The goal of AR technology is to overlay the virtual world on the real world on the screen for interaction.
Mixed Reality (MR) integrates computer-created sensory input (e.g., virtual objects) with sensory input from a physical scenery or a representation thereof into a simulated scenery, and in some MR sceneries, the computer-created sensory input may be adapted to changes in sensory input from the physical scenery. In addition, some electronic systems for rendering MR scenes may monitor orientation and/or position relative to the physical scene to enable virtual objects to interact with real objects (i.e., physical elements from the physical scene or representations thereof). For example, the system may monitor movement such that the virtual plants appear to be stationary relative to the physical building.
Enhanced virtualization (Augmented Virtuality, AV): AV scenery refers to a simulated scenery in which a computer created scenery or virtual scenery incorporates at least one sensory input from a physical scenery. The one or more sensory inputs from the physical set may be a representation of at least one feature of the physical set. For example, the virtual object may present the color of the physical element captured by the one or more imaging sensors. As another example, the virtual object may exhibit characteristics consistent with actual weather conditions in the physical scenery, as identified via weather-related imaging sensors and/or online weather data. In another example, an augmented reality forest may have virtual trees and structures, but an animal may have features that are accurately reproduced from images taken of a physical animal.
A virtual field of view represents the region of the virtual environment that a user can perceive through the lenses of a virtual reality device, expressed as the Field of View (FOV) of that virtual field of view.
A virtual reality device is a terminal for realizing the virtual reality effect, and may be provided in the form of glasses, a Head-Mounted Display (HMD), or contact lenses for realizing visual perception and other forms of perception; the form of the virtual reality device is not limited thereto, and it may be further miniaturized or enlarged as needed.
The head-mounted display device described in the embodiment of the present application may be a virtual reality device, and the virtual reality device described in the embodiment of the present application may include, but is not limited to, the following types:
A computer-side virtual reality (PCVR) device performs the related computation of the virtual reality functions and the data output on the PC side; the externally connected PCVR device uses the data output by the PC side to realize the virtual reality effect.
A mobile virtual reality device supports setting up a mobile terminal (such as a smartphone) in various ways (such as a head-mounted display provided with a dedicated card slot). Through a wired or wireless connection with the mobile terminal, the mobile terminal performs the related computation of the virtual reality functions and outputs the data to the mobile virtual reality device, for example, watching a virtual reality video through an APP of the mobile terminal.
An integrated virtual reality device has a processor for performing the computation related to the virtual functions, and therefore has independent virtual reality input and output functions. It does not need to be connected to a PC or a mobile terminal, and offers a high degree of freedom of use.
The following will describe in detail. It should be noted that the following description order of embodiments is not a limitation of the priority order of embodiments.
The embodiments of the present application provide an image processing method, which may be executed by a terminal or a server, or may be executed by the terminal and the server together; the embodiment of the present application is explained by taking an image processing method performed by a terminal device as an example.
Referring to fig. 1 to 6, fig. 1 is a flowchart of an image processing method according to an embodiment of the present application, and fig. 2 to 6 are application scenarios of the image processing method according to the embodiment of the present application. The method comprises the following steps 110 to 130:
step 110, a binocular image sequence is acquired, each frame of binocular image in the binocular image sequence including a left image and a right image.
In some embodiments, the acquiring a sequence of binocular images includes: and acquiring a binocular image sequence acquired by the target binocular camera, wherein the binocular image sequence comprises a t-1 frame binocular image and a t frame binocular image.
In some embodiments, the target binocular camera is a monochrome binocular camera, the left image is a monochrome left image, and the right image is a monochrome right image.
Wherein the binocular image sequence is a continuous sequence of frames consisting of a series of binocular images, wherein each frame of binocular images contains a left image and a right image. The left image and the right image are respectively obtained by shooting two lenses of the binocular camera from different visual angles at the same time, and certain parallax exists between the left image and the right image, and the parallax provides abundant space information for subsequent image processing.
In practical applications, a target binocular camera is required to be selected, and the camera has two lenses side by side, which respectively correspond to the left image and the right image. The target binocular camera can be professional equipment specially designed for machine vision or computer vision tasks, or can be a common camera with binocular vision function.
And starting the target binocular camera to acquire images. The camera continuously shoots the scene according to a set frame rate (such as 30 frames per second, 60 frames per second, etc.), and generates a series of binocular images. Each frame of binocular image comprises a left image and a right image, the two images are respectively shot by two lenses of the camera, and parallax exists between the two images due to different visual angles.
Wherein, through the target binocular camera, a series of continuous binocular images, i.e. a binocular image sequence, can be obtained. Each frame in the sequence contains two images: one is the left image and the other is the right image. The left and right images are captured by the left and right lenses of the binocular camera, respectively, reflecting views of the same scene from two different perspectives on the left and right sides, respectively.
In the acquisition of the binocular image sequence, some technical details need to be noted. For example, stability and image quality of the target binocular camera are ensured to avoid image distortion due to shake or noise. In addition, the calibration and calibration of the target binocular camera are required to be accurately controlled, so that the corresponding relation between the left image and the right image is ensured to be accurate.
In addition, depending on the actual application scenario and requirements, the target binocular camera can be a monochrome binocular camera or a color binocular camera. A monochrome binocular camera is generally used in scenes with low requirements on image color information and high requirements on computational efficiency and real-time performance, such as robot navigation simulation, automatic driving simulation, and the establishment of a six-degree-of-freedom (6DoF) system. A color camera can provide richer color information, which is helpful for accurate image analysis and understanding in more complex scenes.
In the embodiments of the application, the target binocular camera may be a monochrome binocular camera. This means that the binocular images captured by the monochrome binocular camera are monochrome rather than color. Compared with a color binocular camera, a monochrome binocular camera has higher resolution and faster processing speed, and is more suitable for application scenarios with higher requirements on real-time performance and precision. Therefore, in this case, both the acquired left image and right image are monochrome; they contain only grayscale information, not color information.
Furthermore, in order to analyze the motion state more accurately, variations between successive frames are often of concern. Thus, when a binocular image sequence is acquired, a t-1 frame binocular image and a t frame binocular image are acquired. The two frames of binocular images represent scenes at two adjacent moments in time series, respectively, and by comparing differences between them, occurrence and change of motion can be effectively detected.
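For illustration only, the following is a minimal sketch of how consecutive binocular frames (the t-1 frame and the t frame) could be paired from such a sequence. It is not part of the claimed method; the helper read_stereo_frame() is a hypothetical stand-in for whatever camera SDK call returns one synchronized left/right pair per exposure.

```python
from collections import deque

def stream_frame_pairs(read_stereo_frame, num_frames):
    """Yield ((left, right) at t-1, (left, right) at t) for consecutive frames."""
    history = deque(maxlen=2)  # keep only the t-1 frame and the t frame
    for _ in range(num_frames):
        left, right = read_stereo_frame()  # one synchronized binocular frame (hypothetical helper)
        history.append((left, right))
        if len(history) == 2:
            yield history[0], history[1]  # (t-1 frame binocular image, t frame binocular image)
```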
As shown in fig. 2 and 3, a binocular image sequence acquired by a monochrome binocular camera is acquired, and each frame of binocular image 10 in the binocular image sequence includes a monochrome left image 11 and a monochrome right image 12.
For example, taking a six-degree-of-freedom (6DoF) architecture as the application scenario, the monochrome binocular camera may also be a 6DoF camera, the binocular image may also be a 6DoF binocular image, and the 6DoF binocular image may include a six-degree-of-freedom left (6DoF_Left) image and a six-degree-of-freedom right (6DoF_Right) image.
For example, as shown in fig. 2, the imaging system of the head-mounted display device may include a monochrome binocular camera and a color (RGB) camera, where the monochrome binocular camera includes a left camera (which may be a 6DoF_Left camera) and a right camera (which may be a 6DoF_Right camera). For example, the imaging system can output one frame of 6DoF binocular images (a 6DoF_Left image and a 6DoF_Right image) together with one frame of color (RGB) image per exposure at 60 frames per second (fps), i.e., every 16.6 milliseconds (ms); this high frame rate ensures image consistency and real-time performance, so that users perceive smooth animation and response when interacting or viewing. The 6DoF image format is an 8-bit single-channel (mono) grayscale image, which may have a resolution of 640 x 480; these images are critical for depth perception and localization because they provide accurate position and pose information of objects in the scene. The monochrome left camera and the monochrome right camera have a common view angle of about 94 degrees. The RGB image format may be an 8-bit three-channel color image (YUV) with a resolution of 3240 x 2484; this high-resolution color image provides rich visual details for the user, making the scene more vivid and lively. Its field angle is approximately 136 degrees.
The six degrees of freedom generally include three translational degrees of freedom (movement along the X, Y, and Z axes) and three rotational degrees of freedom (rotation about the X, Y, and Z axes). This technique is commonly used in virtual reality, augmented reality, robotic navigation, and many other fields where precise spatial positioning is required. With 6DoF binocular images, the device can track its position and orientation in three dimensions. In certain applications, the 6DoF_Left and 6DoF_Right cameras can be used to create a stereoscopic scene, to increase the field of view, to improve positioning accuracy, or any combination of these. When the 6DoF_Left camera and the 6DoF_Right camera are placed relatively close together and their fields of view partially overlap, the overlapping part is referred to as their "common view angle". For example, the two cameras have a common view angle of about 94°, which means the two cameras jointly observe a field of view of about 94°. This overlapping field of view allows the two cameras to work cooperatively, improving the accuracy of positioning and tracking.
The target binocular camera and the RGB camera can capture image data simultaneously in the virtual reality system. In the embodiment of the application, the target binocular camera is mainly responsible for acquiring depth information and pose data (position and direction) of the camera, which is important for realizing space positioning, scene reconstruction and interactive operation. The RGB camera captures color (RGB) images and provides rich visual information for the user.
For example, as shown in fig. 3 to 6, the composite image 30 generated after the image combining process of the monochrome left image 11 and the monochrome right image 12 contains a 94° overlapping area 301 as shown in fig. 5. This overlapping area 301 is also contained within the 136° field angle of the RGB image 20; that is, the field overlapping area 201 in the RGB image 20 shown in fig. 6 has the same view content as the overlapping area 301 shown in fig. 5. Therefore, during local motion detection, it is only necessary to detect whether local motion occurs in the overlapping area, rather than performing local motion detection over the entire 136° area of the RGB image. This significantly reduces the amount of data and processing time and improves the efficiency of the system.
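As a rough sketch of this idea (not taken from the patent text), the detection can be restricted to a fixed rectangular region of the composite image that corresponds to the common field of view; the rectangle coordinates below are illustrative placeholders that would in practice come from the stereo calibration.

```python
import numpy as np

def crop_overlap_region(composite, overlap_rect):
    """Return only the binocular overlap region of a composite image.

    overlap_rect = (x, y, w, h); the values used below are purely illustrative.
    """
    x, y, w, h = overlap_rect
    return composite[y:y + h, x:x + w]

# Example: run local motion detection only inside the ~94-degree common field
# of view instead of over the full wide-angle RGB frame.
composite_t = np.zeros((480, 1000), dtype=np.uint8)          # placeholder composite image
overlap_roi = crop_overlap_region(composite_t, (180, 0, 640, 480))
```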
As shown in fig. 2, the binocular image sequence captured by the target binocular camera needs to be processed by virtual-reality-related algorithms, such as Simultaneous Localization and Mapping (SLAM) and Mesh calculation. The computational work of these algorithms is mainly done on the Central Processing Unit (CPU) and the Graphics Processing Unit (GPU). The CPU is responsible for logic and computing tasks such as path planning and scene understanding, while the GPU excels at processing image and video data in parallel, accelerating the rendering of visual effects and image processing tasks.
The raw image data output by the RGB camera is processed by an Image Formation Engine (IFE) and an Image Processing Engine (IPE). These processing steps are mainly done on an Image Signal Processor (ISP). The ISP is responsible for processing the raw image data captured by the image sensor and performing a series of image processing operations, such as denoising, color correction, white balance, and exposure compensation, to generate a color image suitable for display.
The processing of the binocular images and the processing of the color images are performed concurrently. This means that the two can proceed independently without interfering with each other, which improves the processing efficiency and the response speed of the whole system.
In the Mesh calculation process, the newly added local motion detection module is deployed behind the image combination module. The image combining module is used for fusing and aligning the left image captured by the left camera with the right image captured by the right camera so as to construct a complete three-dimensional scene. The local motion detection module is used for carrying out local motion detection in the synthesized image after image combination so as to detect the motion information of the moving target, which is important for realizing accurate space interaction and dynamic scene reconstruction.
In fig. 2, the local motion detection module is disposed at a position in the Mesh calculation process, and is used for performing local motion detection. The local motion detection module receives the synthesized image data output by the image synthesizing module, performs local motion detection processing on an overlapped area in the synthesized image data, and provides real-time and accurate motion information for a virtual reality system.
And 120, performing image synthesis processing on the left image and the right image in each frame of the binocular image to obtain a synthesized image corresponding to each frame of the binocular image, wherein each frame of the synthesized image comprises an overlapping area between the left image and the right image.
The method comprises the steps of combining a left image and a right image in each frame of binocular image to generate a composite image containing information of the left image and the right image. The composite image not only integrates the visual content of the left image and the right image, but also highlights the overlapping area between the left image and the right image, and provides a key basis for subsequent motion detection.
For example, the image synthesis process may include the following preprocessing, alignment, synthesis, and the like.
First, the left image and the right image are preprocessed. The preprocessing may include at least one of denoising, filtering, contrast correction, brightness correction, distortion correction, stereo correction, binocular matching, etc., to eliminate interference information in the image and ensure quality and reliability of the composite image.
For example, the primary purpose of denoising is to eliminate noise components in the binocular images, which may originate from sensor noise, transmission noise, or other environmental factors. Common denoising methods include median filtering, Gaussian filtering, bilateral filtering, and the like. These methods can effectively smooth an image and reduce the influence of noise on image quality.
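A minimal sketch of these denoising filters, assuming OpenCV is available; the kernel sizes and sigma values are illustrative defaults, not values specified by the patent.

```python
import cv2

def denoise_mono_image(img, method="gaussian"):
    """Apply one of the denoising filters mentioned above to an 8-bit mono image."""
    if method == "median":
        return cv2.medianBlur(img, 5)                  # robust to salt-and-pepper noise
    if method == "gaussian":
        return cv2.GaussianBlur(img, (5, 5), 1.5)      # smooths Gaussian sensor noise
    if method == "bilateral":
        return cv2.bilateralFilter(img, 9, 75, 75)     # smooths while preserving edges
    raise ValueError("unknown method: " + method)
```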
For example, the filtering operation is intended to improve the visual effect of the image or enhance certain features of the image. In addition to being used for denoising, filtering can also be used for image sharpening, edge detection, and the like.
For example, contrast correction is concerned with differences between different colors or brightness levels in binocular images. When the contrast is low, details in the image may not be clear enough and the whole looks blurred; while too high a contrast may result in the image appearing too glaring and losing detail. The goal of contrast correction is therefore to make the image clearer and sharper by adjusting the differences between the different colors or brightness levels in the binocular image. This typically involves stretching the contrast range of the image, making dark portions darker and bright portions brighter, thereby increasing the layering of the image.
For example, luminance correction is mainly focused on the overall brightness level of a binocular image. Too low a brightness may result in too dark an image and illegible details; while too high a brightness may make the image appear too exposed and lose detail. Thus, the goal of the luminance correction is to make the image look brighter and more natural by adjusting the overall luminance level of the image. This typically involves increasing or decreasing the brightness value of the image to achieve the desired visual effect.
For example, distortion is image distortion due to physical characteristics of a camera lens. Distortion correction is particularly important in binocular vision systems because it directly affects the alignment and matching of the left and right images. Distortion correction generally includes correction of radial distortion and tangential distortion. Radial distortion occurs mainly at the edges of the image, manifesting as stretching or compression of the image; tangential distortion is caused by imprecise lens mounting or by the sensor plane not being parallel to the lens. By calibrating the camera and applying the corresponding distortion coefficients, these distortions can be effectively eliminated. The specific steps of distortion correction generally include: calibrating the camera by using calibration tools such as checkerboard and the like to obtain internal parameters and external parameters of the camera; calculating a distortion coefficient according to the calibration result; correcting an original image by using a distortion coefficient to obtain an undistorted image; after distortion correction, the straight line in the image becomes more straight, and the shape and position of the object are more accurate.
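The calibration and correction steps listed above can be sketched as follows with OpenCV. This is an illustrative outline only; it assumes that checkerboard object points and image points (obj_points, img_points) have already been collected during calibration.

```python
import cv2

def undistort_with_checkerboard(img, obj_points, img_points):
    """Calibrate from checkerboard correspondences, then undistort the raw image."""
    h, w = img.shape[:2]
    # Steps 1-2: calibrate the camera to obtain intrinsics and distortion coefficients.
    ret, camera_matrix, dist_coeffs, rvecs, tvecs = cv2.calibrateCamera(
        obj_points, img_points, (w, h), None, None)
    # Step 3: correct the original image using the estimated distortion coefficients.
    new_matrix, _ = cv2.getOptimalNewCameraMatrix(camera_matrix, dist_coeffs, (w, h), 1)
    return cv2.undistort(img, camera_matrix, dist_coeffs, None, new_matrix)
```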
For example, with respect to stereo correction. In the binocular vision system, due to the relative positional relationship (such as horizontal spacing, angular deviation, etc.) between the two cameras, there is a geometrical inconsistency in the images captured by the left and right cameras. To eliminate such inconsistencies, stereo correction is required. The process is by transforming the planes of the two images (left and right) so that they are coplanar and parallel to the baseline. The baseline is a straight line connecting the centers of the two cameras, which determines the working range of the stereoscopic vision system. After the stereo correction, the polar lines of the left image and the right image are aligned, namely the corresponding pixels are positioned on the same row. In this way, the process of searching for the corresponding pixel is greatly simplified when stereo matching is performed, and only one horizontal scanning line is needed, without traversing the whole image. The method not only remarkably reduces the computational complexity and improves the matching efficiency, but also is beneficial to improving the accuracy and stability of matching.
For example, with regard to binocular matching, it is intended to find corresponding points of the same object or scene in two images (left and right images). These corresponding points reflect the positional relationship of the object in the left and right camera views. By calculating the parallax between the corresponding points (i.e., the horizontal pixel differences between the corresponding points in the left and right images), object information in the three-dimensional scene can be further restored. The parallax is inversely proportional to the distance of the object from the camera, i.e. the closer the object is to the camera, the greater the parallax; conversely, the farther the object is from the camera, the less parallax. Therefore, through binocular matching and parallax calculation, depth information of objects in a scene can be acquired, and three-dimensional reconstruction is realized. Wherein, through binocular matching processing, an overlapping region between the left image and the right image can be determined.
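For binocular matching and parallax (disparity) calculation, a readily available block-matching routine can serve as an illustration; the sketch below assumes rectified 8-bit mono images, and the matcher parameters are illustrative rather than values from the patent.

```python
import cv2

def compute_disparity(rect_left, rect_right):
    """Estimate per-pixel disparity between rectified left/right mono images."""
    matcher = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=64,       # search range; must be a multiple of 16
        blockSize=7,
        uniquenessRatio=10,
        speckleWindowSize=100,
        speckleRange=2,
    )
    # compute() returns fixed-point disparities scaled by 16.
    disparity = matcher.compute(rect_left, rect_right).astype("float32") / 16.0
    return disparity  # larger disparity = closer object; invalid where the views do not overlap
```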
The left and right images may then be precisely aligned by image registration or alignment techniques. Because of parallax in the binocular camera shooting, a small position deviation may exist between the left and right images. Therefore, it is necessary to calculate the transformation relationship between the two images by an algorithm and transform one of the two images so that the two images can be perfectly matched in the overlapping region.
After the alignment is completed, the image synthesizing operation is started. This can be achieved by various methods such as weighted averaging, maximum synthesis, minimum synthesis, etc. The specific choice of which method depends on the application scenario and the requirements. For example, in some cases, it may be desirable to retain specific information in the left and right images, where a weighted average approach may be used to combine the two images based on the weights of the pixels.
For example, the weighted average method is a simple and effective image synthesis method. In this approach, the value of each pixel is calculated from a weighted average of the values of the corresponding pixels in the left and right images. The choice of weights may be determined based on factors such as reliability, brightness, contrast, etc. of the pixels. For example, in the overlapping region, the weights may be dynamically adjusted according to the disparity or depth information of the two images to preserve more detail information. By this method, the two images can be smoothly combined, and artifacts and distortion in the synthesis process can be reduced.
For example, the maximum synthesis method takes the maximum value of the corresponding pixel in the left image and the right image as the value of the pixel in the synthesized image. This method is suitable for the case where it is necessary to preserve areas of high brightness or contrast in the two images. For example, in binocular images taken at night or under low light conditions, there may be some areas of one image that are brighter or sharper than the other. By maximum synthesis, an image with higher overall brightness and richer details can be synthesized.
For example, the minimum synthesis method takes the minimum value of the corresponding pixel in the left image and the right image as the value of the pixel in the synthesized image. This method is commonly used to remove noise or highlight outliers. In some cases, there may be abnormally high areas in one image due to sensor noise or lighting conditions, while these areas in another image may be relatively normal. By minimum synthesis, these abnormally-high areas can be removed, making the synthesized image smoother and more natural.
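The three pixel-level synthesis rules described above can be sketched as follows for the aligned overlap region of the left and right images (illustration only):

```python
import numpy as np

def blend_weighted(left_roi, right_roi, w_left=0.5):
    """Weighted average of corresponding pixels."""
    return (w_left * left_roi + (1.0 - w_left) * right_roi).astype(left_roi.dtype)

def blend_max(left_roi, right_roi):
    """Keep the brighter of the two corresponding pixels."""
    return np.maximum(left_roi, right_roi)

def blend_min(left_roi, right_roi):
    """Keep the darker pixel, suppressing abnormally bright values and noise."""
    return np.minimum(left_roi, right_roi)
```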
For example, the multiband fusion method is a more complex image synthesis method, which is based on a multiscale or multiband decomposition technique. Firstly, respectively carrying out multi-frequency band decomposition on a left image and a right image to obtain sub-images with different frequency bands or scales. Then, the sub-images of each frequency band or scale are synthesized, and methods such as weighted average, maximum synthesis or minimum synthesis can be adopted. The synthesized sub-images are then recombined into a final synthesized image. The method can fully utilize information of different frequency bands or scales, and improves the quality and visual effect of the synthesized image.
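A minimal sketch of multiband (Laplacian pyramid) fusion is given below. It assumes the two inputs are already aligned, have dimensions compatible with the chosen number of pyramid levels, and simply averages the sub-images of each band, which is only one of the possible per-band combination rules mentioned above.

```python
import cv2
import numpy as np

def pyramid_fuse(img_a, img_b, levels=4):
    """Fuse two aligned images band by band using Laplacian pyramids."""
    def laplacian_pyramid(img):
        gauss = [img.astype(np.float32)]
        for _ in range(levels):
            gauss.append(cv2.pyrDown(gauss[-1]))
        lap = [gauss[-1]]                                  # coarsest band first
        for i in range(levels, 0, -1):
            up = cv2.pyrUp(gauss[i], dstsize=(gauss[i - 1].shape[1], gauss[i - 1].shape[0]))
            lap.append(gauss[i - 1] - up)                  # detail band at this scale
        return lap

    bands = [(a + b) / 2.0 for a, b in zip(laplacian_pyramid(img_a), laplacian_pyramid(img_b))]
    fused = bands[0]
    for band in bands[1:]:                                 # rebuild from coarse to fine
        fused = cv2.pyrUp(fused, dstsize=(band.shape[1], band.shape[0])) + band
    return np.clip(fused, 0, 255).astype(np.uint8)
```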
When a specific synthesis method is selected, factors such as application scenes, image quality, calculation efficiency and the like need to be considered. Tradeoffs and choices are needed depending on the situation. For example, in a scenario where more detailed information needs to be retained, a weighted average approach may be more appropriate; whereas in a scenario where noise or outliers are removed, the minimum synthesis may be more efficient.
During the synthesis, special care is taken to deal with the overlapping areas. The overlapping area is a common part of the left image and the right image, contains rich spatial information and parallax information, and is a key place for motion detection. Therefore, during the synthesis, it is necessary to ensure that the pixels in the overlapping region can be accurately processed, and problems such as blurring and misalignment are avoided.
And obtaining a synthesized image corresponding to each frame of binocular image through image synthesis processing. The composite image not only contains all the information of the left image and the right image, but also highlights the overlapping area between the two. The overlapping region provides powerful support for subsequent motion detection, so that the detection algorithm can more accurately identify moving objects in the scene.
For example, after the composite image corresponding to each frame of the binocular image is obtained, the overlapping area between the left image and the right image may also be identified in each frame of the composite image. Such as color mapping, transparency overlay, edge delineation, etc., may be used to identify the overlap region.
Color mapping: the overlapping areas are assigned a unique color or combination of colors for easy identification in the composite image. This method may be implemented by color replacement or blending of pixels in the overlap region.
Transparency overlay: In the composite image, the overlapping region may be marked by adjusting its transparency. In particular, the pixels of the overlapping area may be superimposed with a translucent color so that this part of the area is visually distinguished from other areas. This method both retains the original information of the overlapping area and highlights its position through the change in transparency.
Edge delineation: edge delineation is used to identify boundaries of overlapping regions. Boundaries of the overlapping regions may be extracted by an edge detection algorithm and lines or contours drawn on these boundaries. The color, thickness and pattern of the lines can be adjusted as needed to clearly show the boundaries of the overlapping areas in the composite image.
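As an illustration of the transparency-overlay and edge-delineation ideas (not a required implementation), the overlap region of a mono composite image can be marked as follows, assuming its bounding rectangle is known:

```python
import cv2

def mark_overlap(composite_gray, overlap_rect, alpha=0.3):
    """Highlight the overlap region with a translucent color and outline its boundary."""
    x, y, w, h = overlap_rect
    vis = cv2.cvtColor(composite_gray, cv2.COLOR_GRAY2BGR)
    overlay = vis.copy()
    overlay[y:y + h, x:x + w] = (0, 255, 0)                     # color mapping of the region
    vis = cv2.addWeighted(overlay, alpha, vis, 1.0 - alpha, 0)  # transparency overlay
    cv2.rectangle(vis, (x, y), (x + w, y + h), (0, 0, 255), 2)  # edge delineation
    return vis
```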
In some embodiments, the performing image synthesis processing on the left image and the right image in the binocular image of each frame to obtain a synthesized image corresponding to the binocular image of each frame includes:
Performing image synthesis processing on a t-1 frame left image and a t-1 frame right image in the t-1 frame binocular image to obtain a t-1 frame synthesized image corresponding to the t-1 frame binocular image, wherein the t-1 frame synthesized image comprises a first overlapping area between the t-1 frame left image and the t-1 frame right image, and t is a positive integer;
And carrying out image synthesis processing on a t-frame left image and a t-frame right image in the t-frame binocular image to obtain a t-frame synthesized image corresponding to the t-frame binocular image, wherein the t-frame synthesized image comprises a second overlapping area between the t-frame left image and the t-frame right image.
The image synthesis processing is performed on the left image and the right image in each frame of binocular image, and is a continuous and frame-by-frame process. This process is typically applied to a video stream or a sequence of successive images to generate a composite image corresponding to each frame of binocular image.
First, a t-1 frame binocular image is processed. This frame includes a t-1 frame left image and a t-1 frame right image. The two images capture different portions of the same scene through different camera perspectives, respectively. The two images may have undergone an alignment operation prior to the image synthesis process to ensure their spatial consistency. After alignment is completed, image synthesis processing is carried out on the t-1 frame left image and the t-1 frame right image. This process may involve various synthesis methods, such as weighted averaging, maximum synthesis, or minimum synthesis, depending on the application scenario and requirements. The goal of the synthesis is to generate a t-1 frame synthesized image that blends the two image information. During the composition, the left and right images will have a part of the overlap region, such as an overlap region with 94 °, due to the parallax between them. The overlapping area is specially processed in the composite image to preserve information from both images. For example, the first overlap region may be marked using transparency overlay, color mapping, or edge delineation, etc., to be clearly visible in the composite image.
Then, the t-th frame binocular image is processed. Like the t-1 th frame, the t-th frame also contains a left image and a right image. The two images are also aligned to ensure their spatial correspondence. Then, image synthesis processing is performed on the t-th frame left image and the t-th frame right image, and a t-th frame synthesized image is generated.
In the t-th frame composite image, there is also a second overlap region, which is caused by the parallax of the t-th frame left image and the t-th frame right image. Similar to the first overlapping region, the second overlapping region also needs to be specially processed in the composite image in order to preserve the information from both images and highlight its position.
By doing this synthesis processing for each frame of binocular image, a series of consecutive synthesized images can be generated that not only preserve the information of the original left and right images, but also provide more rich scene information by highlighting the overlapping areas. The processing method has wide application value in the fields of stereoscopic vision, three-dimensional reconstruction, target tracking and the like.
And 130, carrying out local motion detection according to the overlapped area to obtain a motion detection result.
In some embodiments, the performing local motion detection according to the overlapping area, and obtaining a motion detection result includes: calculating an optical flow field according to the first overlapping area in the t-1 frame composite image and the second overlapping area in the t frame composite image; and obtaining a motion detection result according to the optical flow field.
In a virtual reality application scenario, especially when a monochrome binocular camera is used, local motion detection is a key step for ensuring the smoothness and authenticity of the user experience. Because of the specific technical characteristics of the monochrome binocular camera, such as lower resolution, smaller data volume, shorter exposure time, and higher consistency of pixel movement, optimizing for these characteristics is particularly important. Furthermore, considering that the head movement of the user is frequent and rapid during use of the head mounted display device, the local motion detection algorithm needs to be able to respond to these changes quickly and accurately.
In the embodiment of the application, the optical flow method is adopted for local motion detection, which is an efficient and practical choice. The optical flow method estimates the motion pattern of pixels, namely the optical flow field, by analyzing how pixels in an image sequence change over time and the correlation between pixels in adjacent frame images. This technique is not only suited to the data characteristics of the monochrome binocular camera, but can also effectively cope with rapid changes caused by the user's head movement. The optical flow field reflects the motion information of objects in the scene, thereby facilitating the extraction of moving objects.
Because the See-through scene has a high requirement on real-time performance, the scheme can use an optical flow method with a small calculation amount and a high calculation speed.
In calculating the optical flow field, an optical flow method based on brightness change, such as a sparse optical flow (Lucas-Kanade) algorithm or a dense optical flow (Horn-Schunck) algorithm, can be adopted. These algorithms estimate the motion of the pixel by minimizing the luminance error based on the luminance conservation assumption.
For the Lucas-Kanade algorithm, it assumes that all pixels within a small window have the same motion and solves the optical flow field by an iterative method; that is, by minimizing the sum of squares of luminance variations within a local window, the algorithm can estimate the motion vector for the center pixel of each window; this algorithm is suitable for situations where motion is small in the scene and is computationally efficient.
For the Horn-Schunck algorithm, the optical flow field is considered to vary smoothly across the image based on a global smoothness assumption. By introducing a global smoothness constraint, the algorithm can solve for the optical flow field of the entire image. This algorithm is more effective for handling complex scenes and large movements.
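For illustration only, a minimal Python/OpenCV sketch of the two flavors described above, assuming 8-bit monochrome overlap crops; OpenCV's Farneback dense flow is used here as a readily available stand-in for a dense method such as Horn-Schunck and is not claimed to be the implementation used in the embodiment.

```python
import cv2
import numpy as np

def sparse_lk_flow(prev_overlap: np.ndarray, curr_overlap: np.ndarray):
    # Lucas-Kanade: track corners detected in the previous overlap region.
    pts = cv2.goodFeaturesToTrack(prev_overlap, maxCorners=200,
                                  qualityLevel=0.01, minDistance=7)
    if pts is None:
        return None
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_overlap, curr_overlap,
                                                 pts, None,
                                                 winSize=(15, 15), maxLevel=2)
    good = status.ravel() == 1
    return pts[good].reshape(-1, 2), nxt[good].reshape(-1, 2)

def dense_flow(prev_overlap: np.ndarray, curr_overlap: np.ndarray):
    # Dense flow over the overlap region (H x W x 2 result).
    return cv2.calcOpticalFlowFarneback(prev_overlap, curr_overlap, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
```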
In addition, correlation-based optical flow methods are also an effective option, such as the normalized cross-correlation (Normalized Cross-Correlation, NCC) optical flow algorithm and the sum of squared differences (Sum of Squared Differences, SSD) optical flow algorithm. These algorithms estimate motion by computing the correlation of pixels between adjacent frames, and often give good results for regions that are rich in texture or have significant features.
For the NCC optical flow algorithm, optical flow is calculated based on the correlation between pixels in the image sequence. The core idea is to compare the normalized cross-correlation between corresponding regions in the current frame and the previous frame, so as to estimate the motion of a pixel point. In the NCC optical flow algorithm, a pixel is first selected as the center and a neighborhood window is defined. Then, the previous frame image is searched for the region most similar to the neighborhood window in the current frame. The similarity is measured by calculating the normalized cross-correlation coefficient between the two windows, that is, the cross-correlation of the two windows normalized by the square root of the product of their sums of squared pixel values. When the most similar region is found, the motion vector of the pixel point can be estimated by comparing the positions of the center pixels of the two regions.
For the SSD optical flow algorithm, optical flow is calculated based on changes in pixel luminance values. The basic idea is to estimate the motion of a pixel by comparing the squared differences of the luminance values of corresponding pixels in the current frame and the previous frame. In the SSD optical flow algorithm, a pixel is likewise selected as the center and a neighborhood window is defined. Then, the sum of squared differences between the luminance values of the pixels in that window in the current frame and the pixels in the corresponding window in the previous frame is calculated. By minimizing this sum of squared differences, the pixel position in the previous frame that most closely matches the current frame can be found, thereby estimating the motion vector of the pixel point.
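A minimal block-matching sketch for one pixel, assuming the query point lies well inside the image, a fixed window and search radius, and OpenCV's matchTemplate as the scoring routine; cv2.TM_SQDIFF gives the SSD surface, while cv2.TM_CCOEFF_NORMED would give a normalized-correlation score for the NCC variant. Bounds checking is omitted.

```python
import cv2
import numpy as np

def block_match_flow(prev_overlap, curr_overlap, x, y, win=7, search=15):
    """Estimate the motion vector of pixel (x, y) by window matching."""
    # Template around (x, y) in the current frame.
    tpl = curr_overlap[y - win : y + win + 1, x - win : x + win + 1]
    # Search area around the same location in the previous frame.
    roi = prev_overlap[y - win - search : y + win + search + 1,
                       x - win - search : x + win + search + 1]
    score = cv2.matchTemplate(roi, tpl, cv2.TM_SQDIFF)      # SSD surface
    _minv, _maxv, minloc, _maxloc = cv2.minMaxLoc(score)    # best (smallest) SSD
    dx = minloc[0] - search                                 # displacement in x
    dy = minloc[1] - search                                 # displacement in y
    return dx, dy
```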
After the optical flow field is calculated, a motion detection result can be obtained according to the information of the optical flow field. The motion detection result may include a position, a speed, a direction, etc. of the moving object. By analyzing the optical flow field, the pixels which are moving and the pixels which are static can be judged, so that the moving target in the overlapping area can be accurately extracted.
In some embodiments, the calculating an optical flow field from the first overlapping region in the t-1 th frame composite image and the second overlapping region in the t-th frame composite image includes: performing feature point detection on the first overlapping region in the t-1 th frame composite image and the second overlapping region in the t-th frame composite image to obtain first feature point data corresponding to the t-1 th frame composite image and second feature point data corresponding to the t-th frame composite image; performing feature point matching on the first feature point data and the second feature point data to obtain matched feature point data; and obtaining the optical flow field according to the matched feature point data.
First, feature point detection is performed on the first overlapping region in the t-1 th frame composite image and the second overlapping region in the t-th frame composite image. The purpose of this step is to extract key feature points in the two adjacent frame images and provide a basis for subsequent feature point matching. There are various methods for detecting feature points. Harris corner detection is a common method, which screens corner points by calculating the corner response function value of each pixel in the image and applying a threshold, thereby obtaining key points with remarkable features. Corner points generally refer to boundary intersection points in an image or points where the gray value changes greatly; such points carry the characteristic information of the image while leaving out irrelevant information. Corner points are also the end points of isolated points and line segments with maximum or minimum intensity in certain attributes, which simplifies the image feature extraction flow while keeping the result accurate. In addition, FAST (Features from Accelerated Segment Test) is a high-speed corner detection algorithm that rapidly detects feature points by comparing the gray difference between a pixel point and its surrounding pixel points.
After the first feature point data corresponding to the t-1 th frame composite image and the second feature point data corresponding to the t-th frame composite image are obtained, the feature points need to be matched. Feature point matching is a key step that establishes correspondence between feature points in adjacent frames. In this step, a K-Nearest Neighbor (KNN) matching method or a brute-force matching method may be used. KNN matching is a matching method based on distance measurement, which finds the most similar matching points by calculating the distance between feature point descriptors. Brute-force matching is a simple and intuitive matching method that traverses all possible matching point pairs and calculates the similarity between them to find the best matching points.
Then, the optical flow field can be calculated according to the matched characteristic point data. The optical flow field is a two-dimensional vector field describing the motion pattern of the pixels in the image, and reflects the motion direction and speed of the pixels between adjacent frames. The motion vector of each characteristic point can be estimated through the matched characteristic point data, and then the optical flow field of the whole overlapping area can be obtained through interpolation or fitting and other methods. The optical flow field provides important information for subsequent tasks such as motion detection, target tracking and the like.
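As a hedged illustration of the detect-match-estimate chain described above, the sketch below uses ORB (a FAST-based detector with binary descriptors) as a stand-in, because KNN matching needs descriptors that raw Harris or FAST corners alone do not provide; the function name and parameters are assumptions, not taken from the embodiment.

```python
import cv2
import numpy as np

def matched_point_flow(prev_overlap, curr_overlap):
    """Return matched points in the previous and current overlap regions."""
    orb = cv2.ORB_create(nfeatures=500)
    kp1, des1 = orb.detectAndCompute(prev_overlap, None)
    kp2, des2 = orb.detectAndCompute(curr_overlap, None)
    if des1 is None or des2 is None:
        return np.empty((0, 2)), np.empty((0, 2))

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    pairs = matcher.knnMatch(des1, des2, k=2)

    # Lowe ratio test keeps only unambiguous matches.
    prev_pts, curr_pts = [], []
    for m, n in (p for p in pairs if len(p) == 2):
        if m.distance < 0.75 * n.distance:
            prev_pts.append(kp1[m.queryIdx].pt)
            curr_pts.append(kp2[m.trainIdx].pt)

    prev_pts = np.asarray(prev_pts, dtype=np.float32)
    curr_pts = np.asarray(curr_pts, dtype=np.float32)
    # Per-point motion vectors are curr_pts - prev_pts; a dense field can then
    # be obtained by interpolation or fitting, as described above.
    return prev_pts, curr_pts
```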
Taking a specific application scenario as an example, assume that a set of low-resolution 6DoF (six degrees of freedom) image data is being processed. These image data are input frame by frame into the algorithm SDK. For each frame, feature point detection is performed on the current frame and the previous frame image, using methods such as Harris corner detection or FAST feature point detection. Then, the detected feature points are matched, using strategies such as KNN matching or brute-force matching. Next, the optical flow field can be calculated according to the matched feature point data, so as to obtain the motion information of the pixel points in the image. Such information is critical to tasks such as understanding image content, detecting moving objects, and achieving a higher level of scene understanding.
In general, by performing feature point detection, matching and optical flow field calculation on the overlapping areas in adjacent frames, motion information in the images can be more accurately understood, and powerful technical support is provided for applications such as virtual reality and augmented reality.
In some embodiments, the obtaining a motion detection result according to the optical flow field includes: analyzing the optical flow field, determining moving objects in the first overlap region and the second overlap region; determining a motion area according to the motion information of the moving object; and acquiring a first motion detection result, wherein the first motion detection result comprises first prompt information and coordinate information of the motion area, and the first prompt information is used for prompting the existence of local motion.
The optical flow field is then analyzed in depth to determine the moving objects in the first overlapping region and the second overlapping region. This step typically involves examining the direction and magnitude of the vectors in the optical flow field. The optical flow vectors describe the direction and speed of movement of pixel points between successive frames; therefore, by comparing changes in the optical flow vectors, moving objects can be detected. Specific analysis methods may include calculating the gradient or divergence of the optical flow field, which gives a better understanding of its characteristics so that moving targets can be accurately identified.
Then, a moving region is determined according to the moving information of the moving object. The motion information generally includes a position, a velocity, a motion trajectory, and the like of a moving object. By extracting these pieces of key information, the boundary of the moving object can be further determined, thereby dividing the moving area. In this step, it may be necessary to use some image processing techniques, such as contour detection, thresholding, etc., which are able to more accurately mark the motion region.
Then, a first motion detection result is acquired. The result includes the first prompt information and the coordinate information of the motion region. The first prompt information is used to indicate the presence of local motion, and may be a simple text message such as "local motion detected" or a visual prompt such as highlighting the motion region. The coordinate information of the motion region describes in detail the position of the motion region in the image, typically including the upper-left and lower-right corner coordinates of the motion region. This information is critical for subsequent processing and analysis.
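The following sketch shows one possible way to turn a dense optical flow field into such a result; the magnitude threshold, minimum area, and the dictionary layout of the returned result are illustrative assumptions, and the else branch corresponds to the second motion detection result described below.

```python
import cv2
import numpy as np

def motion_result_from_flow(flow: np.ndarray, mag_thresh: float = 1.0):
    """Turn a dense flow field (H x W x 2) into a motion detection result."""
    mag = np.linalg.norm(flow, axis=2)
    mask = (mag > mag_thresh).astype(np.uint8) * 255
    # Morphological opening removes isolated noisy pixels.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))

    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 25]

    if boxes:  # first motion detection result: local motion present
        return {"prompt": "local motion detected",
                "regions": [(x, y, x + w, y + h) for x, y, w, h in boxes]}
    # second motion detection result: no local motion
    return {"prompt": "no local motion", "regions": []}
```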
In some embodiments, the obtaining a motion detection result according to the optical flow field further includes:
And when the fact that the moving object does not exist in the first overlapping area and the second overlapping area is determined, a second motion detection result is obtained, wherein the second motion detection result comprises second prompt information, and the second prompt information is used for prompting that the local motion does not exist.
Further, if it is determined that there is no moving object in the first overlapping region and the second overlapping region, a second motion detection result is acquired. The result includes a second cue for the absence of local motion. The prompt message in this case may also be a text or visual prompt so that the user or system can quickly understand the current state of motion.
The motion detection result of the local motion has many uses. For example, if a fast-moving object is detected in the current scene, snapshot logic is started, the exposure time is reduced, and the inter-frame fusion strength is reduced. For example, if no moving object is detected in the picture, the exposure time is dynamically lengthened to improve the image quality. For example, the motion detection result can also be used for dynamically switching the multi-frame synthesis function, dynamically selecting the fusion algorithm, and so on.
In some embodiments, the method is applied to a head mounted display device having a color camera and a display screen configured thereon; the method further comprises the steps of: acquiring a color image sequence acquired by the color camera; constructing a three-dimensional grid based on a composite image corresponding to each frame of the binocular image in the binocular image sequence; performing image fusion processing on the color image sequence and the three-dimensional grid to obtain a target image; and displaying the target image on the display screen.
The method may be applied, among other things, to head-mounted display devices, which typically integrate a color camera and a display screen to support applications such as extended reality (XR), virtual reality (VR), augmented reality (AR), or mixed reality (MR). The binocular vision principle can be fully utilized to realize high-quality image processing and virtual-real fusion effects.
Wherein the head mounted display device captures a sequence of color images of the external environment through its color camera. The image sequences contain rich color and texture information and provide basic data for subsequent three-dimensional reconstruction and image fusion.
Wherein, based on the composite image corresponding to each frame of binocular image in the binocular image sequence, the system starts to construct a three-dimensional grid. In the construction process, the composite image plays a key role. The composite images are fused by binocular vision algorithm to generate images containing depth information. Depth information is critical to the construction of a three-dimensional grid, which can help the system accurately restore the three-dimensional shape and position of objects in the scene.
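As a sketch of how depth information can feed the mesh construction (assuming a pinhole camera model, known intrinsics and baseline, and a disparity map derived from the composite image; none of these values come from the source):

```python
import numpy as np

def disparity_to_points(disparity, fx, fy, cx, cy, baseline):
    """Back-project a disparity map to 3-D points in the camera frame.

    fx, fy, cx, cy are assumed pinhole intrinsics and `baseline` is the
    distance between the two cameras; all are placeholders.
    """
    h, w = disparity.shape
    valid = disparity > 0
    z = np.zeros_like(disparity, dtype=np.float32)
    z[valid] = fx * baseline / disparity[valid]          # Z = f * B / d

    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=1)  # N x 3 points
```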
As shown in fig. 2, in the process of constructing the three-dimensional mesh, in order to further improve the quality and accuracy of the mesh, screening and triangulation processes may be performed on the composite image. The screening operation removes noise and extraneous information and preserves the portions useful for constructing the mesh. The triangulation process converts the screened image data into a triangular mesh form, forming triangular patches by connecting adjacent points, so that a complete three-dimensional mesh is constructed. In this process, a triangulation algorithm from computer graphics can be used to ensure that the generated mesh has good geometric properties and topological structure.
After the three-dimensional grid is constructed, the color image sequence and the three-dimensional grid are subjected to image fusion processing. The goal of this step is to map the real world color and texture information onto a three-dimensional grid, generating a target image with realism. In this process, a series of operations may be performed to optimize the fusion effect. For example, time domain filtering, re-projection, frame picking, rendering, distortion correction, etc. may be performed to obtain the target image.
For example, a time domain filtering process is performed. The time domain filtering can smooth the image sequence in the time domain, eliminating high frequency noise and glitches caused by camera shake or environmental changes. Through time domain filtering, the stability and the continuity of the image sequence can be improved, and a foundation is laid for subsequent image fusion.
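A minimal temporal-filter sketch, assuming a simple per-pixel exponential moving average with an arbitrary smoothing constant (the class name and value are illustrative):

```python
import numpy as np

class TemporalFilter:
    """Exponential moving average over frames; alpha is an assumed constant."""

    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha
        self.state = None

    def __call__(self, frame: np.ndarray) -> np.ndarray:
        f = frame.astype(np.float32)
        if self.state is None:
            self.state = f                      # first frame initializes the filter
        else:
            self.state = self.alpha * f + (1.0 - self.alpha) * self.state
        return self.state.astype(frame.dtype)
```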
For example, a reprojection operation is performed. Reprojection is the process of mapping points on a three-dimensional grid onto a two-dimensional image plane according to a specific viewing angle and projection rule. Through reprojection, the three-dimensional grid and the color image can be ensured to be consistent in geometry, and an accurate corresponding relation is provided for subsequent color mapping and texture mapping.
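For illustration, the reprojection step can be sketched with OpenCV's projectPoints, assuming the target view's pose and calibrated intrinsics are available (all inputs are placeholders):

```python
import cv2
import numpy as np

def reproject_mesh_vertices(vertices, rvec, tvec, camera_matrix, dist_coeffs):
    """Project 3-D mesh vertices onto the 2-D image plane of a target view.

    rvec/tvec describe the target view's pose; camera_matrix and dist_coeffs
    are its intrinsics, assumed to come from calibration.
    """
    img_pts, _ = cv2.projectPoints(np.asarray(vertices, dtype=np.float32),
                                   rvec, tvec, camera_matrix, dist_coeffs)
    return img_pts.reshape(-1, 2)
```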
In addition, the operations of frame picking, rendering, distortion correction and the like are also important links in the image fusion process. The frame picking is to select key frames from a continuous image sequence for fusion so as to reduce the calculation amount and improve the processing speed. Rendering is to render the processed image data onto a display screen for presentation to a user. And the distortion correction is to correct lens distortion possibly existing in the head-mounted display equipment, so that the displayed image is ensured to accord with the visual habit of human eyes.
After the series of processing, a target image in which real world information and virtual elements are fused is obtained. The images are displayed to the user through the display screen of the head-mounted display device, and an immersive virtual-real combination experience is provided for the user.
In summary, the method achieves a high-quality virtual-real fusion effect by comprehensively using binocular vision, three-dimensional mesh construction, and image fusion processing technologies. The application of these techniques not only promotes the interactivity and immersion of the head mounted display device, but also opens up new possibilities for future AR and MR applications.
As shown in fig. 2, a Mesh (Mesh) triangle network may be constructed based on a three-dimensional Mesh, and then the Mesh triangle network and the color image sequence may be subjected to image fusion processing to obtain a target image. This process can provide finer geometries and smoother surface details, helping to generate higher quality virtual-real fusion images.
First, triangulation is performed using a specific algorithm based on point cloud data in a three-dimensional grid. Triangulation is a process of dividing point cloud data into a series of triangles, which are connected to each other to form a Mesh triangle network together. In the process, the algorithm considers the factors such as the spatial distribution, the density, the surface characteristics and the like of the point cloud data so as to ensure that the generated Mesh triangular network can accurately represent the shape and the structure of the three-dimensional object.
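A hedged sketch of one common triangulation shortcut for a single depth view: triangulate in the 2-D image plane and reuse the connectivity for the 3-D points; the screening threshold is an assumed value, not taken from the embodiment.

```python
import numpy as np
from scipy.spatial import Delaunay

def triangulate_point_cloud(points_3d: np.ndarray, image_uv: np.ndarray):
    """Build a Mesh triangle network from a point cloud of one depth view.

    points_3d: N x 3 points; image_uv: the corresponding N x 2 pixel
    coordinates used for 2-D Delaunay triangulation.
    """
    tri = Delaunay(image_uv)
    faces = tri.simplices                      # M x 3 vertex indices
    # Screening: drop long, sliver-like triangles that span depth edges.
    edge_len = np.linalg.norm(
        points_3d[faces] - points_3d[np.roll(faces, 1, axis=1)], axis=2)
    keep = edge_len.max(axis=1) < 0.1          # assumed 10 cm threshold
    return points_3d, faces[keep]
```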
Secondly, in order to improve the quality and performance of the Mesh triangle network, some optimization treatments can be performed. For example, redundant triangles may be removed, adjacent similar triangles may be merged, the size and shape of the triangles may be adjusted, and so on. The optimization operations can reduce the complexity of the Mesh triangle network and improve the processing efficiency and rendering performance of the Mesh triangle network.
And then, carrying out image fusion processing on the constructed Mesh triangular network and the color image sequence. In this process, the geometry of the Mesh triangle network and the pixel information of the color image sequence need to be considered. The color, texture and other information in the color image sequence can be accurately mapped to the corresponding position of the Mesh triangle network by applying texture mapping algorithm, color interpolation technology, illumination model and coloring technology, image fusion and optimization and other algorithms and technologies. The Mesh triangle network not only has accurate geometric shape, but also can present real color and texture effects.
Regarding texture mapping algorithms: texture mapping is the process of applying a two-dimensional image (texture) to a three-dimensional object surface. When mapping a color image sequence onto a Mesh triangle network, methods such as UV mapping, parameterized mapping or image-based texture mapping can be adopted. The algorithms correspond pixels in the color image to surface elements of the Mesh triangular network according to vertex coordinates and texture coordinates of the Mesh triangular network, so that accurate mapping of textures is realized.
Regarding color interpolation techniques: since the Mesh triangle network is composed of a series of triangles, each triangle may correspond to a plurality of pixels. Therefore, it is necessary to ensure smooth transitions of color and texture inside each triangle by color interpolation techniques. Common interpolation methods include linear interpolation, bilinear interpolation, or higher order interpolation methods. According to the method, the color or texture value of any point in the triangle is calculated according to the color value or texture coordinate of the vertex of the triangle, so that smooth color and texture transition is realized.
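As a small worked example of linear (barycentric) interpolation inside one triangle, with illustrative names and shapes:

```python
import numpy as np

def barycentric_color(p, tri_xy, tri_colors):
    """Interpolate a color at point p inside one triangle.

    p: (x, y) query point; tri_xy: three vertex positions; tri_colors:
    three vertex colors (each a length-C array).
    """
    a, b, c = (np.asarray(v, dtype=float) for v in tri_xy)
    v0, v1, v2 = b - a, c - a, np.asarray(p, dtype=float) - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    w1 = (d11 * d20 - d01 * d21) / denom       # weight of vertex b
    w2 = (d00 * d21 - d01 * d20) / denom       # weight of vertex c
    w0 = 1.0 - w1 - w2                         # weight of vertex a
    return w0 * tri_colors[0] + w1 * tri_colors[1] + w2 * tri_colors[2]
```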
Regarding illumination models and coloring techniques: in order to enhance the realism of the Mesh triangle network surface, illumination models and coloring techniques can be introduced. The illumination model describes the interaction between the light source and the object surface, affecting the shading and color change of the object surface. By calculating the irradiation effect of the light source on the Mesh triangular network surface, realistic shadow, highlight, reflection and other effects can be generated. The coloring technology is used for rendering the Mesh triangle network according to the color value calculated by the illumination model, so that the Mesh triangle network presents more real color and texture effects.
Regarding image fusion and optimization: after mapping the color image sequence onto the Mesh triangle network, image fusion and optimization operations may also be required to further improve the fusion effect. This includes eliminating seams between images, adjusting color balance and contrast, optimizing texture resolution, etc. These operations may be implemented by image processing algorithms and post-processing techniques to ensure that the final target image is best viewed.
In order to achieve high quality image fusion, some advanced image processing techniques may be employed. For example, the time domain filtering can be utilized to eliminate noise and interference in the image, so that the definition and stability of the image are improved; the geometrical consistency of the Mesh triangle network and the color image can be ensured by utilizing a reprojection technology; the fusion effect can be further optimized by utilizing the operations of frame picking, rendering, distortion correction and the like. And obtaining the target image fused with the Mesh triangular network and the color image sequence through image fusion processing. The target image not only keeps the geometric shape and structure information of the Mesh triangle network, but also integrates the color and texture information of the color image sequence, and a more real and vivid visual effect is presented.
By constructing the Mesh triangular network and then performing image fusion processing, the geometric information of the three-dimensional grid and the visual information of the color image sequence can be fully utilized, and a high-quality virtual-real fusion effect can be realized. This not only can promote the interactivity and the immersive sensation of head-mounted display equipment, but also can bring richer, true visual experience for the user.
For example, as shown in fig. 2, in a head-mounted display device or other visual processing application, in addition to the Mesh triangle network and the color image sequence, image fusion processing may be performed in combination with other data, so as to obtain a more accurate and rich target image. These other data may include Time of Flight (ToF) data, environmental texture (Environmental Texture, ET) data, inertial measurement unit (Inertial Measurement Unit, IMU) data, and the like, each of which has unique meaning and use.
The ToF data, among other things, is the distance between an object and a sensor determined by measuring the time it takes for light to travel from emission to reception. It uses light sources such as infrared light or laser to emit pulses and measures the time required for these pulses to return, thereby obtaining depth information of the object. The ToF data can provide high-precision depth images, which are important for constructing Mesh triangle networks, performing three-dimensional reconstruction, and achieving more accurate image fusion. By combining the ToF data with the color image sequence, a more real and three-dimensional target image can be generated, and the visual experience of a user is improved.
Where ET data generally refers to texture information of the surface of objects in a scene, which information may be captured by a camera or other sensor. The environmental texture data can reflect the details and texture of the object surface, and provides rich texture information for image fusion processing. The ET data can enhance the surface detail of the Mesh triangular network, so that the generated target image is more real and finer in texture. By combining ET data with color image sequences and depth information, higher-quality image fusion can be realized, and the visual effect and sense of reality of images are improved.
The IMU data contains measurement information of sensors such as an accelerometer and a gyroscope, and can detect the motion state of equipment in real time, such as acceleration, angular velocity and the like. The data has important significance in the aspects of motion tracking, attitude estimation, stability control and the like. The IMU data can provide real-time information about the motion of the head mounted display device or camera, helping to correct image distortion and blurring due to motion. In the image fusion process, through combining IMU data with a color image sequence, depth information and texture data, more accurate alignment and fusion can be realized, the influence of motion on image quality is reduced, and the overall visual effect is improved.
In the image fusion processing process, toF data, ET data and IMU data can be fused with a Mesh triangle network and a color image sequence. By combining these data from different sources, more rich and accurate three-dimensional information, texture details, and motion states can be obtained. The multi-source data fusion method can improve the performance of the head-mounted display device and provide more realistic and natural visual experience for users.
In some embodiments, the method further comprises:
shortening the exposure time of the color camera based on the first motion detection result; or alternatively
And extending the exposure time of the color camera based on the second motion detection result.
In a head-mounted display device or other similar applications, when Mesh computation and color image sequence processing are combined, dynamic adjustment of the exposure time of a color camera is an important optimization measure. The method can flexibly adjust the exposure parameters according to the motion condition in the scene, thereby optimizing the image quality and reducing the adverse effect generated by the motion.
Specifically, when the system detects local motion through the motion detection module (i.e., the first motion detection result), it shortens the exposure time of the color camera. The reason is that a shorter exposure time can reduce the smear caused by object movement. Smear generally occurs when the exposure time is too long and the object is displaced during the exposure, causing the contours of the object in the image to become blurred or to appear as multiple overlapping images. Shortening the exposure time ensures that each instant is captured clearly, thereby reducing or eliminating smear and making the moving object clearer.
In contrast, when the motion detection algorithm determines that there is no local motion in the scene (i.e., the second motion detection result), the system may extend the exposure time of the color camera. Increasing the exposure time helps to increase the signal-to-noise ratio, i.e. the intensity of the image signal relative to the intensity of the noise. By doing so, more light information can be acquired in a relatively static scene, so that the image is brighter, the details are richer, and the influence of random noise is reduced. Extending exposure times is particularly useful in scenes where light conditions are poor or higher image quality is desired.
By combining the three-dimensional grid with the dynamic exposure adjustment technique, a head mounted display device or other related application can provide a more accurate and clearer virtual-real fusion image. Whether reducing smear in a moving scene or improving image quality in a static scene, this technique can bring a smoother and more realistic visual experience to users.
For example, the exposure time of a target binocular camera (such as a monochrome binocular camera) may also be shortened based on the first motion detection result; or based on the second motion detection result, the exposure time of the target binocular camera is prolonged. Dynamic adjustment of the exposure time of the target binocular camera is also an important optimization measure. The binocular camera simulates a visual system of human eyes through two cameras side by side, so that depth information and three-dimensional space sense are obtained. The exposure parameters of the scene are flexibly adjusted according to the movement condition in the scene, so that the image quality can be further optimized and the adverse effect caused by movement can be reduced.
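A minimal controller sketch of this dynamic exposure policy, reusing the illustrative result dictionary from the earlier motion-detection sketch; the camera interface and exposure values are placeholders, not the device's real API.

```python
def adjust_exposure(camera, result, short_us=4000, long_us=16000):
    """Pick the exposure time of a color or binocular camera from the result.

    `camera.set_exposure_us` and the microsecond values are hypothetical.
    """
    if result["regions"]:                 # first result: local motion present
        camera.set_exposure_us(short_us)  # shorter exposure reduces smear
    else:                                 # second result: no local motion
        camera.set_exposure_us(long_us)   # longer exposure improves SNR
```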
In some embodiments, the head mounted display device is further configured with a facial recognition module thereon; the method further comprises the steps of:
Increasing the sensitivity of the face recognition module based on the first motion detection result; or alternatively
And reducing the sensitivity of the face recognition module based on the second motion detection result.
For example, head mounted display devices also integrate a facial recognition module. The face recognition module is mainly used for recognizing and analyzing the facial features of the user, so that more accurate face tracking and focusing are realized, and the user experience is improved. The motion detection result can be used for optimizing image acquisition and processing, and can also be used for dynamically adjusting the sensitivity of the face recognition module, so that the accuracy and stability of face recognition are further improved.
When the local motion detection module of the head-mounted display device outputs a first motion detection result, that is, detects that there is a local motion, the system may correspondingly increase the sensitivity of the face recognition module. This is because, in a moving state, the face of the user may be displaced, rotated, or deformed, causing the facial features to become blurred or difficult to recognize. By increasing the sensitivity of the facial recognition module, the system is able to respond to these changes more quickly, capturing more facial details, and thus increasing the rate of facial recognition in motion.
In contrast, when the local motion detection module outputs the second motion detection result, that is, it is determined that there is no local motion, the system may decrease the sensitivity of the face recognition module. The purpose of this is to reduce misrecognition and system power consumption. In a static or relatively stable scene, the facial features of the user are often clear and stable; excessive sensitivity in this case may lead to misrecognition or redundant computation, wasting computing resources and possibly degrading system performance. Reducing the sensitivity allows the facial recognition module to operate more efficiently and stably while maintaining sufficient accuracy.
The strategy for dynamically adjusting the sensitivity of the face recognition module combines the advantages of the motion detection result and the face recognition technology, and can optimize the performance of face recognition according to the requirements of actual scenes. The face recognition method not only can improve the face recognition rate in the motion state and reduce the situations of false recognition and missing recognition, but also can reduce the power consumption and the calculation burden of the system in the static scene and realize more efficient resource utilization.
In some embodiments, the head mounted display device is further configured with a hand detection module thereon; the method further comprises the steps of:
starting the hand detection module based on the first motion detection result; or alternatively
And closing the hand detection module based on the second motion detection result.
For example, the head-mounted display device is also configured with a hand detection module. The hand detection module is mainly used for tracking and analyzing the hand actions of the user in real time, providing a more natural and intuitive interaction experience. However, hand detection algorithms are often complex and occupy significant computing resources. If the hand detection module is kept on for a long time, computing resources may be wasted and problems such as device heating may even arise. Therefore, dynamically turning the hand detection module on or off according to the motion detection result becomes an efficient and energy-saving solution.
For example, when the local motion detection module of the head-mounted display device outputs a first motion detection result, that is, detects that there is a local motion, the system may turn on the hand detection module. This is because the local motion often means that the user is performing a certain hand motion or interaction operation, and at this time, turning on the hand detection module can capture the hand motion of the user in real time, so as to provide accurate interaction feedback for the user. For example, in a virtual reality game, a user may control a game character or operate a game prop through hand movements, and the turning on of the hand detection module can ensure that the movements are accurately recognized and responded to.
For example, when the local motion detection module outputs a second motion detection result, i.e., it is determined that there is no local motion, the system may turn off the hand detection module. Because there is little or no significant change in the user's hand motion in static or relatively stable situations, continuing to turn on the hand detection module at this time not only does not provide additional interactive value, but also wastes computing resources and may cause the device to heat. By closing the hand detection module, the system can release computing resources, reduce power consumption, and maintain stable operation of the device.
The strategy of dynamically enabling the hand detection module based on the motion detection result can optimize the allocation of computing resources according to the requirements of the actual scene. The hand detection module can be quickly started when needed to provide accurate interaction feedback, and closed in time when not needed to avoid wasting resources and overheating the device. This dynamic management mode not only improves the user experience, but also prolongs the service life of the device.
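Tying the above policies together, a hedged dispatcher sketch that reuses the illustrative result dictionary from the earlier sketch; the face and hand module interfaces are hypothetical stand-ins for the device's real APIs.

```python
def apply_motion_policy(result, face_module, hand_module):
    """Dispatch the motion detection result to the other on-device modules."""
    moving = bool(result["regions"])       # first vs. second detection result
    # Higher face-recognition sensitivity while the scene is moving.
    face_module.set_sensitivity("high" if moving else "low")
    if moving:
        hand_module.start()                # user likely interacting by hand
    else:
        hand_module.stop()                 # save compute and power when idle
```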
The embodiment of the application creatively introduces a local motion detection algorithm in the processing flow of the virtual reality related algorithm. The step plays a vital role in a virtual reality related algorithm, and can accurately identify and track dynamic changes in a scene and provide key information for subsequent interaction, rendering and positioning. By introducing a local motion detection algorithm, the embodiment of the application obviously improves the response speed and accuracy of the virtual reality system and provides a smoother and natural experience for users.
The embodiment of the application can perform local motion detection using the overlapping area obtained after the monochrome binocular images are combined, instead of directly processing the RGB image. This innovative strategy makes full use of the characteristics of the monochrome binocular image and effectively improves the efficiency and accuracy of local motion detection. Meanwhile, by running in parallel with the RGB image processing, the embodiment of the application further improves the processing speed of the whole system and realizes efficient, real-time local motion detection. This parallel processing strategy enables the system to handle multiple tasks simultaneously, thereby improving overall performance.
The embodiment of the application successfully fills the defect of the See-through scene in the aspect of local motion detection by introducing a local motion detection algorithm. The realization of the technical effect enables the virtual reality system to more accurately identify and track the dynamic change in the scene, and provides more natural and real interactive experience for users.
The embodiment of the application remarkably reduces the bandwidth requirement of data transmission by using a monochrome binocular image with a smaller data volume for local motion detection. The data size of a monochrome binocular image is typically about 0.3 MB, which is much smaller than that of an RGB image, typically about 12 MB. This means that the embodiment of the application can save about 97% of bandwidth resources when performing local motion detection. This technical effect not only improves the processing efficiency of the system, but also helps reduce hardware and maintenance costs.
The local motion detection and ISP processing are completed in parallel, and the processing speed of the system is remarkably improved. Embodiments of the present application can reduce processing time by about 4ms compared to serial processing schemes common in the industry. The realization of the technical effect enables the system to respond to the operation and the instruction of the user more quickly, and provides smoother and natural virtual reality experience.
All the above technical solutions may be combined to form an optional embodiment of the present application, and will not be described in detail herein.
According to the embodiment of the application, a binocular image sequence is acquired, and each frame of binocular image in the binocular image sequence comprises a left image and a right image. This makes full use of the advantage of binocular vision: the two images capture the same scene from different angles and provide rich information for the subsequent image synthesis processing. Image synthesis processing is performed on the left image and the right image in each frame of binocular image to obtain a synthesized image corresponding to each frame of binocular image, where each frame of synthesized image comprises an overlapping area between the left image and the right image. This reduces the complexity of subsequent processing, and the overlapping area contains key visual information, providing an important basis for subsequent local motion detection. Local motion detection is then carried out according to the overlapping area to obtain a motion detection result. Compared with the traditional global motion detection mode, local motion detection is more efficient and accurate: by focusing on the overlapping area, unnecessary calculation is reduced, the calculation burden of the device is lowered, and the detection efficiency is improved; meanwhile, because the overlapping area contains abundant visual information, the motion detection result is more accurate and reliable. In summary, by acquiring the binocular image sequence, performing image synthesis processing, and performing local motion detection based on the overlapping area in the synthesized image corresponding to the binocular image, the embodiment of the application reduces the calculation load of the device and improves the detection efficiency.
In order to facilitate better implementation of the image processing method according to the embodiment of the present application, the embodiment of the present application further provides an image processing apparatus. Referring to fig. 7, fig. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the application. The image processing apparatus 200 may include:
A first obtaining unit 210, configured to obtain a binocular image sequence, where each frame of binocular image in the binocular image sequence includes a left image and a right image;
a first processing unit 220, configured to perform image synthesis processing on the left image and the right image in each frame of the binocular image, so as to obtain a synthesized image corresponding to each frame of the binocular image, where each frame of the synthesized image includes an overlapping area between the left image and the right image;
and a detecting unit 230, configured to perform local motion detection according to the overlapping area, and obtain a motion detection result.
In some embodiments, the first obtaining unit 210 may be configured to: and acquiring a binocular image sequence acquired by the target binocular camera, wherein the binocular image sequence comprises a t-1 frame binocular image and a t frame binocular image.
In some embodiments, the first processing unit 220 may be configured to: performing image synthesis processing on a t-1 frame left image and a t-1 frame right image in the t-1 frame binocular image to obtain a t-1 frame synthesized image corresponding to the t-1 frame binocular image, wherein the t-1 frame synthesized image comprises a first overlapping area between the t-1 frame left image and the t-1 frame right image, and t is a positive integer; and carrying out image synthesis processing on a t-frame left image and a t-frame right image in the t-frame binocular image to obtain a t-frame synthesized image corresponding to the t-frame binocular image, wherein the t-frame synthesized image comprises a second overlapping area between the t-frame left image and the t-frame right image.
In some embodiments, the detecting unit 230 may be configured to calculate an optical flow field according to the first overlapping region in the t-1 th frame composite image and the second overlapping region in the t-th frame composite image; and obtain a motion detection result according to the optical flow field.
In some embodiments, the detecting unit 230 may be configured to, when calculating the optical flow field according to the first overlapping region in the t-1 th frame composite image and the second overlapping region in the t-th frame composite image: perform feature point detection on the first overlapping region in the t-1 th frame composite image and the second overlapping region in the t-th frame composite image to obtain first feature point data corresponding to the t-1 th frame composite image and second feature point data corresponding to the t-th frame composite image; perform feature point matching on the first feature point data and the second feature point data to obtain matched feature point data; and obtain the optical flow field according to the matched feature point data.
In some embodiments, the detecting unit 230 may be configured to, when acquiring a motion detection result according to the optical flow field: analyzing the optical flow field, determining moving objects in the first overlap region and the second overlap region; determining a motion area according to the motion information of the moving object; and acquiring a first motion detection result, wherein the first motion detection result comprises first prompt information and coordinate information of the motion area, and the first prompt information is used for prompting the existence of local motion.
In some embodiments, the detecting unit 230 may be further configured to, when acquiring a motion detection result according to the optical flow field: and when the fact that the moving object does not exist in the first overlapping area and the second overlapping area is determined, a second motion detection result is obtained, wherein the second motion detection result comprises second prompt information, and the second prompt information is used for prompting that the local motion does not exist.
In some embodiments, the image processing apparatus 200 may be applied to a head-mounted display device on which a color camera and a display screen are configured; the image processing apparatus 200 further includes:
the second acquisition unit is used for acquiring the color image sequence acquired by the color camera;
the construction unit is used for constructing a three-dimensional grid based on the synthesized image corresponding to each frame of the binocular image in the binocular image sequence;
The fusion unit is used for carrying out image fusion processing on the color image sequence and the three-dimensional grid to obtain a target image;
And the display unit is used for displaying the target image on the display screen.
In some embodiments, the image processing apparatus 200 further comprises a second processing unit configured to:
shortening the exposure time of the color camera based on the first motion detection result; or alternatively
And extending the exposure time of the color camera based on the second motion detection result.
In some embodiments, the head mounted display device is further configured with a facial recognition module thereon; the image processing apparatus 200 further includes a third processing unit configured to:
Increasing the sensitivity of the face recognition module based on the first motion detection result; or alternatively
And reducing the sensitivity of the face recognition module based on the second motion detection result.
In some embodiments, the head mounted display device is further configured with a hand detection module thereon; the image processing apparatus 200 further includes a fourth processing unit configured to:
starting the hand detection module based on the first motion detection result; or alternatively
And closing the hand detection module based on the second motion detection result.
In some embodiments, the target binocular camera is a monochrome binocular camera, the left image is a monochrome left image, and the right image is a monochrome right image.
The respective units in the above-described image processing apparatus 200 may be implemented in whole or in part by software, hardware, and a combination thereof. The above units may be embedded in hardware or may be independent of a processor in the terminal device, or may be stored in software in a memory in the terminal device, so that the processor invokes and executes operations corresponding to the above units.
The image processing apparatus 200 may be integrated in a terminal or a server having a memory and a processor mounted therein and having an arithmetic capability, or the image processing apparatus 200 may be the terminal or the server.
In some embodiments, the present application further provides a terminal device, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps in the above method embodiments when executing the computer program.
As shown in fig. 8, fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present application, and the terminal device 300 may be generally provided in the form of glasses, a head mounted display (Head Mount Display, HMD), or a contact lens for realizing visual perception and other forms of perception, but the form of realizing the terminal device is not limited thereto, and may be further miniaturized or enlarged as required. The terminal device 300 may include, but is not limited to, the following:
The detection module 301: uses various sensors to detect user operation commands and apply them to the virtual environment, for example updating the images displayed on the display screen along with the user's line of sight so as to achieve user interaction with the virtual scene, such as updating the displayed content based on the detected direction of rotation of the user's head.
Feedback module 302: receiving data from the sensor, providing real-time feedback to the user; wherein the feedback module 302 may be for displaying a graphical user interface, such as displaying a virtual environment on the graphical user interface. For example, the feedback module 302 may include a display screen or the like.
Sensor 303: on one hand, an operation command from a user is accepted and acted on the virtual environment; on the other hand, the result generated after the operation is provided to the user in the form of various feedback.
Control module 304: the sensors and various input/output devices are controlled, including obtaining user data (e.g., motion, speech) and outputting sensory data, such as images, vibrations, temperature, sounds, etc., to affect the user, virtual environment, and the real world.
Modeling module 305: constructing a three-dimensional model of a virtual environment may also include various feedback mechanisms such as sound, touch, etc. in the three-dimensional model.
In an embodiment of the present application, a three-dimensional model of the constructed virtual environment may be constructed by the modeling module 305; displaying, by the feedback module 302, the virtual environment generated by the head mounted display device; acquiring a binocular image sequence through the detection module 301 and the sensor 303, wherein each frame of binocular image in the binocular image sequence comprises a left image and a right image; and performing image synthesis processing on the left image and the right image in each frame of binocular image through the control module 304 to obtain a synthesized image corresponding to each frame of binocular image, wherein each frame of synthesized image comprises an overlapping area between the left image and the right image, and performing local motion detection according to the overlapping area to obtain a motion detection result.
In some embodiments, as shown in fig. 9, fig. 9 is another schematic structural diagram of a terminal device according to an embodiment of the present application, where the terminal device 300 further includes a processor 310 with one or more processing cores, a memory 320 with one or more computer readable storage media, and a computer program stored in the memory 320 and capable of running on the processor. The processor 310 is electrically connected to the memory 320. It will be appreciated by those skilled in the art that the terminal device structure shown in the figures does not constitute a limitation of the terminal device, and may include more or less components than those illustrated, or may combine certain components, or may have a different arrangement of components.
The processor 310 is a control center of the terminal device 300, connects respective parts of the entire terminal device 300 using various interfaces and lines, and performs various functions of the terminal device 300 and processes data by running or loading software programs and/or modules stored in the memory 320 and calling data stored in the memory 320, thereby performing overall monitoring of the terminal device 300.
In the embodiment of the present application, the processor 310 in the terminal device 300 loads the instructions corresponding to the processes of one or more application programs into the memory 320 according to the following steps, and the processor 310 executes the application programs stored in the memory 320, so as to implement various functions:
Acquiring a binocular image sequence, wherein each frame of binocular image in the binocular image sequence comprises a left image and a right image; performing image synthesis processing on the left image and the right image in each frame of the binocular image to obtain a synthesized image corresponding to each frame of the binocular image, wherein each frame of the synthesized image comprises an overlapping area between the left image and the right image; and carrying out local motion detection according to the overlapped area to obtain a motion detection result.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
In some embodiments, the processor 310 may include a detection module 301, a control module 304, and a modeling module 305.
In some embodiments, as shown in fig. 9, the terminal device 300 further includes: radio frequency circuitry 306, audio circuitry 307, and power supply 308. The processor 310 is electrically connected to the memory 320, the feedback module 302, the sensor 303, the rf circuit 306, the audio circuit 307, and the power supply 308, respectively. It will be appreciated by those skilled in the art that the terminal device structure shown in fig. 8 or 9 does not constitute a limitation of the terminal device, and may include more or less components than illustrated, or may combine certain components, or may be arranged in different components.
The radio frequency circuitry 306 may be configured to receive and transmit radio frequency signals to and from a network device or other terminal device via wireless communication to and from the network device or other terminal device.
The audio circuit 307 may be used to provide an audio interface between the user and the terminal device via a speaker and a microphone. On one hand, the audio circuit 307 may convert received audio data into an electrical signal and transmit it to the speaker, which converts it into a sound signal for output. On the other hand, the microphone converts collected sound signals into electrical signals, which are received by the audio circuit 307 and converted into audio data; the audio data is then output to the processor 310 for processing and sent, for example, to another terminal device via the radio frequency circuit 306, or output to the memory for further processing. The audio circuit 307 may also include an earbud jack to provide communication between peripheral earbuds and the terminal device.
The power supply 308 is used to power the various components of the terminal device 300.
Although not shown in fig. 8 or 9, the terminal device 300 may further include a camera, a wireless fidelity module, a bluetooth module, an input module, and the like, which are not described herein.
In some embodiments, the present application also provides a computer-readable storage medium storing a computer program. The computer readable storage medium may be applied to a terminal device or a server, and the computer program causes the terminal device or the server to execute a corresponding flow in the image processing method in the embodiment of the present application, which is not described herein for brevity.
In some embodiments, the present application also provides a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the terminal device reads the computer program from the computer readable storage medium, and the processor executes the computer program, so that the terminal device executes a corresponding flow in the image processing method in the embodiment of the present application, which is not described herein for brevity.
The present application also provides a computer program, which is stored in a computer-readable storage medium. The processor of the terminal device reads the computer program from the computer-readable storage medium and executes it, so that the terminal device performs the corresponding flow in the image processing method in the embodiments of the present application; for brevity, details are not repeated herein.
It should be appreciated that the processor of an embodiment of the present application may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method embodiments may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
It will be appreciated that the memory in embodiments of the application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which is used as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function and working together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. If implemented in the form of a software functional unit and sold or used as a stand-alone product, the functions may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a terminal device (which may be a personal computer or a server) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The foregoing is merely illustrative of the present application and is not intended to limit it. Any variation or substitution readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall fall within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (16)
1. An image processing method, the method comprising:
acquiring a binocular image sequence, wherein each frame of binocular image in the binocular image sequence comprises a left image and a right image;
performing image synthesis processing on the left image and the right image in each frame of the binocular image to obtain a synthesized image corresponding to each frame of the binocular image, wherein each frame of the synthesized image comprises an overlapping area between the left image and the right image; and
performing local motion detection according to the overlapping area to obtain a motion detection result.
2. The image processing method according to claim 1, wherein the acquiring a binocular image sequence includes:
acquiring a binocular image sequence captured by a target binocular camera, wherein the binocular image sequence comprises a (t-1)-th frame binocular image and a t-th frame binocular image.
3. The image processing method according to claim 2, wherein performing the image synthesis processing on the left image and the right image in each frame of the binocular image to obtain the synthesized image corresponding to each frame of the binocular image comprises:
performing image synthesis processing on a (t-1)-th frame left image and a (t-1)-th frame right image in the (t-1)-th frame binocular image to obtain a (t-1)-th frame synthesized image corresponding to the (t-1)-th frame binocular image, wherein the (t-1)-th frame synthesized image comprises a first overlapping area between the (t-1)-th frame left image and the (t-1)-th frame right image, and t is a positive integer; and
performing image synthesis processing on a t-th frame left image and a t-th frame right image in the t-th frame binocular image to obtain a t-th frame synthesized image corresponding to the t-th frame binocular image, wherein the t-th frame synthesized image comprises a second overlapping area between the t-th frame left image and the t-th frame right image.
4. The image processing method according to claim 3, wherein performing the local motion detection according to the overlapping area to obtain the motion detection result comprises:
calculating an optical flow field according to the first overlapping area in the (t-1)-th frame synthesized image and the second overlapping area in the t-th frame synthesized image; and
obtaining the motion detection result according to the optical flow field.
5. The image processing method according to claim 4, wherein calculating the optical flow field according to the first overlapping area in the (t-1)-th frame synthesized image and the second overlapping area in the t-th frame synthesized image comprises:
performing feature point detection on the first overlapping area in the (t-1)-th frame synthesized image and the second overlapping area in the t-th frame synthesized image to obtain first feature point data corresponding to the (t-1)-th frame synthesized image and second feature point data corresponding to the t-th frame synthesized image;
performing feature point matching on the first feature point data and the second feature point data to obtain matched feature point data; and
obtaining the optical flow field according to the matched feature point data.
6. The image processing method according to claim 4, wherein obtaining the motion detection result according to the optical flow field comprises:
analyzing the optical flow field, and determining a moving object in the first overlapping area and the second overlapping area;
determining a motion area according to motion information of the moving object; and
obtaining a first motion detection result, wherein the first motion detection result comprises first prompt information and coordinate information of the motion area, and the first prompt information is used for prompting that local motion exists.
7. The image processing method according to claim 6, wherein obtaining the motion detection result according to the optical flow field further comprises:
when it is determined that no moving object exists in the first overlapping area and the second overlapping area, obtaining a second motion detection result, wherein the second motion detection result comprises second prompt information, and the second prompt information is used for prompting that no local motion exists.
8. The image processing method according to claim 7, wherein the method is applied to a head-mounted display device on which a color camera and a display screen are provided; the method further comprises the steps of:
acquiring a color image sequence captured by the color camera;
constructing a three-dimensional grid based on the synthesized image corresponding to each frame of the binocular image in the binocular image sequence;
performing image fusion processing on the color image sequence and the three-dimensional grid to obtain a target image; and
displaying the target image on the display screen.
9. The image processing method according to claim 8, wherein the method further comprises:
shortening the exposure time of the color camera based on the first motion detection result; or
extending the exposure time of the color camera based on the second motion detection result.
10. The image processing method of claim 8, wherein the head-mounted display device is further configured with a facial recognition module; the method further comprises the steps of:
increasing the sensitivity of the facial recognition module based on the first motion detection result; or
reducing the sensitivity of the facial recognition module based on the second motion detection result.
11. The image processing method according to claim 8, wherein the head-mounted display device is further provided with a hand detection module; the method further comprises the steps of:
enabling the hand detection module based on the first motion detection result; or
disabling the hand detection module based on the second motion detection result.
12. The image processing method according to claim 2, wherein the target binocular camera is a monochrome binocular camera, the left image is a monochrome left image, and the right image is a monochrome right image.
13. An image processing apparatus, characterized in that the apparatus comprises:
a first acquisition unit, configured to acquire a binocular image sequence, wherein each frame of binocular image in the binocular image sequence comprises a left image and a right image;
a first processing unit, configured to perform image synthesis processing on the left image and the right image in each frame of the binocular image to obtain a synthesized image corresponding to each frame of the binocular image, wherein each frame of the synthesized image comprises an overlapping area between the left image and the right image; and
a detection unit, configured to perform local motion detection according to the overlapping area to obtain a motion detection result.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, which is adapted to be loaded by a processor for performing the image processing method according to any of claims 1-12.
15. A terminal device, characterized in that the terminal device comprises a processor and a memory, the memory storing a computer program, the processor being adapted to execute the image processing method according to any one of claims 1-12 by calling the computer program stored in the memory.
16. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the image processing method of any of claims 1-12.
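As a further non-limiting sketch, the feature-point formulation of claims 5 to 7 and the exposure-time adjustment of claim 9 might look as follows in Python with OpenCV, where the sparse displacement field of the matched points stands in for the optical flow field. ORB features, brute-force Hamming matching, the 2-pixel displacement threshold, and the camera.set_exposure_us call are illustrative stand-ins only; the claims do not name a specific feature detector, matcher, threshold, or camera control API.

```python
# Illustrative sketch only; detector, matcher, thresholds and the camera API are assumptions.
import cv2
import numpy as np


def sparse_motion_detection(prev_overlap, curr_overlap, displacement_threshold=2.0):
    """Match feature points between two overlapping areas and flag local motion."""
    orb = cv2.ORB_create(nfeatures=500)
    kp1, des1 = orb.detectAndCompute(prev_overlap, None)   # first feature point data
    kp2, des2 = orb.detectAndCompute(curr_overlap, None)   # second feature point data
    if des1 is None or des2 is None:
        return {"local_motion": False}

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)                    # matched feature point data

    moving_points = []
    for m in matches:
        p1 = np.array(kp1[m.queryIdx].pt)                  # position in the (t-1)-th overlap
        p2 = np.array(kp2[m.trainIdx].pt)                  # position in the t-th overlap
        if np.linalg.norm(p2 - p1) > displacement_threshold:
            moving_points.append(p2)

    if not moving_points:
        return {"local_motion": False}                     # "second motion detection result"
    x, y, w, h = cv2.boundingRect(np.array(moving_points, dtype=np.float32))
    return {"local_motion": True,                          # "first motion detection result"
            "motion_area": (int(x), int(y), int(w), int(h))}


def adjust_exposure(camera, result, short_us=4000, long_us=16000):
    """Claim-9-style hook: shorten exposure when local motion is present, extend it otherwise.

    camera.set_exposure_us is a hypothetical device interface, not part of the patent.
    """
    camera.set_exposure_us(short_us if result["local_motion"] else long_us)
```

Compared with a dense method such as the Farneback flow sketched earlier, this sparse feature-point variant trades robustness in textureless regions against lower computational cost, which matters on a head-mounted device.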
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410316474.0A CN118196135A (en) | 2024-03-19 | 2024-03-19 | Image processing method, apparatus, storage medium, device, and program product |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118196135A (en) | 2024-06-14
Family
ID=91409617
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410316474.0A Pending CN118196135A (en) | 2024-03-19 | 2024-03-19 | Image processing method, apparatus, storage medium, device, and program product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118196135A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118502502A (en) * | 2024-07-16 | 2024-08-16 | 云南电投绿能科技有限公司 | Vibration control method, device and equipment of wind turbine and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11210838B2 (en) | Fusing, texturing, and rendering views of dynamic three-dimensional models | |
US11238606B2 (en) | Method and system for performing simultaneous localization and mapping using convolutional image transformation | |
CN109615703B (en) | Augmented reality image display method, device and equipment | |
JP7007348B2 (en) | Image processing equipment | |
US9164723B2 (en) | Virtual lens-rendering for augmented reality lens | |
WO2014105646A1 (en) | Low-latency fusing of color image data in a color sequential display system | |
US10171785B2 (en) | Color balancing based on reference points | |
CN109510975B (en) | Video image extraction method, device and system | |
CN112017222A (en) | Video panorama stitching and three-dimensional fusion method and device | |
CN109255841A (en) | AR image presentation method, device, terminal and storage medium | |
US11941729B2 (en) | Image processing apparatus, method for controlling image processing apparatus, and storage medium | |
CN118196135A (en) | Image processing method, apparatus, storage medium, device, and program product | |
CN107293162A (en) | Move teaching auxiliary and device, terminal device | |
WO2015185537A1 (en) | Method and device for reconstruction the face of a user wearing a head mounted display | |
US9897806B2 (en) | Generation of three-dimensional imagery to supplement existing content | |
CN113552942A (en) | Method and equipment for displaying virtual object based on illumination intensity | |
CN115039137A (en) | Method for rendering virtual objects based on luminance estimation, method for training a neural network, and related product | |
JP2022093262A (en) | Image processing apparatus, method for controlling image processing apparatus, and program | |
EP3716217A1 (en) | Techniques for detection of real-time occlusion | |
US20240203078A1 (en) | Methods and systems for refining a 3d representation of a subject | |
CN115187491B (en) | Image denoising processing method, image filtering processing method and device | |
CN118898680A (en) | Object model construction method and device, electronic equipment and storage medium | |
CN116993642A (en) | Panoramic video fusion method, system and medium for building engineering temporary construction engineering | |
Herath et al. | Unconstrained Segue Navigation for an Immersive Virtual Reality Experience | |
Cortes et al. | Depth Assisted Composition of Synthetic and Real 3D Scenes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |