CN113538522A - Instrument vision tracking method for laparoscopic minimally invasive surgery - Google Patents
- Publication number
- CN113538522A (application CN202110922513.8A)
- Authority
- CN
- China
- Prior art keywords
- image
- point
- target
- frame
- target tracking
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/136—Segmentation; Edge detection involving thresholding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/80—Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H30/00—ICT specially adapted for the handling or processing of medical images
- G16H30/20—ICT specially adapted for the handling or processing of medical images for handling medical images, e.g. DICOM, HL7 or PACS
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
Abstract
The invention discloses a surgical instrument visual tracking method for laparoscopic minimally invasive surgery. Based on a deep learning method and without marking the surgical instruments before surgery, the surgical instrument region is detected or segmented from the surgical video stream and local features are extracted, which greatly improves the efficiency of instrument detection and positioning and the accuracy of feature extraction. At the same time, image enhancement is applied to the region of interest using an image filtering algorithm and region screening, improving the identification precision of the tracked target point. The 2D-3D conversion relation between the surgical instrument and the target area is then calculated through modeling and mathematical expression to determine the depth perception information of the instrument end effector and the target area. Finally, a real-time tracking program for the target point is realized using a mask algorithm. A data set is established under clinical guidance to train and test the surgical instrument detection model, so that automatic detection and tracking of laparoscopic surgical instruments is realized and the model has high practicability.
Description
Technical Field
The invention relates to the technical field of laparoscopic minimally invasive surgery, in particular to an instrument vision tracking method for the laparoscopic minimally invasive surgery.
Background
Minimally invasive surgery (MIS) is a recent surgical technique in which the doctor no longer operates directly on the patient's lesion; instead, instruments are inserted into the patient's abdominal cavity and the video stream captured by a laparoscope is projected onto a display to observe the intraoperative process. Compared with traditional open surgery, minimally invasive laparoscopic surgery offers smaller wounds, faster recovery and simpler postoperative care, and it is the development trend of current thoracic and abdominal surgery. However, operating indirectly causes the doctor to lose depth perception of the surgical instruments and the lesion area, so such surgery remains very challenging for the surgeon.
In laparoscopic minimally invasive surgery, the relative pose of the surgical instrument and the target object or area to be treated is determined under laparoscopic vision, after which the surgical robot completes the corresponding posture adjustment of the surgical instrument and the intraoperative treatment under the guidance of the doctor.
The main challenge of such surgery is how to compensate for the loss of depth perception of surgical instruments under a two-dimensional laparoscope. A traditional laparoscope cannot provide depth and direction perception of the surgical instrument, so surgeons often rely on clinical experience, which leads to low surgical efficiency and frequent postoperative complications. More advanced surgical robotic systems are equipped with a three-dimensional laparoscope and 3D glasses; the three-dimensional laparoscope offers better depth perception and recognition of anatomical detail, but its overall image quality is poorer, it lacks sufficient feasibility analysis and clinical data verification, and it is less flexible, especially during angle conversion, whereas a high-definition two-dimensional laparoscope is clinically more feasible and reliable. The loss of depth perception under two-dimensional laparoscopy is therefore an urgent problem to be solved.
The second challenge of such surgery is how to obtain the relative pose between the surgical instrument and the target object and how to position the surgical instrument in real time; realizing real-time positioning, tracking and navigation of the surgical instrument is very important. In traditional laparoscopic minimally invasive surgery, methods such as ultrasonic, electromagnetic and optical positioning are used to locate surgical instruments. For example, patent CN201910908481.9 proposes an optical-magnetic fusion surgical navigation method that combines electromagnetic and optical positioning to position and navigate the surgery. However, electromagnetic positioning is easily affected by magnetic fields and places high requirements on the surgical environment, and optical positioning requires markers to be attached to the surgical instruments, which means that both standard and common surgical instruments must be modified.
A third challenge is how to realize real-time visual tracking of surgical instruments, which supports scene and surgical-stage analysis and intraoperative monitoring in clinical surgery. Delay or loss of instrument tracking easily causes the doctor to deviate from the ideal path or trajectory plan, which affects postoperative effect evaluation and may even increase postoperative complications.
Disclosure of Invention
The invention aims to provide an instrument visual tracking method for laparoscopic minimally invasive surgery that solves one or more problems of existing laparoscopic minimally invasive surgery, such as large errors in perceiving the depth, direction and relative pose of the surgical instrument and the target area, low instrument detection efficiency, insufficient positioning precision, and delayed, unreliable real-time tracking.
In order to realize the task, the invention adopts the following technical scheme:
an instrument vision tracking method for laparoscopic minimally invasive surgery, comprising the following steps of:
performing frame-by-frame processing on the acquired laparoscopic surgery video stream to segment the target area of the surgical instrument and obtain a gray-level image containing the target area; performing image enhancement preprocessing on the gray-level image, including spatial filtering of the gray-level image, then applying adaptive threshold processing to the filtered image to obtain a binary image containing the target region, and then performing Euclidean distance conversion on the binary image to obtain a distance image; determining the points of maximum pixel value in the surgical instrument shaft portion and the end effector portion based on the distance image so as to determine the position of the target tracking point;
for the array of maximum gray-level points of each connected region in the distance image, the point with the maximum pixel value in the instrument shaft portion is recorded as point M, the point with the maximum pixel value in the end effector portion is recorded as point N, and the line L connecting point M and point N is the axis of the surgical portion of the surgical instrument; on the axis L, if a point is located in the shaft portion and its next point is located in the end effector portion, that point is taken as the target tracking point S, and finally the pixel coordinate information of the target tracking point S is output;
calibrating the laparoscopic camera, and calculating the X-axis and Y-axis coordinates of the target tracking point in the world coordinate system in each frame of image by using the intrinsic and extrinsic parameter matrices obtained after calibration; acquiring the depth information of the target tracking point in the image by using a monocular visual depth estimation method based on deep learning, thereby obtaining the Z-axis coordinate of the target tracking point in the world coordinate system and thus its complete coordinates in the world coordinate system; the monocular visual depth estimation method extracts depth information by establishing a prediction model, wherein the prediction model comprises a feature extraction network and a fusion prediction network; the feature extraction network acquires local features of the image with its low-level layers and semantic features of the image with its high-level layers; the feature extraction network comprises first to fifth encoders of sequentially increasing network level, each encoder comprising several sequentially connected convolution layers with a maximum pooling layer connected after the last convolution layer; the fusion prediction network comprises a skip layer, a batch normalization layer and a deconvolution layer connected in sequence; starting from the feature image output by the highest layer of the feature extraction network, namely the maximum pooling layer of the fifth encoder, the feature image is processed by a skip layer, a batch normalization layer and a deconvolution layer of the fusion prediction network and then fused at a skip layer with the features output by the previous encoder; the fused result is again processed by the fusion prediction network and fused at a skip layer with the feature image output by the preceding encoder, and so on, until the result finally passes through a convolution layer and is output as a depth image;
initializing the target tracking point: the position of the target tracking point in the first frame of the video stream is the initialization position of the tracking task, and the next frame of the video stream is used as the input for target extraction and target tracking point positioning; the target tracking point is then located frame by frame: first, an elliptical mask is introduced for the target area to obtain the processing area, with an ellipse fitted by the least squares method to the edge of the target area of the previous frame; second, keeping the ellipse at the same center, the elliptical mask of the current frame is generated by scaling the ellipse; finally, the elliptical mask is used to obtain the target area of the current frame;
after the target area of the current frame is obtained, if the target area does not meet the preset condition, the current frame is considered to be failed to be processed, and the current frame is initialized and then is processed again;
after the target area is found, fitting an ellipse to the edge of the target area in the image by using a least square error method; taking the target tracking point as the central point of the elliptical mask, and simultaneously using the three-dimensional coordinate of the target tracking point in a world coordinate system to guide the surgical robot; the mask ellipse for the current frame will also be used to generate the mask for the next frame.
Further, the method further comprises:
using a multi-threaded parallel processing approach, the first thread acquires frames from the laparoscope and starts executing the instrument tracking program: after the image is obtained, determining pixel coordinate information of a target tracking point and three-dimensional coordinate data of the target tracking point, and after the target tracking point is determined, sending the three-dimensional coordinate data to a second thread by a first thread and starting to process the next frame; the second thread is started by receiving the three-dimensional coordinate data of the target tracking point from the first thread and receiving the position data of the robot joint from the third thread, and according to the information, the second thread generates a path and sends a corresponding motion command to an actuator of the robot;
finally, the third thread continuously collects position data from the robot position sensor and keeps the variables up to date for the second thread to use; after detecting the target tracking point in the next frame, the first thread immediately tries to start the second thread again; however, if the process of the last path generation has not completed, the new process will not start and the first thread continues with the next frame.
Further, a lightweight network LinkNet is used as a segmentation model to segment each frame of image in the video stream and extract a target region; the segmentation model is trained on an open dataset using ResNet18 as an encoder; in order to train the segmentation model, the output of the LinkNet network is evaluated and the back propagation is guided by two loss functions, namely cross-entropy loss and Jaccard, wherein the expression of the loss function is as follows:
F=αE-(1-α)ln J
wherein alpha is a weight parameter and takes a value from 0 to 1;
where N denotes the total number of pixels in the laparoscopic image, p_i represents the predicted probability that the i-th pixel belongs to the target region, and q_i represents the true value of the i-th pixel.
Further, the ellipse is defined as follows:
Ax² + Bxy + Cy² + Dx + Ey + F = 0
min ‖MᵀX‖² = MᵀX XᵀM
where M = [A, B, C, D, E, F]ᵀ is the parameter matrix of the fitted ellipse, X = [u², uv, v², u, v, 1]ᵀ is the input matrix of the fitted ellipse built from a pixel point (u, v) ∈ P, P = {(u₁, v₁), (u₂, v₂), …, (u_N, v_N)} is the set of all pixel points on the edge of the target area, and N denotes the number of pixel points in the set.
Further, the preset conditions are as follows:
the area of the target region is larger than a minimum limit, wherein the minimum limit is 5% of the binary image;
the boundary of the tail end of the target area is overlapped with the boundary of the original frame image;
if any one of the conditions is not satisfied, the current image frame is considered to have failed processing.
Compared with the prior art, the invention has the following technical characteristics:
1. the invention processes the laparoscope video frame by frame, adopts the lightweight network LinkNet based on deep learning to segment each frame of image in real time on the premise of ensuring high accuracy, quickly and accurately extracts a target area containing surgical instruments, and can improve the segmentation accuracy to the leading level of the industry through preoperative model training.
2. According to the method, a deep learning method is adopted in an image segmentation stage, two loss functions of cross entropy loss and Jaccard are fused in a model training stage to evaluate the output of the LinkNet network and guide back propagation, so that the error and the output delay are greatly reduced, the segmentation accuracy can be further improved, and the efficiency is far higher than that of the traditional visual segmentation.
3. The series of image enhancement steps applied after the segmented image is obtained, such as adaptive threshold processing and Euclidean distance conversion, follows well-established approaches to target region segmentation and object extraction, and yields the information of all pixel points of the target region containing the surgical instrument.
4. The determination of the target tracking point is crucial to the accuracy and success rate of marker-free real-time visual tracking of surgical instruments, to surgical trajectory planning, and to the guidance and analysis of surgical actions. The target tracking point is determined using the line connecting the two central points of the end effector and the shaft portion; determining the position of the target tracking point in this way is an innovation in current laparoscopic minimally invasive surgery.
5. The invention adopts the monocular camera to calibrate and determine the internal reference and external reference coefficients of the laparoscopic camera, and provides a 2D-3D coordinate conversion formula, so that X-axis and Y-axis numerical values of all pixel points in a target area can be calculated.
6. The invention provides a monocular visual depth estimation method based on deep learning and constructs a multi-feature fusion network architecture model based on a fully convolutional network (FCN), which is currently a superior scheme for solving the problem of depth perception loss and is also the core advantage of the invention.
7. The invention provides a depth perception estimation evaluation index, and provides theoretical and numerical basis for operation effect evaluation and analysis.
8. The target tracking point is positioned frame by fitting an ellipse mask through a least square method, so that accurate positioning of surgical instruments and real-time capture of the target tracking point can be guaranteed, the calculated amount can be greatly reduced, and the system smoothness is improved.
9. The tracking program of the invention adopts a multithreading parallel processing method, thereby accelerating the processing speed, realizing real-time operation and reducing the system delay.
Drawings
FIG. 1 is a schematic view of a LinkNet network structure;
FIG. 2 is a schematic diagram of classification of target pixel points, with interior points on the left and isolated points on the right;
FIG. 3 is a schematic view of a surgical instrument end effector, shaft, and hand-held portion;
FIG. 4 is a schematic diagram of a 2D-3D coordinate transformation process;
FIG. 5 is a schematic diagram of a multi-feature converged network architecture;
FIG. 6 is a schematic diagram of a feature extraction network and a converged prediction network architecture;
FIG. 7 is an exemplary graph of fitting an ellipse mask edge (left) to a target tracking point (right);
fig. 8 is a flowchart of a surgical instrument detection, location and tracking method.
Detailed Description
The method is based on deep learning: without marking the surgical instrument before the operation, the surgical instrument region is detected or segmented from the surgical video stream and local features are extracted, which greatly improves the efficiency of instrument detection and positioning and the accuracy of feature extraction. At the same time, image enhancement is applied to the region of interest using an image filtering algorithm and region screening, improving the identification precision of the tracked target point. The 2D-3D conversion relation between the surgical instrument and the target area is then calculated through modeling and mathematical expression to determine the depth perception information of the instrument end effector and the target area. Finally, a real-time tracking program for the target point is realized using a mask algorithm. A data set is established under clinical guidance to train and test the surgical instrument detection model, so that automatic detection and tracking of laparoscopic surgical instruments is realized and the model has high practicability.
Referring to the attached drawings, the instrument vision tracking method for the laparoscopic minimally invasive surgery comprises the following steps:
1.1 Laparoscopic video acquisition and reading: the system wraps the OpenCV (Open Source Computer Vision) library interface; the VideoCapture function is called to read the laparoscopic video file, and once the video object is obtained, the get method can be used to obtain the relevant video information: width, height, total number of frames, and frame rate of the original video.
1.2 Frame-by-frame video processing: the video is read frame by frame; the first return value indicates whether the read succeeded and the second return value is the current frame. After a frame is read, the video iterates to the next frame, and calling the read method again returns that next frame, so the video can be read out frame by frame in a while loop.
1.3 Rewriting the video: the video frame is passed in and any necessary debugging information is added. The frame previously read with the read function is of Mat type and can therefore be processed further directly.
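A minimal Python/OpenCV sketch of steps 1.1 to 1.3 (the file names and the overlaid debug text are illustrative assumptions, not values from the patent):

```python
import cv2

cap = cv2.VideoCapture("laparoscopy.mp4")              # 1.1 open the laparoscopic video
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)
total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

writer = cv2.VideoWriter("annotated.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                         fps, (width, height))         # 1.3 rewrite the processed frames
idx = 0
while True:                                            # 1.2 frame-by-frame loop
    ok, frame = cap.read()                             # ok: success flag, frame: current Mat
    if not ok:
        break
    cv2.putText(frame, f"frame {idx}/{total}", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)  # debugging information
    writer.write(frame)
    idx += 1

cap.release()
writer.release()
```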
1.4 Image segmentation and target extraction: each pixel of every laparoscopic frame in the video stream is classified into foreground (surgical instrument area) or background by deep-learning segmentation. Although current segmentation algorithms are highly accurate, their parameter and operation counts are very large, which strongly affects real-time, delay-free tracking of surgical instruments. The invention therefore uses the lightweight network LinkNet as the segmentation model, which achieves high segmentation accuracy without affecting the processing time. Its structure is similar to U-Net, a typical encoder-decoder architecture: the network passes spatial information directly from the encoder to the decoder, which improves accuracy and reduces processing time to a certain extent. In this way, the information lost by different layers in the encoding part is preserved, while no additional parameters or operations are added when the lost information is relearned.
1.5 Training the segmentation model: the segmentation model is trained on an open data set (ImageNet) using ResNet18 as the encoder; to train it, the output of the LinkNet network is evaluated with two loss functions, cross-entropy loss and Jaccard, which guide backpropagation. The symbols in these two loss functions are defined as follows:
N denotes the total number of pixels in the laparoscopic image, p_i denotes the predicted probability that the i-th pixel belongs to the foreground (surgical instrument region), and q_i denotes the ground-truth value of the i-th pixel (q_i = 1 means the i-th pixel belongs to the foreground, q_i = 0 means it belongs to the background).
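The formula images for the two losses are not reproduced in this text; assuming the standard binary cross-entropy and soft Jaccard forms for the symbols just defined, they read:

$$E = -\frac{1}{N}\sum_{i=1}^{N}\bigl[q_i \ln p_i + (1 - q_i)\ln(1 - p_i)\bigr], \qquad J = \frac{\sum_{i=1}^{N} p_i q_i}{\sum_{i=1}^{N} \bigl(p_i + q_i - p_i q_i\bigr)}$$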
The invention fuses two loss functions and defines a new loss function expression:
F=αE-(1-α)ln J
where α is a weighting parameter, and takes a value from 0 to 1.
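A brief sketch of this fused loss in PyTorch, assuming the standard loss forms above (an illustration, not the patent's reference implementation):

```python
import torch

def fused_loss(p, q, alpha=0.5, eps=1e-7):
    """F = alpha * E - (1 - alpha) * ln(J): binary cross-entropy E fused with the
    logarithm of the soft Jaccard index J. p: predicted foreground probabilities,
    q: ground-truth labels in {0, 1}; alpha is the weight parameter in [0, 1]."""
    E = -(q * torch.log(p + eps) + (1 - q) * torch.log(1 - p + eps)).mean()
    intersection = (p * q).sum()
    union = p.sum() + q.sum() - intersection
    J = (intersection + eps) / (union + eps)
    return alpha * E - (1 - alpha) * torch.log(J)
```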
Step 2, each frame of image is divided into a foreground (surgical instrument area) and a background through step 1, that is, the laparoscopic image is processed into a gray image containing a target area, and then some necessary image enhancement preprocessing is performed to enhance the details of the image so as to effectively screen out the target area (surgical instrument area), and the detailed process is as follows:
2.1 Spatial filtering: small, isolated noise regions are removed by a filtering algorithm and the target region (surgical instrument region) is enhanced. Among image smoothing methods, median filtering gives the best result here: it is a nonlinear image processing method that preserves boundary information while denoising. The invention calls the OpenCV medianBlur() function to perform median filtering with a kernel size of 5 × 5. After filtering, a denoised, smoothed gray-level image is obtained.
2.2 Adaptive thresholding: the gray-level image is converted into a binary image, and the target is obtained by region screening. The adaptive threshold method computes a local threshold according to the brightness distribution of different areas of the image and is implemented with the OpenCV API function adaptiveThreshold(). In addition to the OpenCV API function, the invention uses a median-based rule to determine the threshold, which gives a better result than conventional threshold selection. After thresholding, the gray-level image becomes a binary image divided into target and background: target pixels are white (value 255) and background pixels are black (value 0). A binary image containing the target region can then be obtained by screening.
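A Python/OpenCV sketch of steps 2.1 and 2.2 (the block size and offset of the adaptive threshold, and the use of the built-in mean rule rather than the patent's median-based rule, are assumptions for illustration):

```python
import cv2

gray = cv2.imread("frame_gray.png", cv2.IMREAD_GRAYSCALE)   # gray-level image from the segmentation step
smoothed = cv2.medianBlur(gray, 5)                           # 2.1 median filtering, 5x5 kernel
binary = cv2.adaptiveThreshold(smoothed, 255,
                               cv2.ADAPTIVE_THRESH_MEAN_C,   # local threshold per neighbourhood
                               cv2.THRESH_BINARY,
                               31, 5)                        # block size 31, offset 5 (illustrative)
```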
2.3 Euclidean distance transformation: and aiming at the binary image, obtaining a distance image through Euclidean distance transformation, and determining the point with the maximum pixel value in the shaft part and the end effector part of the surgical instrument so as to calculate the position of the target tracking point.
The Euclidean distance transform converts a binary image into a gray-scale image in which the gray level of each pixel of a connected domain is related to the shortest distance from that pixel to a background pixel. The gray level indicates the maximum number of distinct gray values in the image; the larger the gray level, the larger the brightness range of the image. The farther a foreground pixel is from the background, the larger the distance value that replaces its pixel value and the brighter that point of the newly generated image becomes. In this way the point with the maximum gray level in each connected region of the foreground can be obtained; the set of these maximum-gray-level points is the skeleton of the target image, i.e. the set of pixels in the central part of the target, and the gray level reflects the relationship between the background pixels and the boundary of the target image. This is done to facilitate the subsequent determination of the target tracking point.
The Euclidean distance conversion comprises the following specific steps:
classifying target pixel points in an image into internal points, boundary points and isolated points;
compute all interior points and non-interior points of the image and record the two point sets as S1 and S2, respectively;
for each interior point (x, y) in S1, use the distance formula disf() to compute its minimum distance to S2; these minimum distances form the set S3;
compute the maximum value Max and the minimum value Min of S3;
compute the transformed gray value G for each interior point according to the following formula, where S3(x, y) denotes the minimum distance from the interior point (x, y) of S1 to S2:
G(x,y)=255×|S3(x,y)-Min|/|Max-Min|
keeping the isolated points unchanged, the Euclidean distance transform is executed. The Euclidean distance is computed with the formula below, where b[i] is a background point with pixel value 0, x[i] is an input coordinate point, y_i is the minimum distance from x[i] to the background points, and n is the dimension.
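The distance formula itself is not reproduced in the text; in the standard n-dimensional form implied by these symbols it is:

$$y_i = \min_{j}\sqrt{\sum_{k=1}^{n}\bigl(x[i]_k - b[j]_k\bigr)^2}$$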
The distance between each non-zero pixel and the nearest zero pixel is computed by calling the OpenCV distance transform API function distanceTransform(), the skeleton of the target image (the set of maximum-gray-level points of each connected region in the foreground) is obtained, and a distance image array is returned. The result of the Euclidean distance transform is normalized with normalize(), the skeleton of the target image is displayed, and the array matrix of the maximum gray-level points of each connected region is output.
The set of the points with the maximum gray level obtained by the Euclidean distance transformation is the set of the pixels at the central part of the target area (surgical instrument area), so that the point with the maximum pixel value in the surgical instrument area can be easily extracted.
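A Python/OpenCV sketch of step 2.3, continuing from the binary image of step 2.2 (the variable names are illustrative):

```python
import cv2
import numpy as np

dist = cv2.distanceTransform(binary, cv2.DIST_L2, 5)          # Euclidean distance to the nearest zero pixel
dist_norm = cv2.normalize(dist, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

# maximum-gray-level point of each connected region of the foreground (the skeleton maxima)
n_labels, labels = cv2.connectedComponents(binary)
max_points = []
for lab in range(1, n_labels):                                # label 0 is the background
    region = np.where(labels == lab, dist_norm, 0)
    max_points.append(np.unravel_index(np.argmax(region), region.shape))
```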
And 3, determining the position of the target tracking point.
The surgical portion of the instrument can be divided into an end effector and a shaft, joined by an articulated joint, as shown in FIG. 3. During surgery the tip of the instrument moves frequently and is often occluded by organ tissue, so defining the instrument tip as the target tracking point easily causes large positioning errors or even tracking failure; in addition, end effectors come in many types, which makes the tip impractical as a target tracking point. We therefore define the surgical instrument joint, rather than the instrument tip, as the target tracking point.
The distance image is obtained after the Euclidean distance transform, and the array of maximum gray-level points of each connected region is output. The point with the maximum pixel value in the shaft portion is recorded as point M, the point with the maximum pixel value in the end effector portion is recorded as point N, and the line L connecting point M and point N is the axis of the surgical portion of the instrument. On the axis L, if a point is located in the shaft portion and its next point (in the direction from M to N) is located in the end effector portion, that point is taken as the target tracking point S, and finally the pixel coordinate information of the target tracking point S is output.
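A sketch of this step in Python, assuming the shaft and end-effector region masks are available from the segmentation step (the masks and the sampling density are assumptions for illustration):

```python
import numpy as np

def find_tracking_point(dist_img, shaft_mask, effector_mask, num_samples=200):
    """M: brightest distance-image point in the shaft region; N: brightest point in the
    end-effector region; S: last shaft point on the segment M->N before the effector."""
    M = np.unravel_index(np.argmax(dist_img * (shaft_mask > 0)), dist_img.shape)
    N = np.unravel_index(np.argmax(dist_img * (effector_mask > 0)), dist_img.shape)
    prev = M
    for t in np.linspace(0.0, 1.0, num_samples):              # walk along the axis L from M to N
        r = int(round(M[0] + t * (N[0] - M[0])))
        c = int(round(M[1] + t * (N[1] - M[1])))
        if effector_mask[r, c] and shaft_mask[prev]:
            return prev                                       # pixel coordinates of tracking point S
        prev = (r, c)
    return None
```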
And 4, calibrating the laparoscopic camera.
The purpose of camera calibration is to acquire the intrinsic and extrinsic matrices of the laparoscopic camera (the rotation and translation matrices of each calibration image are obtained at the same time); with the intrinsic and extrinsic coefficients, images subsequently captured by the camera can be corrected to obtain images with relatively little distortion.
In the monocular laparoscopic camera model, a view projects points in three-dimensional space onto the image plane through a perspective transformation. Let the coordinates of a point in the world coordinate system be Pw = [Xw, Yw, Zw]ᵀ; after a rigid transformation (rotation and translation) its coordinates in the camera coordinate system are Pc = [Xc, Yc, Zc]ᵀ; after projection imaging its coordinates in the image coordinate system are p = [x, y]ᵀ, and its coordinates in the pixel coordinate system are [u, v]ᵀ. The projection formula from the world coordinate system to the pixel coordinate system is given below.
In that projection, (u0, v0) is the reference point (usually near the center of the image) and fx, fy are the focal lengths in pixels; R is a 3 × 3 orthogonal rotation matrix, T is a three-dimensional translation vector, and A is the camera intrinsic matrix.
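The original formula images are not reproduced in this text; in the standard pinhole-camera form implied by these definitions, the projection and the intrinsic matrix are:

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = A\,[R \mid T] \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}, \qquad A = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}$$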
the intrinsic parameter matrix is independent of the view of the scene and once calculated can be reused (as long as the focal length is fixed), and the focal length of the hand-held laparoscope in laparoscopic surgery is planned before the surgery and can be treated as a fixed focal length. The rotation-translation matrix [ R | T ] is called an external reference matrix, and is used to describe the motion of the camera relative to a fixed scene, or conversely, the rigid motion of objects around the camera. I.e. [ R | T ] transforms the coordinates of the point (X, Y, Z) to a coordinate system that is fixed relative to the camera.
A calibration process:
extracting angular points;
extracting sub-pixel angular points;
drawing an angular point;
calibrating parameters;
evaluating the calibration result;
sixthly, checking the calibration effect, and correcting the chessboard diagram by using the calibration result.
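A Python/OpenCV sketch of the calibration steps above (the chessboard pattern size and image paths are illustrative assumptions):

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)                                   # inner corners per row/column (assumed)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):              # chessboard calibration images (assumed path)
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)                # 1. extract corners
    if not found:
        continue
    corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)  # 2. sub-pixel corners
    vis = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
    cv2.drawChessboardCorners(vis, pattern, corners, found)                  # 3. draw corners
    obj_points.append(objp)
    img_points.append(corners)

# 4./5. calibrate and report the RMS re-projection error
rms, A, dist, rvecs, tvecs = cv2.calibrateCamera(obj_points, img_points,
                                                 gray.shape[::-1], None, None)
print("intrinsic matrix A:\n", A, "\nRMS re-projection error:", rms)

undistorted = cv2.undistort(gray, A, dist)         # 6. check the calibration on a chessboard image
```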
An internal reference matrix A and an external reference matrix, namely a rotation-translation matrix [ R | t ], of the laparoscopic camera can be obtained through calibrating the camera, and the internal reference matrix A and the external reference matrix are used for calculating the three-dimensional coordinates of the target tracking point.
And 5, calculating the three-dimensional coordinates of the target tracking points.
Real-time tracking of the target tracking point and control of the surgical robot require the three-dimensional coordinate data of the target tracking point, i.e. the world coordinates (Xw, Yw, Zw) of the target tracking point (u, v) in a given frame. The values of Xw and Yw can be obtained through the coordinate conversion relation between the laparoscopic camera and the image, while the Z axis represents the depth of the target tracking point; since the video stream acquired by a monocular laparoscope loses depth perception information, Zw cannot be obtained directly. The solution for the values of Xw and Yw is given below.
Establishing a 2D-3D conversion relation between the laparoscopic image and the surgical instrument:
f is the physical focal length of the camera, and fx, fy are the focal lengths in pixels, so if an image from the camera is up-sampled or down-sampled by some factor, all of these parameters (fx, fy, u0 and v0) are scaled (multiplied or divided) by the same factor. The conversion involves a translation-and-scaling matrix together with the rotation-translation matrix [R | T], i.e. the homogeneous form of the extrinsic parameter matrix. Zc is the depth value of the point Pc = [Xc, Yc, Zc]ᵀ in the camera coordinate system and is obtained through the image-coordinate-system to camera-coordinate-system transformation.
The rotation-translation matrix [R | T] and the intrinsic matrix A are obtained in step 4; the pixel coordinates of the target tracking point are read through OpenCV, and the image coordinates, camera coordinates and world coordinates are obtained in turn by the inverse operation, giving the Xw and Yw values of the target tracking point in the world coordinate system.
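A minimal sketch of this back-projection, assuming the depth Zc of the point in the camera frame is known (the function and variable names are assumptions):

```python
import numpy as np

def pixel_to_world(u, v, Zc, A, R, T):
    """Invert the pinhole projection of step 5: pixel (u, v) with camera-frame depth Zc
    is mapped back to world coordinates using the intrinsic matrix A and the extrinsic
    rotation R (3x3) and translation T (3,)."""
    pixel = np.array([u, v, 1.0])
    Pc = Zc * np.linalg.inv(A) @ pixel    # pixel coordinates -> camera coordinates
    Pw = R.T @ (Pc - T)                   # camera coordinates -> world coordinates (R is orthogonal)
    return Pw                             # [Xw, Yw, Zw]
```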
And 6, determining depth perception information of the surgical instrument, namely solving a target tracking point Zw value.
After the video stream obtained from the two-dimensional laparoscope is transmitted to a display, some important depth perception and other fusion information are lost, which can have a serious influence on the subsequent accurate guidance of the doctor. How to realize the depth perception of the surgical instrument by utilizing the monocular camera is particularly important in clinical practice, the invention utilizes a monocular vision depth estimation method based on the deep learning, aims to solve the problem of depth perception loss by utilizing the monocular laparoscope with low cost and high practicability, and comprises the following specific processes:
6.1 Multi-feature fusion network architecture based on a fully convolutional network (FCN).
Reading the gray level images processed in the steps 2.1 and 2.2, inputting the gray level images into a prediction model, generating multilayer characteristic images at different levels of the network through a characteristic extraction network, gradually fusing the multilayer characteristic images through a fusion prediction network, and finally generating a predicted depth image. Depth information of the target area (surgical instrument) can be obtained.
The feature extraction network comprises first to fifth encoders with sequentially increased network levels, each encoder comprises a plurality of convolution layers which are sequentially connected, and a maximum pooling layer is connected to the last convolution layer; the number of convolution layers in each encoder and parameters such as the size and the step length of the convolution kernel are adjusted according to actual conditions; in this embodiment, the fifth encoder has three sequentially connected 3 × 3 convolutional layers, and finally, the largest pooling layer. The feature extraction network extracts feature images generated after feature extraction is carried out on input images through each encoder, the feature images generated are of multi-scale nature due to different perception ranges of each layer of the network on the images, and full and rich feature expressions are obtained through a feature extraction part from local features of the images obtained through a low-level network to semantic features of the images obtained through a high-level network.
The fusion prediction network comprises a jump layer, a batch normalization layer and a deconvolution layer which are connected in sequence; and starting from a feature image output by the highest layer of the feature extraction network, namely the maximum pooling layer of the fifth encoder, processing the feature image by a skip layer, a batch normalization layer and a deconvolution layer of a fusion prediction network, fusing the feature image with the features output by the previous encoder at a skip layer, processing the fused feature image by the fusion prediction network, fusing the fused feature image with the feature image output by the previous encoder at a skip layer, and so on, and finally outputting a depth image through a convolution layer.
In this network, the feature image output from the highest layer of the network, namely the maximum pooling layer of the fifth encoder (see FIG. 6), is merged step by step with the feature images output by the earlier pooling layers. In this process, the higher-level feature image passes through a skip (jump) layer, i.e. a convolution layer, a batch normalization layer and a deconvolution layer, and is then fused with the previous-level feature image, which passes through its own skip layer. The skip layer is a 1 × 1 convolution layer that keeps the channel numbers of the two fused feature images consistent; the batch normalization layer accelerates network training and reduces tedious parameter tuning; and the deconvolution layer is a 4 × 4 convolution layer that gradually enlarges the output image while keeping the spatial sizes of the two fused feature-image layers consistent.
For example, after the feature image a of the 5 th encoder pooling layer is output, the feature image a passes through a skip layer, a batch normalization layer and an deconvolution layer of the fusion prediction network, the feature image a and the feature image B output by the 4 th encoder pooling layer are fused at one skip layer, the fused feature image C continues to pass through a skip layer, a batch normalization layer and a deconvolution layer of the fusion prediction network, and the feature image D is obtained and then fused with the feature image E of the 3 rd encoder pooling layer at one skip layer until the feature image a is fused to the bottommost feature image of the network.
The fusion process is to perform matrix summation operation on the corresponding feature images according to the channel dimension. And gradually fusing the subsequent layer and the previous layer, gradually reducing the number of characteristic image channels of the image, gradually increasing the size of the fused image, and finally obtaining an output depth image by taking the convolution layer as linear prediction.
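A PyTorch sketch of one fusion step described above (channel numbers and feature sizes are illustrative assumptions, not the patent's exact values):

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Higher-level feature: 1x1 skip conv -> batch norm -> 4x4 deconvolution (doubles the
    spatial size); lateral encoder feature: 1x1 skip conv; the two results are then summed."""
    def __init__(self, high_ch, lateral_ch, out_ch):
        super().__init__()
        self.skip_high = nn.Conv2d(high_ch, out_ch, kernel_size=1)        # skip (jump) layer
        self.bn = nn.BatchNorm2d(out_ch)                                   # batch normalization layer
        self.deconv = nn.ConvTranspose2d(out_ch, out_ch, kernel_size=4,
                                         stride=2, padding=1)              # 4x4 deconvolution layer
        self.skip_lateral = nn.Conv2d(lateral_ch, out_ch, kernel_size=1)   # skip layer for the encoder feature

    def forward(self, high, lateral):
        x = self.deconv(self.bn(self.skip_high(high)))   # upsample the deeper feature image
        return x + self.skip_lateral(lateral)            # channel-wise matrix summation (fusion)

# example: fuse the fifth encoder's pooled output with the fourth encoder's pooled output
f5 = torch.randn(1, 512, 8, 10)       # assumed shape at 1/32 of the input resolution
f4 = torch.randn(1, 256, 16, 20)      # assumed shape at 1/16 of the input resolution
fused = FusionBlock(512, 256, 256)(f5, f4)   # shape (1, 256, 16, 20)
```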
6.2 Algorithm execution Process
First, the whole algorithm is executed on the ImageNet data set. The image data is first down-sampled to a resolution of 320 × 240, and a data augmentation method (small-range rotation, scaling, color conversion and random horizontal flipping) is applied to expand the sample volume and increase image diversity, so that the trained network model is more robust.
Second, the parameters of the whole network model are initialized; the feature extraction part uses a pre-trained result as its initial parameters. The convolution layer parameters, such as those of the skip layers, are initialized with random numbers drawn from a normal distribution with mean 0 and variance 0.01, and the deconvolution layer parameters are initialized with the filter parameters of bilinear interpolation.
Third, the network is trained with stochastic gradient descent (SGD); the selected training loss function is as follows:
where l(x) represents the loss function value, the predicted depth image is the output of the network, y is the ground-truth label depth image in the training image data set, and c is one fifth of the maximum difference between the predicted depth image and the corresponding label depth image. Here i indexes the pixels of the image, i.e. the loss acts over the set of image pixels of the entire training set, and x ∈ (-c, c).
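The loss formula image is not reproduced in the text. The description (a threshold c set to one fifth of the maximum prediction error, with a separate regime for |x| ≤ c) matches the reverse Huber (berHu) loss commonly used for monocular depth estimation; assuming that form, the loss would read:

$$l(x) = \begin{cases} |x|, & |x| \le c \\ \dfrac{x^2 + c^2}{2c}, & |x| > c \end{cases}, \qquad x = \hat{y}_i - y_i, \quad c = \tfrac{1}{5}\max_i\,\lvert \hat{y}_i - y_i \rvert$$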
The loss function measures the difference between the network output prediction image and the tag depth image provided for training, i.e. l (x) is required to converge gradually in the training process, i.e. the difference is reduced gradually to optimize the network parameters gradually.
Fourth, the whole network is trained for 20 epochs with a batch size of 5 and an initial learning rate of 0.01, which is gradually reduced every 5 epochs, for example to 0.5 times the previous learning rate.
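A configuration sketch matching these settings (the model, data loader, momentum value and the berhu_loss name are assumptions; only the schedule values reflect the description):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)           # momentum assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)   # halve every 5 epochs

for epoch in range(20):                      # 20 training epochs
    for grays, depth_gt in train_loader:     # batches of size 5
        optimizer.zero_grad()
        depth_pred = model(grays)
        loss = berhu_loss(depth_pred, depth_gt)   # training loss from step 6.2 (placeholder name)
        loss.backward()
        optimizer.step()
    scheduler.step()
```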
From the depth image output in step 6, the depth information of every pixel of the image can be obtained, and on the basis of steps 5 and 6 the three-dimensional coordinates of the target tracking point are completely determined. Real-time, frame-by-frame tracking of the target tracking point requires its three-dimensional coordinate data, which must also be transmitted to the surgical robot control algorithm to guide the robot through the corresponding inverse motion.
For monocular image depth estimation, objective evaluation criteria need to be applied in specifically evaluating how the numerical performance of the resulting image behaves. The depth perception estimation evaluation index comprises: average relative error, root mean square error, threshold accuracy.
And 7, initializing a target tracking point.
The grayscale image, the binary image, and the distance image including the target object (surgical instrument) are sequentially obtained in step 2. Step 3 defines the position of the target tracking point, steps 5 and 6 find out the three-dimensional coordinate of the target tracking point, and finally the position of the target tracking point is returned to the original frame. To accomplish the tracking task, the target tracking point must be located frame by frame in the laparoscopic video. The position of the target tracking point in the first frame of the video stream is the initial position of the tracking task. The next frame of the laparoscopic video is used as input for target extraction and target tracking point positioning. The target area of the current frame (the target area containing the surgical instruments) will be limited to the surrounding of the surgical instrument extraction area of the previous frame to reduce the calculation process.
And 8, generating a tracking mask of the target area of the surgical instrument.
After the real-time three-dimensional coordinates of the target tracking points are obtained through the steps 5 and 6, the target tracking points are positioned frame by adopting a mask method.
8.1 determine the mask. The target area is an elongated area and an elliptical mask is introduced to obtain the processed area. Fitting an ellipse by using the edge of the target area in the previous frame by using a least square method, wherein the ellipse is defined as follows:
Ax² + Bxy + Cy² + Dx + Ey + F = 0
min ‖MᵀX‖² = MᵀX XᵀM
where M = [A, B, C, D, E, F]ᵀ is the parameter matrix of the fitted ellipse, x and y are the variables of the ellipse equation, X = [u², uv, v², u, v, 1]ᵀ is the input matrix of the fitted ellipse built from a pixel point (u, v) ∈ P, P = {(u₁, v₁), (u₂, v₂), …, (u_N, v_N)} is the set of all pixel points on the edge of the target area, and N denotes the number of pixel points in the set.
8.2 Extracting the target area of the current frame: next, the ellipse mask of the current frame is generated by scaling the ellipse while keeping it at the same center. Finally, the ellipse mask is used to obtain the target area of the current frame; the mask generation process is as follows:
where I and I_M represent the original image and the masked image, respectively, and M is the mask generated in the previous cycle.
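The mask formula image is not reproduced in the text; a Python/OpenCV sketch of the process it describes (fit the ellipse to the previous frame's edge, scale it about the same center, and restrict the current frame to the mask region; the scale factors follow the example values given later in the description, and the function names are illustrative):

```python
import cv2
import numpy as np

def make_mask(prev_edge_points, frame_shape, scale_major=1.2, scale_minor=2.0):
    """Fit an ellipse by least squares to the previous frame's target-region edge,
    keep its center and orientation, enlarge the axes, and rasterize it as a mask."""
    (cx, cy), (d1, d2), angle = cv2.fitEllipse(prev_edge_points)   # least-squares ellipse fit
    axes = (int(d1 * scale_major / 2), int(d2 * scale_minor / 2))  # scaled semi-axes
    mask = np.zeros(frame_shape[:2], dtype=np.uint8)
    cv2.ellipse(mask, (int(cx), int(cy)), axes, angle, 0, 360, 255, -1)
    return mask

def apply_mask(frame, mask):
    """I_M: the original image I restricted to the elliptical mask region M."""
    return cv2.bitwise_and(frame, frame, mask=mask)
```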
And 9, realizing the tracking process.
After each frame of the video stream is segmented in step 1, the gray-level image is filtered in step 2 to reduce image noise and adaptively thresholded to screen out the target region containing the surgical instrument. At the same time, the target area containing the instrument must meet the following conditions:
the area of the image is larger than the minimum limit (the ratio of the target area containing the surgical instruments in the binary image after the self-adaptive threshold processing in the step 2 is larger than 5 percent, namely the ratio of the foreground area in the binary image is larger than 5 percent);
it crosses the image boundary, i.e. the tail end boundary of the target area coincides with the boundary of the original frame image.
If either of these conditions is not met, the current image frame is deemed to have failed processing and a new processing loop should be initiated and started from the next frame.
After finding the instrument target region, an ellipse is fitted to its edges in the image using a least squares error method. And 5, taking the target tracking point obtained in the steps 5 and 6 as the central point of the elliptical mask, and simultaneously obtaining the three-dimensional coordinate of the target tracking point after the pixel coordinate of the target tracking point is subjected to coordinate conversion in the steps 5 and 6 to be used for guiding the surgical robot. The mask ellipse of the current frame is also used to generate the mask for the next frame. I.e., the ellipse mask size of the current frame is scaled by a certain scale, e.g., 1.2 times the major diameter and 2 times the minor diameter, to generate the next frame mask ellipse. Then, in the next mask generation phase, only the portion of the image frame will undergo the process of step 9. The moving amplitude of the surgical instrument in the laparoscopic surgery is not large, assuming that a target tracking area is extracted from a first frame, a mask is generated based on the current frame (the first frame), and then the mask is properly scaled in the next frame (the second frame), so that the instrument can occupy a larger area in the mask, which is beneficial to improving the tracking efficiency, the mask of the first frame is not used in a third frame image, but the surgical instrument is tracked in the mask of the second frame, and the purpose of tracking the surgical instrument can be better overlapped with the mask so as to avoid losing the target.
In order to accelerate the processing speed and realize real-time operation, the invention provides a multithreading parallel processing method, aiming at executing different functions in parallel by using a multi-core CPU/GPU.
The first thread acquires a frame from the input device (laparoscope) and starts executing the instrument tracking program. After the image is acquired, the pixel coordinate information of the target tracking point is acquired in step 2, the three-dimensional coordinate data of the target tracking point is acquired in steps 5 and 6, and after the target tracking point is determined, the first thread sends the three-dimensional coordinate (after 2D-3D conversion) of the target tracking point to the second thread and starts to process the next frame. The second thread is started by receiving three-dimensional coordinate data of the target tracking point from the first thread and position data of the robot joint from the third thread. Based on this information, the second thread generates a path and sends appropriate motion commands to the actuators of the robot.
Finally, the third thread continuously collects position data from the robot position sensor and keeps the variables up to date for use by the second thread. Immediately after detecting the target tracking point in the next frame, the first thread attempts to start the second thread again. However, if the process of the last path generation has not completed, the new process will not start and the first thread continues with the next frame.
In this way, the second thread receives up-to-date information of the tip coordinates all the time, which helps to maintain real-time operation in case the path generation process takes too long.
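A minimal Python sketch of the three-thread scheme described above (the queue, the locate_target_point_3d and generate_path functions, and the robot interface names are placeholders, not the patent's actual API):

```python
import threading
import queue

target_points = queue.Queue(maxsize=1)   # latest 3-D target tracking point
joint_state = {"positions": None}        # kept up to date by the third thread
path_busy = threading.Event()            # set while path generation is running

def tracking_thread(capture):            # thread 1: grab frames, locate the tracking point
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        xyz = locate_target_point_3d(frame)          # steps 2, 5 and 6 (placeholder)
        if xyz is not None and not path_busy.is_set():
            if target_points.full():
                target_points.get_nowait()           # keep only the newest coordinate
            target_points.put(xyz)                   # hand over to thread 2, continue with the next frame

def path_thread(robot):                  # thread 2: generate a path, command the robot
    while True:
        xyz = target_points.get()
        path_busy.set()
        robot.move_along(generate_path(xyz, joint_state["positions"]))   # placeholder calls
        path_busy.clear()

def sensor_thread(robot):                # thread 3: keep the joint positions up to date
    while True:
        joint_state["positions"] = robot.read_joint_positions()          # placeholder call
```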
Step 11, analysis of the markerless visual tracking performance of the surgical instrument.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.
Claims (5)
1. A visual tracking method of surgical instruments for laparoscopic minimally invasive surgery is characterized by comprising the following steps:
processing the acquired laparoscopic surgery video stream frame by frame to segment the target region of the surgical instrument and obtain a gray-level image containing the target region; performing image-enhancement preprocessing on the gray-level image, including spatial filtering of the gray-level image, then applying adaptive threshold processing to convert the filtered image into a binary image containing the target region, and then performing Euclidean distance conversion on the binary image to obtain a distance image; determining, based on the distance image, the points of maximum pixel value in the shaft portion and the end-effector portion of the surgical instrument so as to determine the position of the target tracking point;
among the maximum gray-level points of each connected region in the distance image, the point with the maximum pixel value in the shaft portion of the surgical instrument is recorded as point M, and the point with the maximum pixel value in the end-effector portion is recorded as point N; the line L connecting point M and point N is the axis of the working part of the surgical instrument; on the axis L, if a point is located on the shaft portion and the next point is located on the end-effector portion, that point is taken as the target tracking point S, and finally the pixel coordinate information of the target tracking point S is output;
calibrating the laparoscopic camera, and calculating the X-axis and Y-axis coordinates of the target tracking point in the world coordinate system in each frame of image by using the intrinsic and extrinsic parameter matrices obtained after calibration; acquiring the depth information of the target tracking point in the image by using a deep-learning-based monocular visual depth estimation method, thereby obtaining the Z-axis coordinate of the target tracking point in the world coordinate system and thus the complete coordinates of the target tracking point in the world coordinate system; the monocular visual depth estimation method extracts the depth information by establishing a prediction model, wherein the prediction model comprises a feature extraction network and a fusion prediction network; the feature extraction network acquires local features of the image with its lower-level layers and semantic features of the image with its higher-level layers; the feature extraction network comprises first to fifth encoders with successively deeper network levels, each encoder comprising a plurality of sequentially connected convolution layers with a maximum pooling layer connected after the last convolution layer; the fusion prediction network comprises a skip layer, a batch normalization layer and a deconvolution layer connected in sequence; starting from the feature map output by the highest layer of the feature extraction network, namely the maximum pooling layer of the fifth encoder, the feature map is processed by the skip layer, the batch normalization layer and the deconvolution layer of the fusion prediction network and is then fused with the features output by the preceding encoder at the skip layer; this fusion-prediction processing followed by skip-layer fusion with the output of the preceding encoder is repeated encoder by encoder, and the result finally passes through a convolution layer and is output as the depth image;
initializing the target tracking point, wherein the position of the target tracking point in the first frame of the video stream is the initialization position of the tracking task, and the next frame of the video stream is used as the input for target extraction and target-tracking-point positioning; positioning the target tracking point frame by frame: first, introducing an elliptical mask for the target region to obtain the processing region, fitting an ellipse by the least-squares method to the edge of the target region of the previous frame; second, keeping the ellipse centered at the same point, generating the elliptical mask of the current frame by scaling the ellipse; finally, obtaining the target region of the current frame by using the elliptical mask;
after the target region of the current frame is obtained, if the target region does not meet the preset conditions, the current frame is deemed to have failed processing and is re-initialized and processed again;
after the target region is found, fitting an ellipse to the edge of the target region in the image by using the least-squares error method; taking the target tracking point as the center point of the elliptical mask, and simultaneously using the three-dimensional coordinates of the target tracking point in the world coordinate system to guide the surgical robot; the mask ellipse of the current frame is also used to generate the mask for the next frame.
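As an illustration of the tracking-point selection in claim 1, the simplified sketch below (OpenCV and NumPy assumed) computes the Euclidean distance transform, takes the distance maxima M and N inside hypothetical binary masks of the shaft portion and the end-effector portion, and walks along the line L to find the point S.

```python
import cv2
import numpy as np

def tracking_point(instrument_binary, shaft_mask, effector_mask):
    """Return the pixel coordinates (u, v) of the target tracking point S, or None."""
    dist = cv2.distanceTransform(instrument_binary, cv2.DIST_L2, 5)          # distance image
    M = np.unravel_index(np.argmax(dist * (shaft_mask > 0)), dist.shape)     # shaft maximum
    N = np.unravel_index(np.argmax(dist * (effector_mask > 0)), dist.shape)  # end-effector maximum
    # Walk along the line L from M towards N; S is the last point on the shaft whose
    # successor lies on the end effector.
    num = int(np.hypot(N[0] - M[0], N[1] - M[1])) + 1
    rows = np.linspace(M[0], N[0], num).astype(int)
    cols = np.linspace(M[1], N[1], num).astype(int)
    for k in range(num - 1):
        if shaft_mask[rows[k], cols[k]] and effector_mask[rows[k + 1], cols[k + 1]]:
            return (cols[k], rows[k])                 # pixel coordinates (u, v) of S
    return None
```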
2. The method for visual tracking of surgical instruments for laparoscopic minimally invasive surgery of claim 1, wherein said method further comprises:
using a multithreaded parallel processing approach, wherein the first thread acquires frames from the laparoscope and starts executing the instrument tracking program: after the image is obtained, the pixel coordinate information and the three-dimensional coordinate data of the target tracking point are determined, and after the target tracking point is determined, the first thread sends the three-dimensional coordinate data to the second thread and starts to process the next frame; the second thread is started by receiving the three-dimensional coordinate data of the target tracking point from the first thread and the position data of the robot joints from the third thread, and based on this information, the second thread generates a path and sends corresponding motion commands to the actuators of the robot;
finally, the third thread continuously collects position data from the robot position sensor and keeps the variables up to date for the second thread to use; after detecting the target tracking point in the next frame, the first thread immediately tries to start the second thread again; however, if the process of the last path generation has not completed, the new process will not start and the first thread continues with the next frame.
3. The visual tracking method of the surgical instrument for laparoscopic minimally invasive surgery as claimed in claim 1, wherein a lightweight network, LinkNet, is used as the segmentation model to perform the segmentation of each frame of image in the video stream and the target region extraction; the segmentation model is trained on an open dataset using ResNet18 as the encoder; in order to train the segmentation model, the output of the LinkNet network is evaluated and the back-propagation is guided by two loss functions, namely the cross-entropy loss E and the Jaccard index J, wherein the expression of the combined loss function is as follows:
F = αE - (1 - α) ln J
wherein α is a weight parameter taking a value from 0 to 1; N denotes the total number of pixels in the laparoscopic image, p_i represents the predicted probability that the i-th pixel belongs to the target region, and q_i represents the ground-truth value of the i-th pixel.
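The combined loss of claim 3 can be sketched as follows; PyTorch is an assumption (the claim names no framework), and the standard binary cross-entropy and a soft, differentiable Jaccard index are assumed for E and J, whose exact expressions are not reproduced in the claim.

```python
import torch

def combined_loss(p, q, alpha=0.5, eps=1e-7):
    """p: predicted per-pixel probabilities, q: ground-truth labels (0/1), same shape."""
    p, q = p.flatten(), q.flatten().float()
    E = torch.nn.functional.binary_cross_entropy(p, q)     # cross-entropy term E
    intersection = (p * q).sum()
    union = p.sum() + q.sum() - intersection
    J = (intersection + eps) / (union + eps)                # soft Jaccard index J
    return alpha * E - (1.0 - alpha) * torch.log(J)         # F = αE - (1 - α) ln J
```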
4. The method for visual tracking of surgical instruments for laparoscopic minimally invasive surgery according to claim 1, wherein said ellipse is defined as follows:
Ax^2 + Bxy + Cy^2 + Dx + Ey + F = 0

min ||M^T X||^2 = M^T X X^T M

wherein M = [A, B, C, D, E, F]^T represents the parameter matrix of the fitted ellipse, X = [u^2, uv, v^2, u, v, 1]^T represents the input matrix of the fitted ellipse constructed from the pixel points (u, v) ∈ P, P = {(u_1, v_1), (u_2, v_2), ..., (u_N, v_N)} represents the pixel point set comprising all pixel points on the edge of the target region, and N represents the number of pixel points in the set.
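The minimization of claim 4 admits a direct least-squares sketch (NumPy assumed): under the usual normalization constraint ||M|| = 1, which the claim leaves implicit, the minimizer of M^T X X^T M is the eigenvector of X X^T associated with its smallest eigenvalue.

```python
import numpy as np

def fit_ellipse_params(edge_points):
    """edge_points: N x 2 array of (u, v) pixel coordinates on the target-region edge."""
    u, v = edge_points[:, 0].astype(float), edge_points[:, 1].astype(float)
    X = np.stack([u**2, u * v, v**2, u, v, np.ones_like(u)])   # 6 x N input matrix
    _, _, Vt = np.linalg.svd(X @ X.T)                           # eigen-decomposition of X X^T
    return Vt[-1]                                               # M = [A, B, C, D, E, F]^T, ||M|| = 1
```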
5. The method for visually tracking surgical instruments for laparoscopic minimally invasive surgery according to claim 1, wherein the preset conditions are:
the area of the target region is larger than a minimum limit, wherein the minimum limit is 5% of the area of the binary image;
the boundary of the tail end of the target area is overlapped with the boundary of the original frame image;
if any one of the conditions is not satisfied, the current image frame is considered to have failed processing.
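The two preset conditions of claim 5 can be checked in a few lines of NumPy, as in the sketch below; the boundary test here simply checks whether any target pixel lies on the image border, which is an approximation of the tail-end condition.

```python
import numpy as np

def target_region_valid(target_binary, min_area_ratio=0.05):
    """target_binary: 2-D array whose non-zero pixels form the target region."""
    area_ok = np.count_nonzero(target_binary) > min_area_ratio * target_binary.size
    border = np.concatenate([target_binary[0, :], target_binary[-1, :],
                             target_binary[:, 0], target_binary[:, -1]])
    touches_border = np.any(border > 0)      # tail end coincides with the image boundary
    return area_ok and touches_border
```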
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110922513.8A CN113538522B (en) | 2021-08-12 | 2021-08-12 | Instrument vision tracking method for laparoscopic minimally invasive surgery |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113538522A true CN113538522A (en) | 2021-10-22 |
CN113538522B CN113538522B (en) | 2022-08-12 |
Family
ID=78090935
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110922513.8A Active CN113538522B (en) | 2021-08-12 | 2021-08-12 | Instrument vision tracking method for laparoscopic minimally invasive surgery |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113538522B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170312032A1 (en) * | 2016-04-27 | 2017-11-02 | Arthrology Consulting, Llc | Method for augmenting a surgical field with virtual guidance content |
US20200367974A1 (en) * | 2019-05-23 | 2020-11-26 | Surgical Safety Technologies Inc. | System and method for surgical performance tracking and measurement |
CN110210551A (en) * | 2019-05-28 | 2019-09-06 | 北京工业大学 | A kind of visual target tracking method based on adaptive main body sensitivity |
CN110246151A (en) * | 2019-06-03 | 2019-09-17 | 南京工程学院 | A kind of underwater robot method for tracking target based on deep learning and monocular vision |
US20210090226A1 (en) * | 2019-09-23 | 2021-03-25 | Boston Scientific Scimed, Inc. | System and method for endoscopic video enhancement, quantitation and surgical guidance |
CN111739078A (en) * | 2020-06-15 | 2020-10-02 | 大连理工大学 | Monocular unsupervised depth estimation method based on context attention mechanism |
CN113143461A (en) * | 2021-01-26 | 2021-07-23 | 合肥工业大学 | Man-machine cooperative minimally invasive endoscope holding robot system |
Non-Patent Citations (2)
Title |
---|
WENBO LIU ET AL.: "Cortical vessel imaging and visualization for image guided depth electrode insertion", Computerized Medical Imaging and Graphics *
WANG ZIYUN: "Research on surgical instrument segmentation in endoscopic images and pose estimation of the surgical forceps tip", China Master's Theses Full-text Database, Medicine and Health Sciences *
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114140386A (en) * | 2021-10-28 | 2022-03-04 | 合肥工业大学 | Method, system and device for tracking visual manual selection curve of remote operation guidance |
WO2023116333A1 (en) * | 2021-12-21 | 2023-06-29 | 广州市微眸医疗器械有限公司 | Robot-assisted automatic trocar docking method and apparatus |
CN114663474A (en) * | 2022-03-10 | 2022-06-24 | 济南国科医工科技发展有限公司 | Multi-instrument visual tracking method for laparoscope visual field of endoscope holding robot |
CN114663474B (en) * | 2022-03-10 | 2024-07-02 | 济南国科医工科技发展有限公司 | Multi-instrument visual tracking method in laparoscopic visual field based on multi-task learning strategy |
CN114299072A (en) * | 2022-03-11 | 2022-04-08 | 四川大学华西医院 | Artificial intelligence-based anatomy variation identification prompting method and system |
CN114366313A (en) * | 2022-03-21 | 2022-04-19 | 杭州华匠医学机器人有限公司 | Endoscope holding robot control method based on laparoscopic surgical instrument pose |
CN114366313B (en) * | 2022-03-21 | 2022-08-02 | 杭州华匠医学机器人有限公司 | Endoscope holding robot control method based on laparoscopic surgical instrument pose |
CN115122342A (en) * | 2022-09-02 | 2022-09-30 | 北京壹点灵动科技有限公司 | Software architecture for controlling a robot and control method of a robot |
CN115122342B (en) * | 2022-09-02 | 2022-12-09 | 北京壹点灵动科技有限公司 | Software system for controlling robot and control method of robot |
CN115205769A (en) * | 2022-09-16 | 2022-10-18 | 中国科学院宁波材料技术与工程研究所 | Ophthalmologic operation skill evaluation method, system and storage medium |
CN115607285B (en) * | 2022-12-20 | 2023-02-24 | 长春理工大学 | Single-port laparoscope positioning device and method |
CN115607285A (en) * | 2022-12-20 | 2023-01-17 | 长春理工大学 | Single-port laparoscope positioning device and method |
CN117437392A (en) * | 2023-12-15 | 2024-01-23 | 杭州锐健医疗科技有限公司 | Cruciate ligament dead center marker and model training method and arthroscope system thereof |
CN117437392B (en) * | 2023-12-15 | 2024-03-26 | 杭州锐健医疗科技有限公司 | Cruciate ligament dead center marker and model training method and arthroscope system thereof |
Also Published As
Publication number | Publication date |
---|---|
CN113538522B (en) | 2022-08-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113538522B (en) | Instrument vision tracking method for laparoscopic minimally invasive surgery | |
KR102013806B1 (en) | Method and apparatus for generating artificial data | |
US20220207728A1 (en) | Quality assessment in video endoscopy | |
CN110175558B (en) | Face key point detection method and device, computing equipment and storage medium | |
Probst et al. | Automatic tool landmark detection for stereo vision in robot-assisted retinal surgery | |
JP7389116B2 (en) | Deep neural network pose estimation system | |
WO2018003503A1 (en) | Image | |
da Costa Rocha et al. | Self-supervised surgical tool segmentation using kinematic information | |
Saint-Pierre et al. | Detection and correction of specular reflections for automatic surgical tool segmentation in thoracoscopic images | |
EP2931161A1 (en) | Markerless tracking of robotic surgical tools | |
JP2015522200A (en) | Human face feature point positioning method, apparatus, and storage medium | |
KR20130109988A (en) | Method and system for automatic tool position determination for minimally-invasive surgery training | |
Su et al. | Comparison of 3d surgical tool segmentation procedures with robot kinematics prior | |
US20220027602A1 (en) | Deep Learning-Based Three-Dimensional Facial Reconstruction System | |
Pakhomov et al. | Towards unsupervised learning for instrument segmentation in robotic surgery with cycle-consistent adversarial networks | |
CN111898571A (en) | Action recognition system and method | |
CN112184720B (en) | Method and system for segmenting internal rectus muscle and optic nerve of CT image | |
CN117011381A (en) | Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision | |
CN111080676A (en) | Method for tracking endoscope image sequence feature points through online classification | |
Cabras et al. | Comparison of methods for estimating the position of actuated instruments in flexible endoscopic surgery | |
JP2010110500A (en) | Imaging device and imaging method | |
CN117351957A (en) | Lip language image recognition method and device based on visual tracking | |
CN114972881A (en) | Image segmentation data labeling method and device | |
Zenteno et al. | 3D Cylinder Pose Estimation by Maximization of Binary Masks Similarity: A simulation Study for Multispectral Endoscopy Image Registration. | |
Selka et al. | Context-specific selection of algorithms for recursive feature tracking in endoscopic image using a new methodology |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||