CN114627136B - Tongue image segmentation and alignment method based on feature pyramid network - Google Patents
- Publication number
- CN114627136B (application CN202210106423.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06T 7/11 — Region-based segmentation
- G06F 18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06N 3/048 — Activation functions
- G06N 3/08 — Learning methods
- G06T 5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction
- G06T 7/337 — Image registration using feature-based methods involving reference images or patches
- G06T 2207/20016 — Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; pyramid transform
- G06T 2207/20081 — Training; learning
- G06T 2207/20104 — Interactive definition of region of interest [ROI]
- G06T 2207/20221 — Image fusion; image merging
- G06T 2207/30204 — Marker
Abstract
The invention provides a tongue image segmentation and alignment method based on a feature pyramid network. A feature pyramid is constructed to perform multi-scale fusion of the network's low-level detail features and high-level semantic features; a target detection network crops the tongue-body region, and a mask is then generated for each feature map to complete tongue image segmentation. The segmented tongue images are further processed and aligned by conformal mapping, providing an efficient and accurate method for processing and diagnosing deformed medical images, which is of significance for the objectification of tongue diagnosis.
Description
Technical Field
The invention relates to the fields of computer vision, image processing and traditional Chinese medicine tongue diagnosis, in particular to a tongue image segmentation and alignment method based on a feature pyramid network.
Background
Traditional Chinese medicine theory holds that the tongue is a window onto changes in the internal organs of the human body, reflecting the nature of pathogenic factors and the waxing and waning of pathogenic qi and healthy qi. Tongue inspection is an important component of inspection diagnosis in traditional Chinese medicine: the health of the human body is judged by observing tongue features such as the tongue body, tongue quality, tongue coating and sublingual collaterals, and it is one of the main bases of the clinical diagnosis process of traditional Chinese medicine.
Tongue diagnosis is painless and non-invasive, and is one of the most commonly used diagnostic methods in traditional Chinese medicine. Appearance characteristics of the tongue body such as colour, texture, shape and tongue coating reveal a great deal of information about the health of the human body. However, traditional tongue diagnosis depends heavily on the clinician's experience; different doctors diagnosing the same patient may reach different conclusions, which limits the objectivity of traditional Chinese medicine tongue diagnosis. Computer-aided tongue diagnosis can mitigate these shortcomings, but the collected tongue picture also contains background regions such as the face and lips, and some tongue bodies are tilted, which interferes with subsequent auxiliary diagnosis and treatment; the tongue body therefore needs to be segmented and aligned first.
At present, traditional tongue segmentation methods fall roughly into three types: edge-based methods, region-based methods, and methods fusing region and edge. Region-based methods use a watershed algorithm to divide the tongue picture into many small regions and then merge regions by colour similarity to obtain the final tongue segmentation. Edge-based methods require edge initialisation before the final tongue can be segmented. Fusion methods first extract the tongue region of interest using colour information and then perform the subsequent segmentation on that region instead of the original image. Although these approaches have achieved some success, limitations remain: for example, the lips and the tongue are very similar in colour and require additional preprocessing, which complicates the whole pipeline, and tongues with smooth edges cannot be segmented accurately.
In recent years, with the rapid development of deep learning, methods based on deep convolutional neural networks have been developed to improve the robustness of tongue segmentation. Because deep neural networks have strong feature-learning ability, deep-learning-based methods outperform traditional tongue segmentation methods. The shallow layers of a deep network carry more detail information while the high layers carry rich semantic information; however, most networks segment using only the last high-level feature and ignore the shallow detail features, even though shallow detail information can improve segmentation accuracy to a certain extent. A single high-level feature cannot achieve the ideal segmentation result.
Disclosure of Invention
To solve the problems of traditional methods and deep networks in tongue image segmentation, and drawing on the idea of residual networks, the invention provides a tongue image segmentation and alignment method based on a feature pyramid network.
To achieve the above purpose, the technical scheme adopted by the invention is as follows: a tongue image segmentation and alignment method based on a feature pyramid network, comprising the following steps:
step 1: collecting the required tongue pictures with a tongue image acquisition device, and applying data enhancement to the pictures;
step 2: manually annotating the tongue body in all picture data;
step 3: preprocessing the annotated pictures, then feeding them into the constructed feature pyramid network for feature extraction to obtain effective feature layers;
step 4: feeding the effective feature layers extracted in step 3 into an RPN to obtain proposal boxes;
step 5: applying a max pooling operation to all proposal boxes obtained in step 4;
step 6: connecting two fully connected layers after the max-pooled proposal features, judging whether each proposal box contains an object, and then adjusting the proposal boxes to obtain prediction boxes;
step 7: using the prediction boxes obtained in step 6 to crop regions for the mask model, and classifying the pixels in each cropped region with the mask model to obtain the semantic segmentation result;
step 8: computing the boundary region and Fourier coefficients of the segmented tongue image;
step 9: computing the correspondence between the boundary of the unit disk and the boundary of the tongue picture;
step 10: after the boundary values are determined, computing the mapping of the interior region using the Cauchy integral formula;
step 11: constructing the conformal mapping, updating the real and imaginary parts of the function iteratively;
step 12: outputting the aligned image.
Further, in step 1, when data enhancement is applied to the pictures, the data set is expanded by rotation and horizontal flipping.
Further, in step 2, Labelme software is used to manually annotate the tongue body in all picture data.
Further, in step 5, all proposal boxes are fixed to 7×7 features after the max pooling operation.
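Pooling a variable-sized proposal to a fixed 7×7 grid can be sketched with a minimal numpy routine. This is an illustrative stand-in for the network's ROI pooling, not the patent's implementation; the bin-splitting scheme and shapes are assumptions.

```python
import numpy as np

def roi_max_pool(feature: np.ndarray, out_size: int = 7) -> np.ndarray:
    """Max-pool a variable-sized ROI feature map (H, W, C) to (out_size, out_size, C)."""
    h, w, c = feature.shape
    out = np.empty((out_size, out_size, c), dtype=feature.dtype)
    # Split the ROI into an out_size x out_size grid of bins and take the max of each bin.
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            bin_ = feature[ys[i]:max(ys[i + 1], ys[i] + 1),
                           xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = bin_.max(axis=(0, 1))
    return out

roi = np.random.rand(23, 31, 8)   # an irregularly sized proposal feature map
pooled = roi_max_pool(roi)        # fixed 7x7x8 output, whatever the input size
```

Whatever the proposal's spatial size, the output is always 7×7 per channel, which is what allows the subsequent fully connected layers to have a fixed input dimension.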
Further, the feature pyramid network is constructed from the outputs of the 4 middle bottleneck (bneck) blocks of MobileNetV2.
Compared with the prior art, the invention has the following beneficial effects: by constructing a feature pyramid, multi-scale fusion of the network's low-level detail features and high-level semantic features is performed; a target detection network first crops the tongue region, and a mask is then generated for each feature map, completing tongue image segmentation. The segmented tongue images are further processed and aligned by conformal mapping, providing an efficient and accurate method for processing and diagnosing deformed medical images, which is of significance for the objectification of tongue diagnosis.
Drawings
FIG. 1 is a flow chart of a tongue segmentation and alignment method based on a feature pyramid network of the present invention;
FIG. 2 is a schematic view of the shape of the unit disks mentioned in the examples;
FIG. 3 is a schematic illustration of the shape of the standard tongue as referred to in the examples;
fig. 4 is a tongue labeling diagram given in the example.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all, embodiments of the present invention, and all other embodiments obtained by those skilled in the art without making any inventive effort based on the embodiments of the present invention are within the scope of protection of the present invention.
In the invention, a tongue image segmentation and alignment method based on a feature pyramid network is provided. MobileNetV2 is used as the backbone feature extraction network; the feature pyramid network is constructed from the outputs of the 4 middle bottleneck (bneck) blocks of MobileNetV2, and multi-scale feature fusion strengthens the network's attention to detail features, which benefits more accurate segmentation. The segmented images are aligned with a tongue alignment method based on conformal mapping; the main flow is described in detail below.
The invention is mainly used to segment tongue pictures shot in a standard environment, where the shot picture includes the face and lips. To segment the tongue more conveniently and rapidly, the specific position of the tongue is detected first and the required tongue part is selected with a bounding box, after which the complete tongue is segmented. The invention has two main parts: the first locates and segments the tongue picture; the second aligns the segmented tongue picture through conformal mapping. The flow diagram of the invention is shown in Fig. 1, and the specific steps are as follows:
1. tongue positioning and segmentation
Locating the tongue body means first detecting its position in the image: the tongue pixels are foreground points and the remainder are background points. Progressively adjusted prediction boxes become more accurate, and a mask is generated for each prediction box at the same time, segmenting the tongue. A Mask R-CNN network model is used with its backbone replaced by MobileNetV2, and high-resolution and low-resolution feature maps are combined to make full use of low-level detail information and high-level semantic information for more accurate segmentation. The specific work is as follows:
(1) Basic feature extraction: the tongue picture is input to the network and first passes through a standard convolution to obtain a feature map. A 1×1 convolution then expands the channels of the feature map to enrich the features, a 3×3 convolution integrates the features, and a final 1×1 convolution compresses them; a skip connection links the input feature map with the compressed one, enriching the features while reducing the parameter count. For each extracted high-level feature map, deconvolution raises its spatial resolution to that of the feature map one level below; the two are fused by element-wise addition, and the fused map is convolved to generate an effective feature layer. Fused feature maps at multiple scales are generated iteratively in this way for subsequent use.
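The upsample-and-add fusion described above can be illustrated with a small numpy sketch. Here `upsample2x` is a nearest-neighbour stand-in for the deconvolution, and the level sizes and channel count are made-up examples (the real network would first match channels with 1×1 convolutions).

```python
import numpy as np

def upsample2x(x: np.ndarray) -> np.ndarray:
    """Nearest-neighbour 2x upsampling of a (H, W, C) feature map (stand-in for deconvolution)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fuse(high: np.ndarray, low: np.ndarray) -> np.ndarray:
    """Upsample the coarser (higher-level) map and add it to the finer (lower-level) map."""
    up = upsample2x(high)
    assert up.shape == low.shape, "adjacent levels must differ by exactly one stride-2 stage"
    return up + low

# Three pyramid levels, coarse to fine, channels already matched.
c5 = np.random.rand(8, 8, 16)
c4 = np.random.rand(16, 16, 16)
c3 = np.random.rand(32, 32, 16)
p4 = fuse(c5, c4)   # semantic info from c5 merged into c4's resolution
p3 = fuse(p4, c3)   # and propagated down to the finest level
```

The top-down pass means even the finest output level `p3` carries the high-level semantics of `c5` while keeping `c3`'s spatial detail.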
(2) Obtaining prediction boxes: the extracted multi-scale effective feature maps are fed into an RPN (Region Proposal Network) to generate proposal boxes, and overlapping proposals are removed by non-maximum suppression. The proposals generated by the RPN are sent, together with the multi-scale effective feature layers, to an ROI pooling layer that fixes all region proposals to the same size; the resulting ROI features are concatenated and passed to the fully connected layers and the mask module. The fully connected layers perform category prediction and bounding-box regression to generate more accurate prediction boxes.
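The non-maximum suppression step above can be sketched as standard greedy NMS in numpy; the boxes, scores, and IoU threshold below are illustrative, not values from the patent.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list:
    """Greedy non-maximum suppression. boxes: (N, 4) as [x1, y1, x2, y2]."""
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of box i with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]     # drop boxes overlapping box i too much
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)   # the second box heavily overlaps the first and is suppressed
```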
(3) Mask semantic segmentation: while classification and bounding-box regression are performed, the mask module outputs a binary mask for each region of interest selected by a box. The mask branch first applies a max pooling operation to bring the region of interest to a size of 14×14×256, then performs two deconvolutions with 3×3 kernels, the final deconvolution outputting a 28×28×80 mask. A sigmoid is applied to each pixel in the region of interest, and the average of the cross-entropy over all pixels of the region is taken as the segmentation loss. This yields the class probability of each pixel, realizing semantic segmentation; the pixels of the tongue image are divided into 2 classes, namely the tongue region and the background region.
2. Tongue alignment
For the segmented tongue images an adaptive alignment method is used, which achieves standard alignment for segmented tongue images of different shapes. First, the region mapping of the segmented tongue boundary is constructed using Fourier descriptors; the mapping is then extended to the interior region by combining the Cauchy integral method with a finite difference method; finally, a Riemann mapping realizes the mapping from the segmented, tilted tongue image to the standard tongue image. Figs. 2 and 3 are schematic views of the shapes of the unit disk and the standard tongue image, respectively.
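Fourier descriptors of a closed boundary can be computed with an FFT over the boundary points viewed as complex numbers z = x + jy. A minimal numpy sketch; the ellipse contour stands in for a real tongue boundary, and this FFT convention is one common choice, not necessarily the patent's exact formula (1).

```python
import numpy as np

# Sample K points of a closed contour as complex numbers (ellipse as a toy tongue boundary).
K = 64
t = 2 * np.pi * np.arange(K) / K
boundary = 3 * np.cos(t) + 1j * 2 * np.sin(t)

# Fourier descriptors: normalised DFT coefficients of the contour.
coeffs = np.fft.fft(boundary) / K

# The descriptors are a complete representation: inverting them recovers the boundary.
recon = np.fft.ifft(coeffs * K)
```

Because the descriptors are ordered by frequency, keeping only the low-order coefficients gives a smooth approximation of the boundary, which is what makes them convenient for constructing a boundary correspondence.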
Over the whole mapping process two mappings are obtained: the mapping from the segmented tongue picture to the unit circle, and the mapping from the segmented tongue picture to the standard tongue picture. By Riemann mapping there is a mapping from the original tongue to the unit disk and a mapping from the standard tongue to the unit disk; the latter has an inverse from the unit disk to the standard tongue, and composing this inverse with the former gives the composite mapping from the segmented original tongue to the standard tongue.
Assume the tongue picture region is Ω1, the unit disk is D, and the standard tongue picture region is Ω2. First the mappings from D to Ω1 and from D to Ω2 are obtained; then the mapping from Ω1 to Ω2, which is the mapping finally required, is derived. The method mainly comprises the following steps:
The first step: first compute the corresponding tongue picture boundary, calculating the boundary region with Fourier descriptors through formula (1):

$a_n = \frac{1}{K}\sum_{k=0}^{K-1} \phi_k\, e^{-2\pi j n k / K}$ (1)
The second step: on the basis of the boundary correspondence, further compute the mapping of the interior region. The Cauchy integral formula is shown in formula (2); when Γ is the unit circle, it becomes formula (3):

$F(z_0) = \frac{1}{2\pi j}\oint_{\Gamma}\frac{f(z)}{z - z_0}\,dz$ (2)

$F(z_0) = \frac{1}{2\pi}\int_{0}^{2\pi}\frac{f(e^{j\theta})\,e^{j\theta}}{e^{j\theta} - z_0}\,d\theta$ (3)

where $z$ and $z_0$ are points on the boundary and in the interior respectively, and $f(z_0)$ and $F(z_0)$ denote the mappings of the boundary and interior regions respectively. Discretizing (3) gives formula (4):

$F(z_0) = \frac{1}{K}\sum_{k=0}^{K-1}\frac{\phi_k\, z_k}{z_k - z_0}$ (4)

where $z_k = e^{2j\pi(k/K)}$ is the form of a point on the unit circle and $\phi_k$ is the corresponding tongue boundary value.
And a third step of: constructing a conformal mapping, and setting functions of the conformal mapping as follows:
F=U+j·V (5)
The fourth step: compute the mapping of the original tongue image to the unit circle.

Assume the original tongue picture is denoted $I_0[u,v,w]$ and the image of the original mapped onto the unit circle is denoted $J[x,y,w]$; then

$J[x,y,w] = I_0[U(x,y), V(x,y), w]$ (6)

where $(u,v)\in\Omega_1$ (the original tongue region), $(x,y)\in D$ (the unit circle), and $w = 1,2,3$ indexes the RGB channels.
Fifth step: omega determination 1 To omega 2 Mapping of I 0 [u,v,w]The original tongue picture is represented by the image,representing the standard tongue picture after alignment; for arbitrary->Its corresponding pixel position on the unit circle is denoted +.>The mapping from unit circle to standard tongue image is thus expressed as:
the mapping of the original image to the unit disk is calculated by the formula (6), and the formula (6) is substituted into the formula (7) to obtain the following formula:
thus far, the original image I can be obtained by the formula (8) 0 I to standard tongue picture 1 Is mapped to the mapping of (a).
The specific implementation of this embodiment is as follows:
step 1: collect the required tongue pictures with a tongue image acquisition device, apply data enhancement to the pictures, and expand the data by rotation and horizontal flipping;
step 2: manually annotate the tongue body in all picture data with Labelme software; the tongue picture annotation given in this embodiment is shown in Fig. 4;
step 3: preprocess the annotated pictures and feed them into the constructed feature pyramid network for feature extraction, obtaining effective feature layers;
the feature pyramid network is constructed from the outputs of the 4 middle bottleneck (bneck) blocks of MobileNetV2;
step 4: feed the effective feature layers extracted in step 3 into an RPN to obtain proposal boxes;
step 5: apply a max pooling operation to all proposal boxes obtained in step 4, fixing them all to 7×7 features;
step 6: connect two fully connected layers after the max-pooled proposal features, judge whether each proposal box contains an object, and then adjust the proposal boxes to obtain prediction boxes;
step 7: use the prediction boxes obtained in step 6 to crop regions for the mask model, and classify the pixels in each cropped region with the mask model to obtain the semantic segmentation result;
step 8: compute the boundary region and Fourier coefficients of the segmented tongue image;
step 9: compute the correspondence between the boundary of the unit disk and the boundary of the tongue picture;
step 10: after the boundary values are determined, compute the mapping of the interior region using the Cauchy integral formula;
step 11: construct the conformal mapping, updating the real and imaginary parts of the function iteratively;
step 12: output the aligned images; the algorithm ends.
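The final resampling, where each output pixel looks up source coordinates through the mapping as in formula (6), can be sketched as follows. The nearest-neighbour lookup and the identity-mapping check are illustrative simplifications; in the real pipeline the coordinate grids U, V come from the conformal map.

```python
import numpy as np

def remap(img: np.ndarray, U: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Implement J[x, y, w] = I0[U(x, y), V(x, y), w] by nearest-neighbour lookup.

    U and V hold, for every output pixel (x, y), the source coordinates in the
    original image; coordinates are rounded and clipped to stay in bounds.
    """
    h, w, _ = img.shape
    u = np.clip(np.rint(U).astype(int), 0, h - 1)
    v = np.clip(np.rint(V).astype(int), 0, w - 1)
    return img[u, v]   # fancy indexing gathers all three channels at once

# With the identity mapping, the output must equal the input image.
img = np.random.rand(5, 6, 3)
U, V = np.meshgrid(np.arange(5), np.arange(6), indexing="ij")
out = remap(img, U, V)
```

Composing two such lookups (original image to unit disk, then unit disk to standard tongue) is exactly the substitution of formula (6) into formula (7).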
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (5)
1. A tongue image segmentation and alignment method based on a feature pyramid network is characterized by comprising the following steps:
step 1: collecting the required tongue pictures with a tongue image acquisition device, and applying data enhancement to the pictures;
step 2: manually annotating the tongue body in all picture data;
step 3: preprocessing the annotated pictures, then feeding them into the constructed feature pyramid network for feature extraction to obtain effective feature layers;
step 4: feeding the effective feature layers extracted in step 3 into an RPN to obtain proposal boxes;
step 5: applying a max pooling operation to all proposal boxes obtained in step 4;
step 6: connecting two fully connected layers after the max-pooled proposal features, judging whether each proposal box contains an object, and then adjusting the proposal boxes to obtain prediction boxes;
step 7: using the prediction boxes obtained in step 6 to crop regions for the mask model, and classifying the pixels in each cropped region with the mask model to obtain the semantic segmentation result;
step 8: computing the boundary region and Fourier coefficients of the segmented tongue image;
step 9: computing the correspondence between the boundary of the unit disk and the boundary of the tongue picture;
step 10: after the boundary values are determined, computing the mapping of the interior region using the Cauchy integral formula;
step 11: constructing the conformal mapping, updating the real and imaginary parts of the function iteratively;
step 12: outputting the aligned image;
in the above steps, the specific steps for obtaining the mapping are as follows: the tongue picture area is set to be omega 1, the unit disc is set to be D, and the standard tongue picture area is set to be omega 2; firstly, the mapping from D to omega 1 and the mapping from D to omega 2 are obtained, and then the mapping from omega 1 to omega 2 is obtained;
the first step: firstly, calculating a corresponding tongue picture boundary, and calculating a boundary region by using a Fourier descriptor through a formula (1):
and a second step of: on the basis of obtaining the boundary correspondence, further calculating the mapping of the internal region, wherein the cauchy integral formula is shown in formula (2), and when Γ is a unit circle, the cauchy integral formula becomes formula (3):
wherein z and z 0 Points on the boundary and interior, respectively, f (z 0 ) And F (z) 0 ) Mapping representing boundary and interior regions, respectively:
wherein z is k =e 2jπ(k/K) Represents the form of a dot on a unit circle, phi k Representing the corresponding tongue boundary;
and a third step of: constructing a conformal mapping, and setting functions of the conformal mapping as follows:
F=U+j·V (5)
fourth step: calculating the mapping of the original tongue image to the unit circle:
let the original tongue picture be represented as I 0 [u,v,w]The image of the original image mapped onto the unit circle is expressed as J [ x, y, w]Expressed by the following formula:
J[x,y,w]=I 0 [U(x,y),V(x,y),w] (6)
wherein (u, v) ε Ω 1 (original tongue area), (x, y) ∈d (unit circle), w=1, 2,3 represents RGB three channels;
fifth step: omega determination 1 To omega 2 Mapping of I 0 [u,v,w]The original tongue picture is represented by the image,representing the standard tongue picture after alignment; for arbitrary->Its corresponding pixel position on the unit circle is denoted +.>The mapping from unit circle to standard tongue image is thus expressed as:
the mapping of the original image to the unit disk is calculated by the formula (6), and the formula (6) is substituted into the formula (7) to obtain the following formula:
thus far, the original image I can be obtained by the formula (8) 0 I to standard tongue picture 1 Is mapped to the mapping of (a).
2. The tongue image segmentation and alignment method based on a feature pyramid network as claimed in claim 1, wherein in step 1 the data set is expanded by rotation and horizontal flipping when data enhancement is applied to the pictures.
3. The tongue image segmentation and alignment method based on a feature pyramid network as claimed in claim 1, wherein Labelme software is used in step 2 to manually annotate the tongue body in all picture data.
4. The tongue image segmentation and alignment method based on a feature pyramid network as claimed in claim 1, wherein in step 5 all proposal boxes are fixed to 7×7 features after the max pooling operation.
5. The tongue image segmentation and alignment method based on a feature pyramid network according to claim 1, wherein the feature pyramid network is constructed from the outputs of the middle 4 bottleneck blocks (bnecks) of MobileNetV2.
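The pyramid construction in claim 5 follows the standard FPN top-down pattern: each of the 4 backbone feature maps is projected to a common channel count by a 1×1 lateral convolution, the coarser level is upsampled 2× and added in. A minimal numpy sketch with random lateral weights (the channel counts, spatial sizes, and nearest-neighbour upsampling below are illustrative assumptions, not the patent's exact configuration):

```python
import numpy as np

def fpn_merge(c_maps, d=8, seed=0):
    """Feature-pyramid top-down merge over 4 backbone maps (finest first).
    Each map is (C_i, H_i, W_i) with the spatial size halving per level;
    the 1x1 lateral projections use random weights purely for illustration."""
    rng = np.random.default_rng(seed)
    # A 1x1 conv is per-pixel channel mixing: an einsum over the channel axis.
    laterals = [np.einsum('oc,chw->ohw',
                          rng.standard_normal((d, c.shape[0])), c)
                for c in c_maps]
    p = [None] * len(laterals)
    p[-1] = laterals[-1]                  # coarsest level starts the pathway
    for i in range(len(laterals) - 2, -1, -1):
        up = p[i + 1].repeat(2, axis=1).repeat(2, axis=2)  # nearest 2x upsample
        p[i] = laterals[i] + up[:, :laterals[i].shape[1], :laterals[i].shape[2]]
    return p

# Four backbone outputs with halving resolution, e.g. from bottleneck blocks.
c_maps = [np.ones((c, 32 // 2**i, 32 // 2**i))
          for i, c in enumerate([16, 24, 32, 64])]
p = fpn_merge(c_maps)
```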
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210106423.6A CN114627136B (en) | 2022-01-28 | 2022-01-28 | Tongue image segmentation and alignment method based on feature pyramid network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114627136A CN114627136A (en) | 2022-06-14 |
CN114627136B true CN114627136B (en) | 2024-02-27 |
Family
ID=81899071
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115908464B (en) * | 2023-01-09 | 2023-05-09 | 智慧眼科技股份有限公司 | Tongue image segmentation method and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107316307A (en) * | 2017-06-27 | 2017-11-03 | Beijing University of Technology | Automatic segmentation method for traditional Chinese medicine tongue images based on deep convolutional neural networks |
CN107977671A (en) * | 2017-10-27 | 2018-05-01 | Zhejiang University of Technology | Tongue picture classification method based on multi-task convolutional neural networks |
CN108109160A (en) * | 2017-11-16 | 2018-06-01 | Zhejiang University of Technology | Interaction-free GrabCut tongue body segmentation method based on deep learning |
WO2020029915A1 (en) * | 2018-08-06 | 2020-02-13 | 深圳市前海安测信息技术有限公司 | Artificial intelligence-based device and method for tongue image splitting in traditional chinese medicine, and storage medium |
Non-Patent Citations (1)
Title |
---|
Two-stage convolutional neural network design for tongue body segmentation; Wang Liran, Tang Yiping, Chen Peng, He Xia, Yuan Gongping; Journal of Image and Graphics; 2018-10-16 (10); full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021036616A1 (en) | Medical image processing method, medical image recognition method and device | |
Chan et al. | Texture-map-based branch-collaborative network for oral cancer detection | |
WO2023137914A1 (en) | Image processing method and apparatus, electronic device, and storage medium | |
Choi et al. | Boosting proximal dental caries detection via combination of variational methods and convolutional neural network | |
CN108053417B (en) | lung segmentation device of 3D U-Net network based on mixed rough segmentation characteristics | |
CN105957063B (en) | CT image liver segmentation method and system based on multiple dimensioned weighting similarity measure | |
CN110288597B (en) | Attention mechanism-based wireless capsule endoscope video saliency detection method | |
US20210118144A1 (en) | Image processing method, electronic device, and storage medium | |
WO2020133636A1 (en) | Method and system for intelligent envelope detection and warning in prostate surgery | |
CN107492071A (en) | Medical image processing method and equipment | |
CN108205806B (en) | Automatic analysis method for three-dimensional craniofacial structure of cone beam CT image | |
CN110930416A (en) | MRI image prostate segmentation method based on U-shaped network | |
CN106023151B (en) | Tongue object detection method under a kind of open environment | |
CN108765392B (en) | Digestive tract endoscope lesion detection and identification method based on sliding window | |
CN112750531A (en) | Automatic inspection system, method, equipment and medium for traditional Chinese medicine | |
Liu et al. | Extracting lungs from CT images via deep convolutional neural network based segmentation and two-pass contour refinement | |
Yu et al. | FS-GAN: Fuzzy Self-guided structure retention generative adversarial network for medical image enhancement | |
CN114627136B (en) | Tongue image segmentation and alignment method based on feature pyramid network | |
Malathi et al. | Active contour based segmentation and classification for pleura diseases based on Otsu’s thresholding and support vector machine (SVM) | |
CN113643281B (en) | Tongue image segmentation method | |
CN108765431B (en) | Image segmentation method and application thereof in medical field | |
CN115393239A (en) | Multi-mode fundus image registration and fusion method and system | |
CN112766332A (en) | Medical image detection model training method, medical image detection method and device | |
Tang et al. | DE-Net: dilated encoder network for automated tongue segmentation | |
Nasim et al. | Review on multimodality of different medical image fusion techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||