Improving cervical cancer classification with imbalanced datasets combining taming transformers with T2T-ViT

Chen Zhao¹,
Renjun Shuai ORCID: orcid.org/0000-0003-1342-4075¹,
Li Ma²,
Wenjia Liu³ &
…
Menglin Wu¹


Abstract

Cervical cell classification has important clinical significance in early-stage cervical cancer screening. However, few public cervical cancer smear cell datasets are available, the class weights are imbalanced, the image quality is uneven, and CNN-based classification models tend to overfit. To address these problems, we propose a cervical cell image generation model based on taming transformers (CCG-taming transformers) that provides high-quality cervical cancer datasets with sufficient samples and balanced weights. We improve the encoder structure by introducing an SE-block and a MultiRes-block to improve its ability to extract information from cervical cancer cell images. We introduce Layer Normalization to standardize the data, which facilitates the subsequent non-linear processing of the data by the ReLU activation function in the feed-forward layer. We also introduce SMOTE-Tomek Links to balance the number of samples and weights of the source dataset and of the images we generate. Finally, we use Tokens-to-Token Vision Transformers (T2T-ViT) combined with transfer learning to classify the cervical cancer smear cell image dataset and improve classification performance. Classification experiments using the proposed model are performed on three public cervical cancer datasets: the classification accuracies on the liquid-based cytology Pap smear dataset (4-class), SIPaKMeD (5-class), and Herlev (7-class) are 98.79%, 99.58%, and 99.88%, respectively. The quality of the images we generate on these three datasets is very close to that of the source datasets; the final averaged inception score (IS), Fréchet inception distance (FID), Recall and Precision are 3.75, 0.71, 0.32 and 0.65, respectively.
Our method improves the accuracy of cervical cancer smear cell classification, provides more cervical cell sample images for cervical cancer-related research, and assists gynecologists in judging and diagnosing different types of cervical cancer cells and in analyzing cervical cancer cells at different stages, which are difficult to distinguish. This paper applies the transformer to the generation and recognition of cervical cancer cell images for the first time.


1 Introduction

Cervical cancer is the fourth most common cause of death from cancer in females [48]. An estimated 604,127 cases and 341,831 deaths occurred worldwide in 2020, and it is the second most common cancer in women worldwide [28, 52]. At early stages of cervical cancer, the cure rate is nearly 100% [42], so the prevention, early detection and classification of cervical cancer are essential [64].

At present, cervical cancer screening methods mainly include human papillomavirus detection, cervical smear and acetic acid testing under colposcopy [72]. After the introduction of a Papanicolaou (Pap) smear [40], the standard screening test for cervical cancer and premalignant lesions is cervical cytology. As the most common screening test, cervical cytology has been extensively used and effectively reduces incidence and mortality. At present, manual screening of abnormal cells from a cervical cytology slide is still common practice. However, it is usually tedious, inefficient and expensive. Consequently, automatic screening methods have attracted increasing attention [3, 5]. Additionally, some research on cervical cell analysis shows that each independent cervical cell has intrinsic similarity. For example, superficial and intermediate cells generally have relatively small nuclei and have clear cytoplasmic and nuclear margins, while dyskeratotic and metaplastic cells have overlapping cytoplasmic and nuclear margins. In addition, koilocytotic cells have the presence of a perinuclear cavity, while other cells have a relatively thick cytoplasm [18, 34]. These observations indicate that there exists a potential relationship between cervical cell images. Therefore, accurate cervical cell classification is crucial to the automatic screening method. The analysis of Pap smear images requires low error tolerance and skilled pathologists, and the screening process is expensive and time-consuming. Therefore, an automated classification process can assist gynecologists in diagnosis and provide more objective test explanations.

Recently, deep learning has brought considerable improvements in accuracy in many applications [46]. Due to its high accuracy in many fields, deep learning has become the most advanced machine learning technology. Deep learning and convolutional neural networks (CNNs) have been successfully used in breast cancer detection [7], skin cancer recognition [73], and COVID-19 recognition and analysis [26]. Many of these studies are based on CNNs, which have been the standard for 3D medical image classification and segmentation. The convolutional operations used in these networks, however, inevitably have limitations in modeling long-range dependencies due to their inductive bias of locality and weight sharing [69]. The transformer was introduced to address these limitations. Since the end of 2020, transformer-based research has gradually increased, and some transformer-based methods have surpassed CNN-based methods in image classification, image detection, and image segmentation [33]. Dosovitskiy et al. [11] were the first to propose a vision transformer (ViT) for image classification. Their method does not operate on individual pixels but on small patches of the image. They argue that dependence on CNNs is not necessary and that a pure transformer applied directly to sequences of image patches can perform image classification tasks well.

Carion et al. [4] proposed a new method (DEtection TRansformer or DETR) that views object detection as a direct set prediction problem. Their approach streamlines the detection pipeline, effectively removing the need for many hand-designed components such as a nonmaximum suppression procedure or anchor generation that explicitly encode the prior knowledge about the task. Xie et al. [69] proposed a novel framework that efficiently bridges a convolutional neural network and a transformer (CoTr) for accurate 3D medical image segmentation. Compared to CNN, the self-attention in the transformer can produce a more interpretable model, from which the attention distribution can be checked, and each attention head can learn to perform different tasks. The number of operations required to calculate the association between two positions does not increase with distance.

Research on the classification of cervical cancer cells is mostly carried out on two public datasets. Herlev [27] consists of 917 images of Pap smear cells classified carefully by cytotechnicians and doctors. Each cell is described by 20 numerical features, and the cells fall into 7 classes. SIPaKMeD [43] consists of 4049 annotated cell images. The cells are classified by expert cytopathologists into five different classes. In addition to the public dataset, there is nonpublic dataset research [68]. Currently, the number of samples in the two public datasets is limited, and the classification accuracy of various studies is above 90%, which tends to overfit, and the cell types in various datasets are inconsistent. The weights are not balanced, and the clarity and quality of the samples are uneven. Regarding the nonpublic dataset, cervical cancer researchers cannot obtain the data and can only conduct research on the limited public dataset, and the research progress is limited.

To solve the above problems, this paper proposes a cervical cell image generation model based on taming transformers (CCG-taming transformers) and a classification model based on taming transformers and Tokens-to-Token Vision Transformers (T2T-ViT) for the first time. The method proposed in this paper expands its dataset sample size, balances the number and weight of each type of cervical cell, and generates high-quality sample images. A new dataset (liquid-based cytology Pap smear dataset [24]) is introduced to provide more objective materials for the cervical cancer cell generation model in this paper, and T2T-ViT is used to further improve classification accuracy, provide more objective information for gynecologists, improve the efficiency of clinical-pathological diagnosis, detect the patient’s condition in time, and improve the survival rate of cervical cancer patients. Figure 1 is an overview of the method framework of the model in this paper.

The main contributions and novelty of this model are as follows:

We propose a cervical cancer cell sample image generation model based on taming transformers (CCG-taming transformers). To the best of our knowledge, this is the first model that combines CNNs and a transformer and is applied to cervical cancer research.
In CCG-taming transformers, we adjust the encoder structure of VQGAN in the taming transformers: 1) we introduce new convolutional structures MultiRes-block to better realize the extraction and analysis of the key information of the cervical cancer cell image by the encoder; 2) we introduce SE-block to alleviate the vanishing gradient problem in the neural network; 3) we introduce Layer normalization to enhance the discrimination ability of feature representations. This model mainly solves the problem that there are few public data sets in the current cervical cancer research, the samples of each class are very different, and the weights and quantities of various samples in the data sets are not balanced. The cervical cancer cell images generated by this model can provide certain reference value for cervical cancer research.
We introduce SMOTE-Tomek Links to balance the source data set and the number of samples and weights of the images we generate.
We introduce T2T-ViT combined with transfer learning to classify the cervical cancer cell image dataset, which addresses the problem that CNN-based classification models may lose details of the feature map.

2 Related work

Artificial intelligence and deep learning play an important role in cell classification, medical image classification, generation and analysis [17, 59, 64]. As new technologies develop, they become cost-effective and less time-consuming. They are now more popular than traditional methods (such as Pap smears, colposcopy, and cervicography) [8]. These technologies have nothing to do with human experience. Although they cannot replace gynecologists for pathological evaluation, they can provide assistance for clinical diagnosis to a large extent, improve the diagnostic efficiency of gynecologists, and reduce the subjective components of diagnosis.

Several works in the literature classify and detect cervical cancer. Convolutional neural networks (CNNs) [35] have been proposed to automatically learn multilevel features through a hierarchical deep architecture. Wieslander et al. [63] used ResNet for binary classification of benign and malignant cervical cells. Plissiti et al. [43] proposed the annotated cervical cell image dataset SIPaKMeD and applied a CNN to classify five types of cervical cells. Gautam et al. [14] proposed a patch-based approach using a CNN combined with transfer learning for the segmentation of nuclei in single-cell images. Ghoneim et al. [15] proposed a cervical cancer cell detection and classification system based on CNNs and achieved 91.2% accuracy in the 7-class classification problem. William et al. [65] proposed contrast local adaptive histogram equalization for image enhancement. Cell segmentation was achieved through a trainable Weka segmentation classifier, and a sequential elimination approach was used for debris rejection. Wang et al. [58] proposed a multiscale representation for scene classification, which is realized by a global–local two-stream architecture. They also explored the attention mechanism and proposed a novel end-to-end attention recurrent convolutional network (ARCNet) for scene classification [60]; their research has made outstanding contributions to the classification field. Although CNN-based frameworks have achieved good accuracy in the classification of cervical cancer smear cell images, their training requires more computing resources, and the network needs to reach a certain depth to capture the deep-level information of the sample images. Compared with the transformer, in a CNN-based model the number of operations required to calculate the association between two positions through convolution increases with distance. 
The number of operations required to calculate the association between two positions based on the self-attention in the transformer is independent of the distance.

Generative adversarial networks (GANs) [1] have been successfully used to synthesize human faces, landscapes and even medical images; they are mainly used to expand the size of datasets and balance the weight and number of samples in each category. Pollastri et al. [44] presented a novel strategy that employs DC-GAN to augment data in the skin lesion segmentation task. The proposed framework generates both skin lesion images and their segmentation masks, making the data augmentation process extremely straightforward. Karras et al. [31] proposed an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes and stochastic variation in the generated images, and it enables intuitive, scale-specific control of the synthesis. Thuy et al. [54] proposed combining deep learning, transfer learning and generative adversarial networks to improve classification performance. Fine-tuning on the VGG16 and VGG19 networks was used to extract the well-discriminated cancer features from histopathological images before feeding them into the neural network for classification. Han et al. [20] proposed a 3D multiconditional GAN (MCGAN) to generate realistic/diverse nodules placed naturally on lung computed tomography images to boost sensitivity in 3D object detection. However, they ignore the potential relationships among cervical cell images during feature learning, and thus, may influence the representation ability of CNN features. Besides, GANs usually require large training datasets, which are often scarce in the medical field, and GANs are only applied to medical image synthesis at a relatively low resolution. 
In this paper, we use CCG-taming transformers and SMOTE-Tomek Links to balance the weight and number of samples between different categories. With this method, we can generate more samples of cervical cancer cells in different categories, providing more objective reference material for cervical cancer research.

Amazing results from transformer models on natural language processing have encouraged the vision community to study their application to computer vision problems [19]. Vaswani et al. [57] proposed the transformer network architecture for the first time, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Valanarasu et al. [56] proposed a gated axial attention model that extends the existing architectures by introducing an additional control mechanism in the self-attention module. They also proposed a local-global training strategy (LoGo) to train the model effectively on medical images with significant performance. Zhu et al. [74] proposed deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10× fewer training epochs. He et al. [22] introduced DropBlock as a regularization technique for accurate HSI classification to mitigate the overfitting problem in CNN-based HSI classification.

Chen et al. [6] studied low-level computer vision tasks (e.g., denoising, superresolution and deraining) and developed a new pretrained model, namely, an image processing transformer (IPT). Wang et al. [62] investigated a simple backbone network useful for many dense prediction tasks without convolutions, and they proposed the pyramid vision transformer (PVT), which overcomes the difficulties in applying the transformer to various dense prediction tasks. The transformer-based model has achieved high accuracy in image classification, target detection and other fields. Park et al. [41] proposed a novel vision transformer that uses a low-level CXR feature corpus to extract abnormal CXR features. Wang et al. [61] combined transfer learning with the transformer model to predict the small-dataset Heck reaction. However, the simple tokenization of input images in ViT fails to model the important local structure (e.g., edges, lines) among neighboring pixels, leading to low training sample efficiency, and the redundant attention backbone design of ViT leads to limited feature richness under fixed computational budgets and limited training samples. To solve these problems, we adopt T2T-ViT, which introduces 1) a layer-wise Tokens-to-Token (T2T) transformation that progressively structurizes the image into tokens by recursively aggregating neighboring tokens into one token, such that the local structure represented by surrounding tokens can be modeled and the token length can be reduced; and 2) an efficient backbone with a deep-narrow structure for vision transformers, motivated by CNN architecture design after extensive study.

3 Method

In this paper, we propose a cervical cell image generation model based on taming transformers (CCG-taming transformers) and a classification model based on Tokens-to-Token Vision Transformers (T2T-ViT) with transfer learning. The overall framework of the proposed model is illustrated in Fig. 2.

First, we used ImageNet to train T2T-ViT, preprocessed the cervical cancer Pap cell smear dataset, used CCG-taming transformers to expand the dataset, and then balanced the weights and numbers of different classes of images. After that, the newly obtained dataset was passed to the T2T-ViT that was pretrained by ImageNet and completed the transfer learning for classification, and finally, the result of the classification was obtained. Next, we elaborate on each part of the model in Fig. 2.

3.1 Taming transformers

Taming transformers [13] are used to generate a variety of high-quality images. In contrast to CNNs, transformers contain no inductive bias that prioritizes local interactions. The method uses a convolutional VQGAN to learn a codebook of context-rich visual parts, whose composition is subsequently modeled with an autoregressive transformer architecture. A discrete codebook provides the interface between these architectures, and a patch-based discriminator enables strong compression while retaining high perceptual quality. This method brings the efficiency of convolutional approaches to transformer-based high-resolution image synthesis.

3.2 CCG-taming transformers

An image x ∈ ℝ^{H × W × 3} is represented by a spatial collection of codebook entries $ {z}_{\mathrm{q}}\in {\mathbb{R}}^{h\times w\times {n}_z} $, where n_z is the dimensionality of the codes. We first learn a convolutional model composed of an encoder E and a decoder G, which together learn to represent images with codes from a learned, discrete codebook $ Z={\left\{{z}_k\right\}}_{k=1}^K\subset {\mathbb{R}}^{n_z} $. The image x passes through the encoder E, and we obtain z_q using the encoding $ \hat{z}=E(x)\in {\mathbb{R}}^{h\times w\times {n}_z} $ and a subsequent elementwise quantization q(⋅) of each spatial code $ {\hat{z}}_{ij}\in {\mathbb{R}}^{n_z} $ onto its closest codebook entry z_k:

$$ {z}_{\mathrm{q}}=\mathrm{q}\left(\hat{z}\right)=\left(\underset{z_k\in Z}{argmin}\left\Vert {\hat{z}}_{ij}-{z}_k\right\Vert \right)\in {\mathbb{R}}^{h\times w\times {n}_z} $$

(1)

$$ \hat{x}=G\left({z}_{\mathrm{q}}\right)=G\left(\mathrm{q}\left(E(x)\right)\right) $$

(2)

$$ {\mathrm{L}}_{\mathrm{VQ}}\left(E,G,Z\right)={\left\Vert x-\hat{x}\right\Vert}_2^2+{\left\Vert \mathrm{sg}\left[E(x)\right]-{z}_{\mathrm{q}}\right\Vert}_2^2+\beta {\left\Vert \mathrm{sg}\left[{z}_{\mathrm{q}}\right]-E(x)\right\Vert}_2^2 $$

(3)

Where $ {\left\Vert x-\hat{x}\right\Vert}_2^2 $ is a reconstruction loss, sg[⋅] denotes the stop-gradient operation, and $ {\left\Vert \mathrm{sg}\left[{z}_{\mathrm{q}}\right]-E(x)\right\Vert}_2^2 $ is the so-called “commitment loss” with weighting factor β.
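The nearest-entry quantization of Eq. (1) and the two latent terms of Eq. (3) can be illustrated with a small, dependency-free sketch. It is not the implementation used in this paper: the function names, the toy codebook and the value β = 0.25 are assumptions chosen for illustration, and only forward values are computed (an autodiff framework would be needed to realize the stop-gradient during training).

```python
# Dependency-free sketch; function names and beta=0.25 are illustrative assumptions.

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantize(z_hat, codebook):
    """Eq. (1): snap every spatial code in z_hat (h x w x n_z) to its nearest entry.

    Returns the quantized codes z_q and the indices of the chosen entries.
    """
    z_q, indices = [], []
    for row in z_hat:
        nearest = [min(range(len(codebook)), key=lambda k: sq_dist(code, codebook[k]))
                   for code in row]
        indices.append(nearest)
        z_q.append([list(codebook[k]) for k in nearest])
    return z_q, indices


def latent_loss_value(z_hat, z_q, beta=0.25):
    """Forward value of the last two terms of Eq. (3).

    The stop-gradient sg[.] only decides which parameters receive gradients,
    so both terms share the same numerical value in the forward pass.
    """
    total = sum(sq_dist(a, b)
                for row_a, row_b in zip(z_hat, z_q)
                for a, b in zip(row_a, row_b))
    return total + beta * total


codebook = [[0.0, 0.0], [1.0, 1.0]]      # K = 2 entries, n_z = 2
z_hat = [[[0.1, -0.1], [0.9, 1.2]]]      # encoder output with h = 1, w = 2
z_q, indices = quantize(z_hat, codebook)
loss = latent_loss_value(z_hat, z_q)
```

The index grid returned by `quantize` is enough to recover `z_q` by looking each index up in the codebook again, which is what makes a sequence of indices a complete description of the quantized image.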

VQGAN uses a discriminator and a perceptual loss to maintain good perceptual quality at an increased compression rate. A patch-based discriminator D is trained to distinguish between real and reconstructed images.

$$ {\mathrm{L}}_{\mathrm{GAN}}\left(\left\{E,G,Z\right\},D\right)=\left[\log D(x)+\log \left(1-D\left(\hat{x}\right)\right)\right] $$

(4)

The complete objective for finding the optimal compression model Q^∗ = {E^∗, G^∗, Z^∗} then reads:

$$ {\mathrm{Q}}^{\ast }=\underset{E,G,Z}{\arg \min}\ \underset{D}{\max}\ {\mathbb{E}}_{x\sim p(x)}\left[{\mathrm{L}}_{\mathrm{VQ}}\left(E,G,Z\right)+\lambda {\mathrm{L}}_{\mathrm{GAN}}\left(\left\{E,G,Z\right\},D\right)\right] $$

(5)

Where we compute the adaptive weight λ according to:

$$ \lambda =\frac{\nabla_{G_L}\left[{\mathrm{L}}_{\mathrm{rec}}\right]}{\nabla_{G_L}\left[{\mathrm{L}}_{\mathrm{GAN}}\right]+\delta } $$

(6)

Where L_rec is the perceptual reconstruction loss [71], $ {\nabla}_{G_L}\left[\cdot \right] $ denotes the gradient of its input w.r.t. the last layer L of the decoder, and δ = 10⁻⁶ is used for numerical stability. With E and G available, the cervical cell images can be represented in terms of the codebook indices of their encodings. The quantized encoding of an image x is given by $ {z}_{\mathrm{q}}=\mathrm{q}\left(E(x)\right)\in {\mathbb{R}}^{h\times w\times {n}_z} $ and is equivalent to a sequence s ∈ {0, …, |Z| − 1}^{h × w} of indices from the codebook, which is obtained by replacing each code by its index in the codebook Z:

$$ {s}_{ij}=k\ \text{ such that }\ {\left({z}_{\mathrm{q}}\right)}_{ij}={z}_k $$

(7)

By mapping indices of a sequence s back to their corresponding codebook entries, $ {z}_{\mathrm{q}}=\left({z}_{s_{ij}}\right) $ is readily recovered and decoded to an image $ \hat{x}=G\left({z}_{\mathrm{q}}\right) $. Thus, after choosing some ordering of the indices in s, image generation can be formulated as autoregressive next-index prediction: given indices s_<i, the transformer learns to predict the distribution of possible next indices, i.e., p(s_i|s_<i) to compute the likelihood of the full representation as $ p(s)=\prod \limits_ip\left({s}_i\left|{s}_{<i}\right.\right) $. This allows us to directly maximize the log-likelihood of the data representations (As shown in equation (8)), and Fig. 3 is the overview of the framework for the CCG-taming transformers.

$$ {\mathrm{L}}_{\mathrm{Transformer}}={\mathbb{E}}_{x\sim p(x)}\left[-\log p(s)\right] $$

(8)
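As a worked illustration of Eq. (8) for a single image, the negative log-likelihood of an index sequence is the sum of the per-position terms −log p(s_i | s_<i). The sketch below assumes the per-position distributions have already been produced by some autoregressive model; the function name and the toy probabilities are hypothetical.

```python
import math


def sequence_nll(step_probs, indices):
    """Return -log p(s) = -sum_i log p(s_i | s_<i) for one index sequence.

    step_probs[i] is the predicted distribution over codebook indices at
    position i (assumed to be already conditioned on s_<i by the model);
    indices[i] is the observed index s_i.
    """
    return -sum(math.log(probs[s_i]) for probs, s_i in zip(step_probs, indices))


# Toy codebook with |Z| = 2 and a sequence of length 2.
step_probs = [[0.5, 0.5], [0.25, 0.75]]
indices = [0, 1]
nll = sequence_nll(step_probs, indices)   # equals -log(0.5 * 0.75)
```

Averaging this quantity over training images gives an empirical estimate of the expectation in Eq. (8).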

3.2.1 SE-block and MultiRes-block

Figure 4 shows the structure of SE-block and MultiRes-block. We add the SE (squeeze-and-excitation) block [29] in encoder E of VQGAN, which can adaptively recalibrate the characteristic response of each channel by explicitly modeling the interdependence between channels. The SE-block has excellent generalization ability in different datasets and brings significant performance improvement to the convolutional neural network. To improve the image information learning ability of VQGAN for cervical cancer Pap cell smears, this paper introduces the convolutional block attention module (CBAM) [67] to enhance VQGAN’s attention to cells and nuclei.

In Fig. 4, the dimension information in the excitation block represents the output of the layer. First, the global average-pooling layer in the squeeze block is used for compression, and a bottleneck structure is formed by two fully connected layers to construct the correlation between channels and export the same weight as the input feature. After that, the feature dimension is reduced to 1/16 of the input, and then the initial feature dimension is restored through the full connection layer after the activation of ReLU. Then, a weight between 0 and 1 is obtained through the sigmoid activation function, and the weight is weighted to the features of different feature channels through the scale operation. Finally, the feature of the residual block is redefined before the addition. The structure of the decoder is the same as the structure of the encoder, the difference is that the spatial structure is opposite to the encoder structure, and down-sampling becomes up-sampling. The specific structure is shown in Fig. 5.
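The squeeze, excitation and scale steps described above can be sketched as follows. This is a minimal illustration rather than the encoder used in this paper: biases and the residual addition are omitted, the weights are passed in explicitly, and the toy example uses a reduction ratio of 2 instead of 16 so that it stays readable. All names are assumptions.

```python
import math


def se_block(feature_maps, w_reduce, w_expand):
    """Squeeze-and-excitation on C channels, each given as an H x W nested list.

    w_reduce has shape (C // r) x C and w_expand has shape C x (C // r),
    where r is the reduction ratio. Returns the rescaled maps and the gates.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_maps]
    # Excitation: FC (reduce) -> ReLU -> FC (restore) -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(w_row, squeezed)))
              for w_row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(w_row, hidden))))
             for w_row in w_expand]
    # Scale: multiply every channel by its gate in (0, 1).
    scaled = [[[v * g for v in row] for row in ch]
              for ch, g in zip(feature_maps, gates)]
    return scaled, gates


feature_maps = [[[1.0, 3.0]], [[2.0, 2.0]]]   # C = 2 channels of size 1 x 2
w_reduce = [[0.5, 0.5]]                        # 2 -> 1 (reduction ratio 2)
w_expand = [[0.0], [1.0]]                      # 1 -> 2
scaled, gates = se_block(feature_maps, w_reduce, w_expand)
```

In this toy case the first gate is exactly 0.5, so the first channel is halved, while the second channel is scaled by sigmoid(2).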

The introduction of the SE-block makes the classification model of cervical cancer cell images more nonlinear, which can better fit the complex correlation between channels, greatly reduce the number of parameters and calculations, and solve the problem of gradient disappearance near the input layer on the trunk, which makes the model easy to optimize and increases the generalization ability of the classification model.

We improved the convolution structure in the encoder and introduced the MultiRes-block [25], which was originally used in the image segmentation model decoder to extract image information. The MultiRes-block is introduced to improve the efficiency of extracting cervical cancer cell images and optimize the structure of the decoder to replace the redundant structure of multilayer convolution and superposition in the original decoder.

We also add a residual connection because of their efficacy in medical image information extraction and for the introduction of the 1 × 1 convolutional layers, which may allow us to comprehend some additional spatial information. We call this arrangement a MultiRes-block, as shown in Fig. 5.

Figure 5 illustrates the MultiRes-block, where we gradually increased the number of filters in the three successive layers and added a residual connection (along with a 1 × 1 filter for conserving dimensions). Res paths of MultiRes-blocks introduce some additional processing to make the two feature maps more homogeneous.

3.2.2 Layer normalization

In layer normalization (LN), the inputs to the neurons in the same layer are normalized with the same mean and variance, while different input samples have different means and variances. LN therefore does not depend on the batch size or on the depth of the input sequence, and it is applied to the representation output by the attention layer. Adding Layer Normalization standardizes the data, which facilitates the subsequent non-linear processing of the data by the ReLU activation function in the feed-forward layer: normalizing the data into the active region of ReLU makes the activation function work better.

LN normalizes the inputs of all neurons in a given layer of the deep network according to the following formula.

$$ {\sigma}^l=\sqrt{\frac{1}{H}\sum \limits_{i=1}^H{\left({a}_i^l-\frac{1}{H}\sum \limits_{i=1}^H{a}_i^l\right)}^2} $$

(9)

Here, H denotes the number of hidden units in a layer, and $ {a}_i^l $ is the i^th summed input to the neurons in the l^th layer. All the hidden units in a layer share the same normalization term σ.
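A minimal sketch of this normalization for one layer and one input sample is given below. It computes the layer mean and the σ of Eq. (9) and then standardizes the summed inputs; the small constant `eps` and the omission of the learnable gain and bias are simplifying assumptions, and the function name is illustrative.

```python
import math


def layer_norm(a, eps=1e-5):
    """Normalize the H summed inputs of one layer for a single sample.

    Returns (normalized inputs, mean, sigma), with sigma as in Eq. (9).
    eps is an assumed constant that guards against division by zero.
    """
    H = len(a)
    mu = sum(a) / H
    sigma = math.sqrt(sum((a_i - mu) ** 2 for a_i in a) / H)
    return [(a_i - mu) / (sigma + eps) for a_i in a], mu, sigma


normalized, mu, sigma = layer_norm([1.0, 2.0, 3.0, 4.0])
```

Because the statistics are computed over the hidden units of a single sample, the result is the same whether the sample is processed alone or inside a batch.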

3.2.3 SMOTE-Tomek links

To balance the number of samples and weights of the source dataset and of the images we generate, we used the Synthetic Minority Oversampling Technique (SMOTE) and Tomek Links undersampling in a pipelined approach [50]. Random oversampling increases the minority samples by simply copying them, which easily causes overfitting: the information learned by the model is too specific and not general enough. The basic idea of the SMOTE algorithm is instead to analyze the minority samples, artificially synthesize new samples from them, and add these to the dataset. A parameter gives the percentage of oversampling, and its value determines the number of synthetic samples to be created. For each minority instance, k nearest neighbors belonging to the same class are selected, with k given by:

$$ k=\frac{\mathrm{SMOTE}\,\%}{100} $$

(10)

In this case, the minority class is oversampled with the applied ‘sampling-strategy’ parameter represented as ‘k’ (k = 0.5) i.e. keeping the sampling strategy parameter as 0.5 increases the number of minority class examples by 50%.

However, SMOTE selects neighbors somewhat blindly. Therefore, we use the Tomek Links method for undersampling to clean up noisy samples. Tomek Links finds pairs of neighboring samples from opposite classes and deletes the majority-class sample in each pair. After such processing, the dividing line between the classes becomes clearer, making the minority class more distinct.

A Tomek link is a pair of samples x and y from two different classes such that, for any other sample z:

$$ d\left(x,y\right)<d\left(x,z\right)\ \text{ and }\ d\left(x,y\right)<d\left(y,z\right) $$

(11)

In the pipelined approach, the minority class is first oversampled using SMOTE, and majority-class samples that form Tomek links are then removed.
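The two stages of this pipeline can be illustrated on toy feature vectors. The sketch below is a simplified, dependency-free illustration and not the procedure used in the experiments: `smote_sample` interpolates between a minority sample and one same-class neighbor, and `tomek_links` finds pairs that satisfy Eq. (11), i.e. mutual nearest neighbors with different labels. All function names and data are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def smote_sample(x, neighbor, rng):
    """Synthesize a point on the segment between x and a same-class neighbor."""
    u = rng.random()
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]


def nearest(i, points):
    """Index of the sample closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: dist(points[i], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and labels[i] != labels[j] and nearest(j, points) == i:
            links.append((i, j))
    return links


points = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.05, 0.0], [3.0, 0.0]]
labels = [0, 0, 0, 1, 1]                  # class 0 is the majority here
links = tomek_links(points, labels)       # only samples 2 and 3 form a link
majority_to_drop = [i if labels[i] == 0 else j for i, j in links]
synthetic = smote_sample([0.0, 0.0], [1.0, 2.0], random.Random(0))
```

In practice the same steps would be applied to image feature vectors, with oversampling run first and the Tomek-link cleaning run on the enlarged dataset.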

3.3 Tokens-to-token vision transformers (T2T-ViT)

Tokens-to-token vision transformers (T2T-ViT) [70] can progressively tokenize an image to tokens and have an efficient backbone. T2T-ViT consists of two main components (Fig. 6): the tokens-to-token module (T2T module) is used to model the local structure information of images and reduce the token length progressively; the T2T-ViT backbone is used to draw the global attention relations on the tokens from the T2T module.

3.3.1 T2T-ViT structure

Each T2T process has two steps: restructurization and soft splitting. The T2T transformer is shown in Fig. 7:

The restructurization is shown in Fig. 7. Given a sequence of tokens T from the last layer, T is transformed by the self-attention block:

$$ {T}^{\prime }=\mathrm{MLP}\left(\mathrm{MSA}(T)\right) $$

(12)

Where MSA denotes the multihead self-attention operation with layer normalization and MLP is the multilayer perceptron with layer normalization in the standard transformer [11]. Then, the tokens are reshaped as images on the spatial dimension.

$$ I=\mathrm{Reshape}\left({T}^{\prime}\right) $$

(13)

Where “Reshape” reorganizes tokens T^′ ∈ ℝ^{l × c} to I ∈ ℝ^{h × w × c}, where l is the length of T^′, h, w, and c are the height, width and channel, respectively, and l = h × w.

As shown in Fig. 7, after obtaining the restructurized image I, a soft split is applied to model the local structure information and reduce the token length. Specifically, to avoid information loss in generating tokens from the restructured image, we split the cervical cell image into patches with overlap. As such, each patch is correlated with surrounding patches to establish prior knowledge that there should be stronger correlations between surrounding tokens. The tokens in each split patch are concatenated as one token (tokens-to-token, Fig. 7); thus, the local information can be aggregated from surrounding pixels and patches. When conducting the soft split, the size of each patch is k × k with s overlapping and p padding on the image, where k − s is similar to the stride in the convolution operation. Therefore, for the reconstructed image I ∈ ℝ^{h × w × c}, the length of output tokens T_O after soft splitting is

$$ {l}_O=\left\lfloor \frac{h+2p-k}{k-s}+1\right\rfloor \times \left\lfloor \frac{w+2p-k}{k-s}+1\right\rfloor $$

(14)

Each split patch has size k × k × c. We flatten all patches along the spatial dimensions into tokens $ {T}_O\in {\mathbb{R}}^{l_O\times c{k}^2} $. After the soft split, the output tokens are fed to the next T2T process.
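The soft split can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: the function name `soft_split`, the use of zero padding, and the order in which pixel values are laid out inside a token are all assumptions.

```python
def soft_split(image, k, s, p):
    """Split an h x w x c image into overlapping k x k patches.

    s is the overlap between neighboring patches, so the stride is k - s.
    p is the padding on every side (assumed here to be zero padding).
    Each patch is flattened into one token of length c * k * k.
    """
    h, w, c = len(image), len(image[0]), len(image[0][0])
    stride = k - s
    zero = [0] * c

    def pixel(r, col):
        # Positions outside the original image read as zero padding.
        if 0 <= r < h and 0 <= col < w:
            return image[r][col]
        return zero

    # Number of patch positions per dimension, as in Eq. 14.
    out_h = (h + 2 * p - k) // stride + 1
    out_w = (w + 2 * p - k) // stride + 1

    tokens = []
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride - p, j * stride - p
            token = []
            for dr in range(k):
                for dc in range(k):
                    token.extend(pixel(top + dr, left + dc))
            tokens.append(token)
    return tokens


# A 4 x 4 single-channel image whose pixel values are 0..15.
image = [[[r * 4 + col] for col in range(4)] for r in range(4)]
tokens = soft_split(image, k=3, s=1, p=1)
```

With k = 3, s = 1, and p = 1 on this 4 × 4 image, Eq. 14 gives 2 × 2 = 4 tokens, each of length c·k² = 9, and neighboring tokens share one column or row of pixels.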

3.3.2 T2T-ViT backbone and transfer learning

The T2T-ViT has two parts: the tokens-to-token (T2T) module and the T2T-ViT backbone (Fig. 6). There are various possible design choices for the T2T module. Here, we set n = 2 as shown in Fig. 8, which means there are n + 1 = 3 soft splits and n = 2 restructurizations in the T2T module. The patch sizes for the three soft splits are P = [7, 3, 3], and the overlaps are S = [3, 1, 1], which reduces the size of the input image from 224 × 224 to 14 × 14 according to Eq. 14.
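The reduction from 224 × 224 to 14 × 14 can be checked by applying Eq. 14 once per soft split. The sketch below uses P and S from the text; the padding values `[2, 1, 1]` are an assumption, since the padding is not stated here, and the helper name `soft_split_side` is hypothetical.

```python
def soft_split_side(size, k, s, p):
    """Output side length from Eq. 14 for one spatial dimension."""
    return (size + 2 * p - k) // (k - s) + 1


patch_sizes = [7, 3, 3]  # P, from the text
overlaps = [3, 1, 1]     # S, from the text
paddings = [2, 1, 1]     # assumed: not given in the text

side = 224
sides = []
for k, s, p in zip(patch_sizes, overlaps, paddings):
    side = soft_split_side(side, k, s, p)
    sides.append(side)

token_length = side * side  # number of tokens passed to the backbone
```

Under these assumed paddings the side length goes 224 → 56 → 28 → 14, so the backbone receives 14 × 14 = 196 tokens.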

By conducting the above restructurization and soft splitting iteratively, the T2T module can progressively reduce the token length and transform the spatial structure of the image. The iterative process in the T2T module can be formulated as:

$$ {T}_i^{\prime }=\mathrm{MLP}\left(\mathrm{MSA}\left({T}_i\right)\right) $$

(15)

$$ {I}_i=\mathrm{Reshape}\left({T}_i^{\prime}\right) $$

(16)

$$ {T}_{i+1}=\mathrm{SS}\left({I}_i\right),i=1\dots \left(n-1\right) $$

(17)

For the input cervical cell image I₀, we first apply a soft split to convert the image into tokens: T₁ = SS(I₀). After the final iteration, the output tokens T_f of the T2T module have a fixed length; thus, the backbone of T2T-ViT can model the global relations on T_f. In this paper, we use T2T-ViT-ResNeXt as the backbone, the variant with the highest classification accuracy in the original T2T-ViT.

The mathematical formulation of a classifier with a training algorithm is shown below:

4 Experiment

4.1 Dataset

To evaluate the classification performance of our method, three public cervical cell image datasets are used in this paper:

The liquid-based cytology Pap smear dataset [24] consists of a total of 963 LBC images subdivided into four sets representing the four classes: NILM, LSIL, HSIL, and SCC. It comprises precancerous and cancerous lesions related to cervical cancer as per standards under the Bethesda System (TBS).

SIPaKMeD [43] consists of 4049 images of isolated cells from 966 cluster cell images of Pap smear slides. The cells belong to five categories: (1) dyskeratotic, (2) koilocytotic, (3) metaplastic, (4) parabasal and (5) superficial-intermediate. Classes 1–2 represent abnormal cervical cells, classes 4–5 represent normal cervical cells, and class 3 represents benign cells.

The Herlev dataset [27] consists of 917 isolated single-cell images; that is, each image contains one cervical cell. There are seven classes in total: (1) superficial squamous epithelia, (2) intermediate squamous epithelia, (3) columnar epithelia, (4) mild squamous nonkeratinizing dysplasia, (5) moderate squamous nonkeratinizing dysplasia, (6) severe squamous nonkeratinizing dysplasia and (7) squamous cell carcinoma in situ. Classes 1–3 are normal cervical cells, whereas classes 4–7 are abnormal cervical cells. Tables 1, 2, and 3 list the class distributions of the three cervical cancer datasets: liquid-based cytology Pap smear, SIPaKMeD, and Herlev.

Table 1 Data distribution of the liquid-based cytology Pap smear dataset
