1 Introduction

Cervical cancer is the fourth most common cause of cancer death in women [48]. An estimated 604,127 cases and 341,831 deaths occurred worldwide in 2020, and it is the second most common cancer in women worldwide [28, 52]. When cervical cancer is detected at an early stage, the cure rate is nearly 100% [42], so prevention, early detection and classification of cervical cancer are essential [64].

At present, cervical cancer screening methods mainly include human papillomavirus detection, cervical smear and acetic acid testing under colposcopy [72]. After the introduction of a Papanicolaou (Pap) smear [40], the standard screening test for cervical cancer and premalignant lesions is cervical cytology. As the most common screening test, cervical cytology has been extensively used and effectively reduces incidence and mortality. At present, manual screening of abnormal cells from a cervical cytology slide is still common practice. However, it is usually tedious, inefficient and expensive. Consequently, automatic screening methods have attracted increasing attention [3, 5]. Additionally, some research on cervical cell analysis shows that each independent cervical cell has intrinsic similarity. For example, superficial and intermediate cells generally have relatively small nuclei and have clear cytoplasmic and nuclear margins, while dyskeratotic and metaplastic cells have overlapping cytoplasmic and nuclear margins. In addition, koilocytotic cells have the presence of a perinuclear cavity, while other cells have a relatively thick cytoplasm [18, 34]. These observations indicate that there exists a potential relationship between cervical cell images. Therefore, accurate cervical cell classification is crucial to the automatic screening method. The analysis of Pap smear images requires low error tolerance and skilled pathologists, and the screening process is expensive and time-consuming. Therefore, an automated classification process can assist gynecologists in diagnosis and provide more objective test explanations.

Recently, deep learning has brought considerable improvements in accuracy in many applications [46] and has become a state-of-the-art machine learning technology. Deep learning has been successfully used in breast cancer detection [7], skin cancer recognition [73], and COVID-19 recognition and analysis [26]. Many of these studies are based on convolutional neural networks (CNNs), which have been the standard for 3D medical image classification and segmentation. The convolutional operations used in these networks, however, inevitably have limitations in modeling long-range dependencies due to their inductive bias of locality and weight sharing [69]. The transformer was introduced to address these limitations. Since the end of 2020, transformer-based research has gradually increased, and some transformer-based methods have surpassed CNN-based methods in image classification, object detection, and image segmentation [33]. Dosovitskiy et al. [11] first proposed the vision transformer (ViT) for image classification. Their method operates on small image patches rather than on individual pixels. They argue that reliance on CNNs is not necessary and that a pure transformer applied directly to sequences of image patches can perform image classification tasks well.

Carion et al. [4] proposed a new method (DEtection TRansformer or DETR) that views object detection as a direct set prediction problem. Their approach streamlines the detection pipeline, effectively removing the need for many hand-designed components such as a nonmaximum suppression procedure or anchor generation that explicitly encode the prior knowledge about the task. Xie et al. [69] proposed a novel framework that efficiently bridges a convolutional neural network and a transformer (CoTr) for accurate 3D medical image segmentation. Compared to CNN, the self-attention in the transformer can produce a more interpretable model, from which the attention distribution can be checked, and each attention head can learn to perform different tasks. The number of operations required to calculate the association between two positions does not increase with distance.

Research on the classification of cervical cancer cells is mostly carried out on two public datasets. Herlev [27] consists of 917 images of Pap smear cells classified carefully by cytotechnicians and doctors. Each cell is described by 20 numerical features, and the cells fall into 7 classes. SIPaKMeD [43] consists of 4049 annotated cell images. The cells are classified by expert cytopathologists into five different classes. In addition to the public datasets, some research uses nonpublic datasets [68]. Currently, the number of samples in the two public datasets is limited; although the classification accuracy reported by various studies is above 90%, models trained on so few samples tend to overfit. Moreover, the cell types differ across datasets, the classes are not balanced, and the clarity and quality of the samples are uneven. Because nonpublic datasets are not available to other cervical cancer researchers, research can only be conducted on the limited public datasets, which limits progress.

To solve the above problems, this paper proposes a cervical cell image generation model based on taming transformers (CCG-taming transformers) and a classification model based on taming transformers and Tokens-to-Token Vision Transformers (T2T-ViT) for the first time. The method proposed in this paper expands its dataset sample size, balances the number and weight of each type of cervical cell, and generates high-quality sample images. A new dataset (liquid-based cytology Pap smear dataset [24]) is introduced to provide more objective materials for the cervical cancer cell generation model in this paper, and T2T-ViT is used to further improve classification accuracy, provide more objective information for gynecologists, improve the efficiency of clinical-pathological diagnosis, detect the patient’s condition in time, and improve the survival rate of cervical cancer patients. Figure 1 is an overview of the method framework of the model in this paper.

Fig. 1 Flowchart of the generation and classification of cervical cancer cell images in this paper

The main contributions and novelty of this model are as follows:

  • We propose a cervical cancer cell sample image generation model based on taming transformers (CCG-taming transformers). To the best of our knowledge, this is the first model that combines CNNs and transformers and is applied to cervical cancer research.

  • In CCG-taming transformers, we adjust the encoder structure of VQGAN in the taming transformers: 1) we introduce new convolutional structures MultiRes-block to better realize the extraction and analysis of the key information of the cervical cancer cell image by the encoder; 2) we introduce SE-block to alleviate the vanishing gradient problem in the neural network; 3) we introduce Layer normalization to enhance the discrimination ability of feature representations. This model mainly solves the problem that there are few public data sets in the current cervical cancer research, the samples of each class are very different, and the weights and quantities of various samples in the data sets are not balanced. The cervical cancer cell images generated by this model can provide certain reference value for cervical cancer research.

  • We introduce SMOTE-Tomek Links to balance the source data set and the number of samples and weights of the images we generate.

  • We introduce T2T-ViT combined with transfer learning to classify the cervical cancer cell image dataset, which can solve the problem that classification models based on CNNs may lose the details of the feature map.

2 Related work

Artificial intelligence and deep learning play an important role in cell classification and in medical image classification, generation and analysis [17, 59, 64]. As these technologies develop, they become more cost-effective and less time-consuming, and they are now more popular than traditional methods (such as Pap smears, colposcopy, and cervicography) [8]. These technologies do not depend on human experience. Although they cannot replace gynecologists for pathological evaluation, they can substantially assist clinical diagnosis, improve the diagnostic efficiency of gynecologists, and reduce the subjective component of diagnosis.

Several works in the literature classify and detect cervical cancer. Convolutional neural networks (CNNs) [35] have been proposed to automatically learn multilevel features through a hierarchical deep architecture. Wieslander et al. [63] used ResNet for binary classification of benign and malignant cervical cells. Plissiti et al. [43] proposed the annotated cervical cell image dataset SIPaKMeD and applied a CNN to classify five types of cervical cells. Gautam et al. [14] proposed a patch-based approach using a CNN combined with transfer learning for the segmentation of nuclei in single-cell images. Ghoneim et al. [15] proposed a CNN-based cervical cancer cell detection and classification system and achieved 91.2% accuracy on the 7-class classification problem. William et al. [65] proposed contrast local adaptive histogram equalization for image enhancement; cell segmentation was achieved through a trainable Weka segmentation classifier, and a sequential elimination approach was used for debris rejection. Wang et al. [58] proposed a multiscale representation for scene classification, realized by a global–local two-stream architecture. They also explored the attention mechanism and proposed a novel end-to-end attention recurrent convolutional network (ARCNet) for scene classification [60]; their research has made outstanding contributions to the classification field. Although CNN-based frameworks have achieved good accuracy in the classification of cervical cancer smear cell images, training requires considerable computing resources, and the network must reach a certain depth to capture the deep-level information of the sample images. In a CNN-based model, the number of operations required to relate two positions through convolution increases with the distance between them, whereas with the self-attention in the transformer it is independent of the distance.

Generative adversarial networks (GANs) [1] have been successfully used to synthesize human faces, landscapes and even medical images; they are mainly used to expand the size of datasets and balance the weight and number of samples in each category. Pollastri et al. [44] presented a novel strategy that employs DC-GAN to augment data in the skin lesion segmentation task. The proposed framework generates both skin lesion images and their segmentation masks, making the data augmentation process extremely straightforward. Karras et al. [31] proposed an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes and stochastic variation in the generated images, and it enables intuitive, scale-specific control of the synthesis. Thuy et al. [54] proposed combining deep learning, transfer learning and generative adversarial networks to improve classification performance. Fine-tuning on the VGG16 and VGG19 networks was used to extract the well-discriminated cancer features from histopathological images before feeding them into the neural network for classification. Han et al. [20] proposed a 3D multiconditional GAN (MCGAN) to generate realistic/diverse nodules placed naturally on lung computed tomography images to boost sensitivity in 3D object detection. However, they ignore the potential relationships among cervical cell images during feature learning, and thus, may influence the representation ability of CNN features. Besides, GANs usually require large training datasets, which are often scarce in the medical field, and GANs are only applied to medical image synthesis at a relatively low resolution. 
In this paper, we use CCG-taming transformers and SMOTE-Tomek Links to balance the weight and number of samples between different categories. With this method, we can generate more samples of cervical cancer cells in each category, providing more objective reference material for cervical cancer research.

Amazing results from transformer models on natural language processing have encouraged the vision community to study their application to computer vision problems [19]. Vaswani et al. [57] first proposed the transformer network architecture, which is based solely on attention mechanisms and dispenses with recurrence and convolutions entirely. Valanarasu et al. [56] proposed a gated axial attention model that extends existing architectures by introducing an additional control mechanism in the self-attention module; they also proposed a local-global training strategy (LoGo) to train the model effectively on medical images. Zhu et al. [74] proposed deformable DETR, whose attention modules attend only to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10× fewer training epochs. He et al. [22] introduced DropBlock as a regularization technique to mitigate the overfitting problem in CNN-based hyperspectral image (HSI) classification.

Chen et al. [6] studied low-level computer vision tasks (e.g., denoising, super-resolution and deraining) and developed a new pretrained model, the image processing transformer (IPT). Wang et al. [62] investigated a simple convolution-free backbone network useful for many dense prediction tasks and proposed the pyramid vision transformer (PVT), which overcomes the difficulties of applying the transformer to various dense prediction tasks. Transformer-based models have achieved high accuracy in image classification, object detection and other fields. Park et al. [41] proposed a novel vision transformer that uses a low-level CXR feature corpus to extract abnormal CXR features. Wang et al. [61] combined transfer learning with the transformer model to predict the Heck reaction from a small dataset. However, the simple tokenization of input images in ViT fails to model the important local structure (e.g., edges, lines) among neighboring pixels, leading to low training sample efficiency, and the redundant attention backbone design of ViT leads to limited feature richness under fixed computational budgets and limited training samples. To solve these problems, we adopt T2T-ViT, which introduces 1) a layer-wise Tokens-to-Token (T2T) transformation that progressively structurizes the image into tokens by recursively aggregating neighboring tokens into one token, such that the local structure represented by surrounding tokens can be modeled and the token length can be reduced; and 2) an efficient backbone with a deep-narrow structure for vision transformers, motivated by CNN architecture design after extensive study.

3 Method

In this paper, we propose a cervical cell image generation model based on taming transformers (CCG-taming transformers) and a classification model based on Tokens-to-Token Vision Transformers (T2T-ViT) with transfer learning. The overall framework of the proposed model is illustrated in Fig. 2.

Fig. 2 Overview of the framework for this model

First, we used ImageNet to train T2T-ViT, preprocessed the cervical cancer Pap cell smear dataset, used CCG-taming transformers to expand the dataset, and then balanced the weights and numbers of different classes of images. After that, the newly obtained dataset was passed to the T2T-ViT that was pretrained by ImageNet and completed the transfer learning for classification, and finally, the result of the classification was obtained. Next, we elaborate on each part of the model in Fig. 2.

3.1 Taming transformers

Taming transformers [13] are used to generate a variety of high-quality images. In contrast to CNNs, transformers contain no inductive bias that prioritizes local interactions. The approach uses a convolutional VQGAN to learn a codebook of context-rich visual parts, whose composition is subsequently modeled with an autoregressive transformer architecture. A discrete codebook provides the interface between these architectures, and a patch-based discriminator enables strong compression while retaining high perceptual quality. This method brings the efficiency of convolutional approaches to transformer-based high-resolution image synthesis.

3.2 CCG-taming transformers

An image \( x\in {\mathbb{R}}^{H\times W\times 3} \) is represented by a spatial collection of codebook entries \( {z}_{\mathrm{q}}\in {\mathbb{R}}^{h\times w\times {n}_z} \), where \( {n}_z \) is the dimensionality of the codes. We first learn a convolutional model consisting of an encoder E and a decoder G, which together learn to represent images with codes from a learned discrete codebook \( Z={\left\{{z}_k\right\}}_{k=1}^K\subset {\mathbb{R}}^{n_z} \). Image x passes through the encoder E, and we obtain \( {z}_{\mathrm{q}} \) using the encoding \( \hat{z}=E(x)\in {\mathbb{R}}^{h\times w\times {n}_z} \) and a subsequent elementwise quantization q(⋅) of each spatial code \( {\hat{z}}_{ij}\in {\mathbb{R}}^{n_z} \) onto its closest codebook entry \( {z}_k \):

$$ {z}_{\mathrm{q}}=\mathrm{q}\left(\hat{z}\right)=\left(\underset{z_k\in Z}{argmin}\left\Vert {\hat{z}}_{ij}-{z}_k\right\Vert \right)\in {\mathbb{R}}^{h\times w\times {n}_z} $$
(1)
$$ \hat{x}=G\left({z}_{\mathrm{q}}\right)=G\left(\mathrm{q}\left(E(x)\right)\right) $$
(2)
$$ {\mathrm{L}}_{\mathrm{VQ}}\left(E,G,Z\right)={\left\Vert x-\hat{x}\right\Vert}_2^2+{\left\Vert \mathrm{sg}\left[E(x)\right]-{z}_{\mathrm{q}}\right\Vert}_2^2+\beta {\left\Vert \mathrm{sg}\left[{z}_{\mathrm{q}}\right]-E(x)\right\Vert}_2^2 $$
(3)

Where \( {\left\Vert x-\hat{x}\right\Vert}_2^2 \) is a reconstruction loss, sg[⋅] denotes the stop-gradient operation, and \( {\left\Vert \mathrm{sg}\left[{z}_{\mathrm{q}}\right]-E(x)\right\Vert}_2^2 \) is the so-called “commitment loss” with weighting factor β.

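VQGAN uses a discriminator and perceptual loss to maintain good perceptual quality at an increased compression rate. Specifically, VQGAN replaces the L2 reconstruction loss of the VQ-VAE with a perceptual loss and introduces an adversarial training procedure with a patch-based discriminator D that distinguishes between real and reconstructed images.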

$$ {\mathrm{L}}_{\mathrm{GAN}}\left(\left\{E,G,Z\right\},D\right)=\left[\log D(x)+\log \left(1-D\left(\hat{x}\right)\right)\right] $$
(4)

The complete objective for finding the optimal compression model Q = {E, G, Z} then reads:

$$ {\mathrm{Q}}^{\ast }=\underset{E,G,Z}{argmin}\underset{D}{\mathit{\max}}{\mathbbm{E}}_{x\sim p(x)}\left[{\mathrm{L}}_{\mathrm{VQ}}\left(E,G,Z\right)+\lambda {\mathrm{L}}_{\mathrm{GAN}}\left(\left\{E,G,Z\right\},D\right)\right] $$
(5)

Where we compute the adaptive weight λ according to:

$$ \lambda =\frac{\nabla_{G_L}\left[{\mathrm{L}}_{\mathrm{rec}}\right]}{\nabla_{G_L}\left[{\mathrm{L}}_{\mathrm{GAN}}\right]+\delta } $$
(6)

Where \( {\mathrm{L}}_{\mathrm{rec}} \) is the perceptual reconstruction loss [71], \( {\nabla}_{G_L}\left[\cdot \right] \) denotes the gradient of its input w.r.t. the last layer L of the decoder, and \( \delta ={10}^{-6} \) is used for numerical stability. With E and G available, the cervical cell images can be represented in terms of the codebook indices of their encodings. The quantized encoding of an image x is given by \( {z}_{\mathrm{q}}=\mathrm{q}\left(E(x)\right)\in {\mathbb{R}}^{h\times w\times {n}_z} \) and is equivalent to a sequence \( s\in {\left\{0,\dots ,\left|Z\right|-1\right\}}^{h\times w} \) of indices from the codebook, which is obtained by replacing each code by its index in the codebook Z:

$$ {s}_{ij}=k\quad \text{such that}\quad {\left({z}_{\mathrm{q}}\right)}_{ij}={z}_k $$
(7)
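
The quantization in Eq. (1) and the index map in Eq. (7) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy codebook, the 2 × 2 grid of codes, and the helper names `sq_dist`, `quantize` and `lookup` are hypothetical and are not taken from the VQGAN implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance; it has the same argmin as the norm in Eq. (1)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def quantize(z_hat, codebook):
    """Map every spatial code of an h x w grid to its closest codebook entry.

    Returns (z_q, s): the quantized grid and the grid of codebook indices.
    """
    z_q, s = [], []
    for row in z_hat:
        # Eq. (1): index k of the closest entry z_k for each spatial code
        idx_row = [min(range(len(codebook)), key=lambda k: sq_dist(code, codebook[k]))
                   for code in row]
        s.append(idx_row)                           # Eq. (7): s_ij = k
        z_q.append([codebook[k] for k in idx_row])  # (z_q)_ij = z_k
    return z_q, s

def lookup(s, codebook):
    """Recover z_q from the index grid s by mapping indices back to entries."""
    return [[codebook[k] for k in row] for row in s]

# Toy values (assumption): K = 3 codebook entries of dimension n_z = 2
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
z_hat = [[(0.1, -0.2), (0.9, 1.2)],
         [(-0.8, 1.7), (0.7, 0.6)]]

z_q, s = quantize(z_hat, codebook)
print(s)  # [[0, 1], [2, 1]]
```

Because `lookup(s, codebook)` reproduces `z_q` exactly, the index grid carries the same information as the quantized encoding.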

By mapping indices of a sequence s back to their corresponding codebook entries, \( {z}_{\mathrm{q}}=\left({z}_{s_{ij}}\right) \) is readily recovered and decoded to an image \( \hat{x}=G\left({z}_{\mathrm{q}}\right) \). Thus, after choosing some ordering of the indices in s, image generation can be formulated as autoregressive next-index prediction: given indices \( {s}_{<i} \), the transformer learns to predict the distribution of possible next indices, i.e., \( p\left({s}_i\left|{s}_{<i}\right.\right) \), to compute the likelihood of the full representation as \( p(s)=\prod \limits_ip\left({s}_i\left|{s}_{<i}\right.\right) \). This allows us to directly maximize the log-likelihood of the data representations, as shown in Eq. (8). Fig. 3 gives an overview of the framework for the CCG-taming transformers.

$$ {\mathrm{L}}_{\mathrm{Transformer}}={\mathbbm{E}}_{x\sim p(x)}\left[-\log p(s)\right] $$
(8)
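
As a small worked example of Eq. (8), the negative log-likelihood of one index sequence is the sum of the negative log conditional probabilities. The sketch below assumes a row-major ordering of the index grid and uses made-up conditional probabilities in place of a trained transformer; `flatten` and `sequence_nll` are hypothetical helper names.

```python
import math

def flatten(index_grid):
    """Order the h x w index grid row by row (one possible ordering, assumed here)."""
    return [k for row in index_grid for k in row]

def sequence_nll(step_probs):
    """-log p(s) = -sum_i log p(s_i | s_<i) for one sequence."""
    return -sum(math.log(p) for p in step_probs)

s = flatten([[0, 1], [2, 1]])           # sequence of codebook indices
# Hypothetical values of p(s_i | s_<i) that a model assigned to the true indices
step_probs = [0.5, 0.25, 0.5, 0.125]

nll = sequence_nll(step_probs)
likelihood = math.exp(-nll)             # equals the product of the conditionals
```

Averaging `sequence_nll` over the training images gives an empirical estimate of the expectation in Eq. (8).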

Fig. 3 Overview of the framework for the CCG-taming transformers

3.2.1 SE-block and MultiRes-block

Figure 4 shows the structure of SE-block and MultiRes-block. We add the SE (squeeze-and-excitation) block [29] in encoder E of VQGAN, which can adaptively recalibrate the characteristic response of each channel by explicitly modeling the interdependence between channels. The SE-block has excellent generalization ability in different datasets and brings significant performance improvement to the convolutional neural network. To improve the image information learning ability of VQGAN for cervical cancer Pap cell smears, this paper introduces the convolutional block attention module (CBAM) [67] to enhance VQGAN’s attention to cells and nuclei.

Fig. 4 Structure diagram of the encoder and decoder in the CCG-taming transformers

In Fig. 4, the dimension information in the excitation block represents the output of each layer. First, the global average-pooling layer in the squeeze block compresses each channel to a single value. Two fully connected layers then form a bottleneck structure that models the correlation between channels and outputs one weight per input feature channel: the first layer reduces the feature dimension to 1/16 of the input and, after ReLU activation, the second layer restores the initial feature dimension. A weight between 0 and 1 is then obtained through the sigmoid activation function, and each weight is applied to the corresponding feature channel through the scale operation. Finally, the features of the residual block are recalibrated in this way before the addition. The decoder has the same structure as the encoder, except that its spatial structure is mirrored and down-sampling becomes up-sampling. The specific structure is shown in Fig. 5.
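
The squeeze, excitation and scale steps described above can be illustrated with a framework-free sketch. It is not the paper's implementation: the function `se_block`, the toy weights, the omitted biases, and the reduction ratio of 2 (instead of 16, so that a two-channel example stays readable) are all assumptions.

```python
import math

def se_block(feature, w1, w2):
    """Squeeze-and-excitation on a feature map given as C channels of H x W values.

    w1: (C/r) x C weights of the reducing fully connected layer
    w2: C x (C/r) weights of the restoring fully connected layer
    Biases are omitted for brevity (assumption).
    """
    # Squeeze: global average pooling gives one scalar per channel
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    # Excitation: reduce -> ReLU -> restore -> sigmoid
    hidden = [max(0.0, sum(w * x for w, x in zip(w_row, squeezed))) for w_row in w1]
    weights = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(w_row, hidden))))
               for w_row in w2]
    # Scale: multiply every channel by its weight in (0, 1)
    scaled = [[[v * a for v in row] for row in ch] for ch, a in zip(feature, weights)]
    return scaled, weights

# Toy input (assumption): C = 2 channels of size 2 x 2, reduction ratio r = 2
feature = [[[1.0, 3.0], [5.0, 7.0]],
           [[0.0, 2.0], [2.0, 0.0]]]
w1 = [[0.5, -1.0]]        # 1 x 2
w2 = [[0.0], [2.0]]       # 2 x 1

scaled, weights = se_block(feature, w1, w2)
```

In this toy case the first channel receives the weight 0.5, so its values are halved, while the second channel is scaled by a weight closer to 1.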

Fig. 5 Structure diagram of MultiRes-block

The introduction of the SE-block makes the classification model of cervical cancer cell images more nonlinear, which can better fit the complex correlation between channels, greatly reduce the number of parameters and calculations, and solve the problem of gradient disappearance near the input layer on the trunk, which makes the model easy to optimize and increases the generalization ability of the classification model.

We improved the convolution structure in the encoder and introduced the MultiRes-block [25], which was originally used in the image segmentation model decoder to extract image information. The MultiRes-block is introduced to improve the efficiency of extracting cervical cancer cell images and optimize the structure of the decoder to replace the redundant structure of multilayer convolution and superposition in the original decoder.

We also add a residual connection because of its efficacy in medical image information extraction, together with 1 × 1 convolutional layers, which may allow us to capture some additional spatial information. We call this arrangement a MultiRes-block, as shown in Fig. 5.

Figure 5 illustrates the MultiRes-block, where we gradually increased the number of filters in the three successive layers and added a residual connection (along with a 1 × 1 filter for conserving dimensions). The Res paths of MultiRes-blocks introduce some additional processing to make the two feature maps more homogeneous.

3.2.2 Layer normalization

In layer normalization (LN), the inputs to the neurons in the same layer share the same mean and variance, while different input samples have different means and variances; therefore, LN does not depend on the batch size or on the length of the input sequence. LN is applied to the representation output by the attention layer. It standardizes the data, which facilitates the subsequent nonlinear processing by the ReLU activation function in the feed-forward layer: normalizing the data into the active region of ReLU lets the activation function work better.

LN normalizes the inputs of all neurons in a given layer of the deep network according to the following formula.

$$ {\sigma}^l=\sqrt{\frac{1}{H}\sum \limits_{i=1}^H{\left({a}_i^l-\frac{1}{H}\sum \limits_{j=1}^H{a}_j^l\right)}^2} $$
(9)

Here, \( {a}_i^l \) is the ith summed input to the neurons in the lth layer, and H denotes the number of hidden units in the layer. All the hidden units in a layer share the same normalization term σ.
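
Eq. (9) can be checked numerically with a short sketch. The helper `layer_norm`, the small constant `eps` added for numerical stability, and the omission of a learned gain and bias are assumptions for illustration.

```python
import math

def layer_norm(a, eps=1e-5):
    """Normalize the H summed inputs of one layer for a single sample.

    Returns (normalized values, mean, sigma), where sigma follows Eq. (9).
    """
    H = len(a)
    mu = sum(a) / H                                        # mean over hidden units
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / H)   # Eq. (9)
    normed = [(x - mu) / math.sqrt(sigma ** 2 + eps) for x in a]
    return normed, mu, sigma

a = [1.0, 2.0, 3.0, 4.0]          # toy summed inputs, H = 4
normed, mu, sigma = layer_norm(a)
```

The statistics are computed from one sample only, which is why the result does not change with the batch size.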

3.2.3 SMOTE-Tomek links

To balance the source dataset and the number of samples and weights of the images we generate, we used the Synthetic Minority Oversampling Technique (SMOTE) and Tomek Links undersampling in a pipelined approach [50]. Random oversampling simply copies samples to increase the minority class, which easily causes overfitting; that is, the information learned by the model is too specific and not general enough. The basic idea of the SMOTE algorithm is instead to analyze the minority samples, artificially synthesize new samples based on them, and add these to the dataset. A parameter represents the percentage of oversampling, and its value determines the number of synthetic samples to be created. For each minority instance, the nearest neighbors belonging to the same class are selected.

$$ k=\frac{\mathrm{SMOTE}\%}{100} $$
(10)

In this case, the minority class is oversampled with the applied ‘sampling-strategy’ parameter represented as ‘k’ (k = 0.5) i.e. keeping the sampling strategy parameter as 0.5 increases the number of minority class examples by 50%.
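
The synthesis step of SMOTE can be sketched as follows: a new sample is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbors. The function names, the toy points, the number of neighbors, and reading the rate 0.5 as "create 50% additional minority samples" (as the text above describes) are assumptions of this sketch, not the configuration used in the experiments.

```python
import random

def nearest_neighbors(x, samples, n_neighbors):
    """Return the n_neighbors samples closest to x (excluding x itself)."""
    others = [s for s in samples if s is not x]
    others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(x, s)))
    return others[:n_neighbors]

def smote(minority, rate=0.5, n_neighbors=2, seed=0):
    """Create int(rate * len(minority)) synthetic minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(int(rate * len(minority))):
        x = rng.choice(minority)
        neighbor = rng.choice(nearest_neighbors(x, minority, n_neighbors))
        u = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbor)))
    return synthetic

# Toy minority class with four 2-D feature vectors (assumption)
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
synthetic = smote(minority, rate=0.5)
```

Because every synthetic point is an interpolation between two existing minority samples, it always lies inside the region already occupied by the minority class.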

However, SMOTE has some blindness when selecting neighbors. Therefore, we use the Tomek Links method for undersampling to clean up noisy samples. Tomek Links finds pairs of nearest-neighbor samples that belong to opposite classes and deletes the majority-class sample in each pair. After such processing, the dividing line between the classes becomes clearer, making the presence of the minority class more obvious.

A Tomek link is a pair of samples x and y from two different classes such that, for any other sample z, the distance d satisfies:

$$ d\left(x,y\right)<d\left(x,z\right)\quad \text{and}\quad d\left(x,y\right)<d\left(y,z\right) $$
(11)

In the pipelined approach, the minority class is first oversampled using SMOTE, and majority-class samples that form Tomek links are then removed.
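
A minimal sketch of the Tomek link criterion in Eq. (11) follows. Two samples of different classes form a link when each is the nearest neighbor of the other, which matches Eq. (11) when no distances are tied. The toy one-dimensional layout, the labels and the helper names are assumptions for illustration.

```python
def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that satisfy Eq. (11)."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: dist(X[i], X[j]))
    links = []
    for i in range(len(X)):
        j = nearest(i)
        # Different classes and mutual nearest neighbors
        if i < j and y[i] != y[j] and nearest(j) == i:
            links.append((i, j))
    return links

def remove_majority(X, y, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = {k for pair in tomek_links(X, y) for k in pair if y[k] == majority_label}
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]

# Toy data (assumption): three majority and three minority samples
X = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (3.0, 0.0), (6.0, 0.0), (6.4, 0.0)]
y = ["maj", "maj", "maj", "min", "min", "min"]

links = tomek_links(X, y)
X_clean, y_clean = remove_majority(X, y, "maj")
```

Only the majority sample at 2.5, which sits right next to a minority sample, is removed; the other samples are untouched.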

3.3 Tokens-to-token vision transformers (T2T-ViT)

Tokens-to-token vision transformers (T2T-ViT) [70] can progressively tokenize an image to tokens and have an efficient backbone. T2T-ViT consists of two main components (Fig. 6): the tokens-to-token module (T2T module) is used to model the local structure information of images and reduce the token length progressively; the T2T-ViT backbone is used to draw the global attention relations on the tokens from the T2T module.

Fig. 6 The overall network architecture of T2T-ViT. The input cervical cell image is first soft split into patches and then unfolded as a sequence of tokens T1. The length of the tokens is reduced progressively in the T2T module, which outputs Tf after two iterations. Then, the backbone takes the fixed-length tokens as input and outputs the predictions. PE denotes position embedding.

3.3.1 T2T-ViT structure

Each T2T process has two steps: restructurization and soft splitting. The T2T transformer is shown in Fig. 7:

Fig. 7 The tokens Ti are restructurized as an image Ii after a transformation and reshaping; then, Ii is split with overlap into tokens Ti+1 again. Tokens-to-Token in the T2T process: the four tokens (1, 2, 4, 5) of the input Ii are concatenated to form one token in Ti+1

The restructurization is shown in Fig. 7. Given a sequence of tokens T from the preceding transformer layer, T is transformed by the self-attention block:

$$ {T}^{\prime }=\mathrm{MLP}\left(\mathrm{MSA}(T)\right) $$
(12)

Where MSA denotes the multihead self-attention operation with layer normalization and MLP is the multilayer perceptron with layer normalization in the standard transformer [11]. Then, the tokens are reshaped as images on the spatial dimension.

$$ I=\mathrm{Reshape}\left({T}^{\prime}\right) $$
(13)

Where “Reshape” reorganizes tokens \( {T}^{\prime}\in {\mathbb{R}}^{l\times c} \) to \( I\in {\mathbb{R}}^{h\times w\times c} \), where l is the length of \( {T}^{\prime } \), h, w, and c are the height, width and channel, respectively, and l = h × w.

As shown in Fig. 7, after obtaining the restructurized image I, a soft split is applied to model the local structure information and reduce the token length. Specifically, to avoid information loss in generating tokens from the restructurized image, we split the image into patches with overlap. As such, each patch is correlated with surrounding patches, which establishes a prior that there should be stronger correlations between surrounding tokens. The tokens in each split patch are concatenated as one token (tokens-to-token, Fig. 7); thus, the local information can be aggregated from surrounding pixels and patches. When conducting the soft split, the size of each patch is k × k with s overlapping and p padding on the image, where k − s is similar to the stride in the convolution operation. Therefore, for the restructurized image \( I\in {\mathbb{R}}^{h\times w\times c} \), the length of the output tokens \( {T}_O \) after soft splitting is

$$ {l}_O=\left\lfloor \frac{h+2p-k}{k-s}+1\right\rfloor \times \left\lfloor \frac{w+2p-k}{k-s}+1\right\rfloor $$
(14)

Each split patch has size k × k × c. We flatten all patches in spatial dimensions to tokens \( {T}_O\in {\mathbb{R}}^{l_O\times c{k}^2} \). After the soft split, the output tokens are fed to the next T2T process.

3.3.2 T2T-ViT backbone and transfer learning

The T2T-ViT has two parts: the tokens-to-token (T2T) module and the T2T-ViT backbone (Fig. 6). There are various possible design choices for the T2T module. Here, we set n = 2 as shown in Fig. 8, which means there are n + 1 = 3 soft splits and n = 2 restructurizations in the T2T module. The patch sizes for the three soft splits are P = [7, 3, 3], and the overlaps are S = [3, 1, 1], which reduces the size of the input image from 224 × 224 to 14 × 14 according to Eq. 14.
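
The reduction from 224 × 224 to 14 × 14 can be verified by applying Eq. 14 three times. The padding values are not stated in the text; the sketch assumes p = [2, 1, 1], and the helper names are hypothetical.

```python
def soft_split_side(size, k, s, p):
    """Eq. 14 for one spatial dimension: patch size k, overlap s, padding p."""
    return (size + 2 * p - k) // (k - s) + 1

def t2t_token_grid(size, patches, overlaps, paddings):
    """Apply the soft splits in sequence and record the side length after each."""
    sides = []
    for k, s, p in zip(patches, overlaps, paddings):
        size = soft_split_side(size, k, s, p)
        sides.append(size)
    return sides

P = [7, 3, 3]        # patch sizes from the text
S = [3, 1, 1]        # overlaps from the text
PAD = [2, 1, 1]      # paddings: an assumption, not given in the text

sides = t2t_token_grid(224, P, S, PAD)
token_lengths = [n * n for n in sides]
print(sides)          # [56, 28, 14]
print(token_lengths)  # [3136, 784, 196]
```

With these values the token length shrinks from 3136 after the first soft split to 196 for the tokens passed to the backbone.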

Fig. 8 (a) Images generated by depth-guided neural rendering; (b) samples generated from semantic layouts

By conducting the above restructurization and soft splitting iteratively, the T2T module can progressively reduce the token length and transform the spatial structure of the image. The iterative process in the T2T module can be formulated as:

$$ {T}_i^{\prime }=\mathrm{MLP}\left(\mathrm{MSA}\left({T}_i\right)\right) $$
(15)
$$ {I}_i=\mathrm{Reshape}\left({T}_i^{\prime}\right) $$
(16)
$$ {T}_{i+1}=\mathrm{SS}\left({I}_i\right),i=1\dots \left(n-1\right) $$
(17)

For the input cervical cell image \( {I}_0 \), we first apply a soft split to split the image into tokens: \( {T}_1=\mathrm{SS}\left({I}_0\right) \). After the final iteration, the output tokens \( {T}_f \) of the T2T module have a fixed length; thus, the backbone of T2T-ViT can model the global relations on \( {T}_f \). In this paper, we select T2T-ViT-ResNeXt as the backbone, which has the highest classification accuracy among the original T2T-ViT variants.

The mathematical formulation of a classifier with a training algorithm is shown below:

figure f

4 Experiment

4.1 Dataset

To evaluate the classification performance of our method, three public cervical cell image datasets are used in this paper:

The liquid-based cytology Pap smear dataset [24] consists of a total of 963 LBC images subdivided into four sets representing the four classes: NILM, LSIL, HSIL, and SCC. It comprises precancerous and cancerous lesions related to cervical cancer as per standards under the Bethesda System (TBS).

SIPaKMeD [43] consists of 4049 images of isolated cells taken from 966 cluster cell images of Pap smear slides. The cells belong to five categories: (1) dyskeratotic, (2) koilocytotic, (3) metaplastic, (4) parabasal and (5) superficial-intermediate. Classes 1–2 represent abnormal cervical cells, classes 4–5 represent normal cervical cells, and class 3 represents benign cells.

The Herlev dataset [27] consists of 917 isolated single-cell images, that is, each image contains one cervical cell. There are a total of seven classes: (1) superficial squamous epithelia, (2) intermediate squamous epithelia, (3) columnar epithelia, (4) mild squamous non-keratinizing dysplasia, (5) moderate squamous non-keratinizing dysplasia, (6) severe squamous non-keratinizing dysplasia and (7) squamous cell carcinoma in situ intermediate. Classes 1–3 are normal cervical cells, whereas classes 4–7 are abnormal cervical cells. Tables 1, 2, 3 list the data distribution of the samples in the three cervical cancer datasets: liquid-based cytology Pap smear, SIPaKMeD, and Herlev.

Table 1 Data distribution of the liquid-based cytology Pap smear dataset
Table 2 Data distribution of the SIPaKMeD dataset
Table 3 Data distribution of the Herlev dataset

4.2 Model training process

To train the cervical cancer cell classification model, we used transfer learning, a machine learning method that uses existing knowledge to solve different but related domain problems [55]. We first pretrained T2T-ViT on the ImageNet dataset and initialized the weights of our T2T-ViT with the pretrained parameters; we then trained this model on the cervical cancer cell images. All training data were resized to 224 × 224 × 3 and divided into minibatches of size 32. We used AdamW [37] as the optimizer with cosine learning rate decay [36]. The experiments were run with Python 3.7.1 on a Windows operating system with an i9-9900K processor, 64 GB of memory, and one RTX 2080 Ti graphics card. The entire experiment was based on the open-source deep learning framework PyTorch 1.0 in Anaconda3. We trained T2T-ViT for 100 epochs and used 5-fold cross-validation to evaluate classification performance; our experimental configuration follows the settings of [15, 51]. In each round, 4 of the 5 folds (80% of the data) were used as the training set and the remaining fold (20%) was held out for testing. The classification evaluation metrics were obtained by averaging the results over the 5 held-out folds. After five rounds, all the data had been tested.
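
The 5-fold protocol can be sketched as follows. This is only an illustration of the splitting scheme: the function name `k_fold_indices`, the fixed seed and the plain (non-stratified) shuffling are assumptions, since the text does not say how samples were assigned to folds.

```python
import random


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, held_out) index lists for each of n_folds rounds.

    Assumption: a plain shuffled split; the text does not say whether the
    folds were stratified by class.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        held_out = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


# 963 is the size of the liquid-based cytology Pap smear dataset.
for round_id, (train, held_out) in enumerate(k_fold_indices(963), start=1):
    print(round_id, len(train), len(held_out))
```

Each sample is held out exactly once over the five rounds, which is why all the data have been tested after the last round; the reported metrics are then the average over the five held-out folds.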

The hyperparameters maintained during training are shown in Table 4.

Table 4 Hyperparameters maintained during training

4.3 Evaluation metrics

4.3.1 Evaluation metrics of cervical cell classification

To comprehensively evaluate the classification performance of the model, accuracy (ACC), sensitivity (SE), specificity (SP), H-mean, and F1 score were used as evaluation metrics. Accuracy is the overall percentage of correctly identified cells and measures the performance of the classifier over the whole sample. Sensitivity, also called the true positive rate or recall, is the proportion of abnormal cells that are correctly identified; specificity is the proportion of normal cells that are correctly identified. H-mean is the harmonic mean of sensitivity and specificity, and the F1 score is the harmonic mean of precision and recall.

$$ \mathrm{ACC}=\frac{N_{TP}+N_{TN}}{N_{FP}+N_{TP}+N_{TN}+N_{FN}} $$
(18)
$$ \mathrm{SE}\left(\mathrm{recall}\right)=\frac{N_{TP}}{N_{TP}+N_{FN}} $$
(19)
$$ \mathrm{SP}=\frac{N_{TN}}{N_{TN}+N_{FP}} $$
(20)
$$ \mathrm{H}\hbox{-}\mathrm{mean}=2\times \frac{\mathrm{SE}\times \mathrm{SP}}{\mathrm{SE}+\mathrm{SP}} $$
(21)
$$ {F}_1=\frac{2\,\mathrm{SE}\cdot \frac{N_{TP}}{N_{TP}+N_{FP}}}{\mathrm{SE}+\frac{N_{TP}}{N_{TP}+N_{FP}}} $$
(22)

\( N_{TP} \), \( N_{TN} \), \( N_{FP} \) and \( N_{FN} \) denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
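
Equations 18–22 translate directly into code. The sketch below is a minimal illustration; the function name `binary_metrics` and the example counts are hypothetical and are not results from this paper.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute ACC, SE, SP, H-mean and F1 (Eqs. 18-22) from the four counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    se = tp / (tp + fn)          # sensitivity / recall
    sp = tn / (tn + fp)          # specificity
    precision = tp / (tp + fp)
    h_mean = 2 * se * sp / (se + sp)
    f1 = 2 * se * precision / (se + precision)
    return {"ACC": acc, "SE": se, "SP": sp, "H-mean": h_mean, "F1": f1}


# Hypothetical counts, for illustration only.
example = binary_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

For a multi-class problem, one common choice (assumed here, not stated in the text) is to compute these quantities one-vs-rest for each class and then average them.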

4.3.2 Evaluation metrics of cervical cell generation

The inception score (IS) [49], which is a measure of the generated image quality, uses a pretrained inception network [47] to extract the features of generated images by computing:

$$ \mathrm{IS}(G)=\exp \left({\mathbb{E}}_{x\sim {p}_g}{D}_{KL}\left(p\left(y|x\right)\Vert p(y)\right)\right) $$
(23)

\( D_{KL} \), p(y|x), p(y), x, and y denote the KL divergence, the conditional label distribution of a sample, the marginal label distribution obtained from all the samples, the given image, and the label of the main object in the image, respectively. A high IS indicates that each sample image is confidently assigned to a particular ImageNet category.
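
To make Eq. 23 concrete, the sketch below computes the score from a list of class-probability vectors p(y|x). In practice these vectors come from a pretrained inception network; here they are hypothetical toy values, and the function name `inception_score` is an assumption of this illustration.

```python
from math import exp, log


def inception_score(probs):
    """Eq. 23 for a list of per-image class distributions p(y|x).

    Each entry of `probs` is a list of class probabilities summing to 1.
    """
    n = len(probs)
    n_classes = len(probs[0])
    # Marginal p(y): average of the conditional distributions over all images.
    marginal = [sum(p[c] for p in probs) / n for c in range(n_classes)]
    kl_sum = 0.0
    for p in probs:
        # KL(p(y|x) || p(y)); terms with zero probability contribute nothing.
        kl_sum += sum(pc * log(pc / marginal[c]) for c, pc in enumerate(p) if pc > 0)
    return exp(kl_sum / n)


# Toy predictions (hypothetical): confident and diverse vs. uninformative.
sharp_and_diverse = [[1.0, 0.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
print(inception_score(sharp_and_diverse))
print(inception_score(uninformative))
```

The first case reaches the maximum for two classes (a score of 2), while the second gives the minimum of 1, which matches the reading that IS rewards both clear and diverse generated images.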

The Fréchet inception distance (FID) [23] embeds a set of generated images into a feature space represented by a specific layer of inception or any CNN [47]. It uses a continuous multivariate Gaussian distribution to represent the embedding feature distributions of the real data and the generated data, and the Fréchet distance between these two Gaussian distributions is computed by:

$$ \mathrm{FID}\left(r,g\right)={\left\Vert {\mu}_r-{\mu}_g\right\Vert}_2^2+ Tr\left({C}_r+{C}_g-2{\left({C}_r{C}_g\right)}^{\frac{1}{2}}\right) $$
(24)

μr and Cr denote the mean and covariance of the real data, and μg and Cg denote the mean and covariance of the generated data, respectively. FID serves as a good measure for GANs due to its good discriminability, robustness, and computational efficiency.
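
Eq. 24 requires a matrix square root of \( C_rC_g \). As a simplified sketch only, the code below assumes that both covariance matrices are diagonal, in which case the trace term reduces to a sum over feature dimensions of \( {\left(\sqrt{v_r}-\sqrt{v_g}\right)}^2 \), where v denotes the per-dimension variance. This is not the full FID computed on inception features; the function name `fid_diagonal` and the numbers are hypothetical.

```python
from math import sqrt


def fid_diagonal(mu_r, var_r, mu_g, var_g):
    """Eq. 24 under the assumption of diagonal covariances.

    mu_* are mean vectors and var_* the per-dimension variances
    (the diagonals of C_r and C_g).
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # For diagonal matrices, Tr(C_r + C_g - 2 (C_r C_g)^(1/2)) is a plain sum.
    cov_term = sum(vr + vg - 2 * sqrt(vr * vg) for vr, vg in zip(var_r, var_g))
    return mean_term + cov_term


# Hypothetical two-dimensional feature statistics.
print(fid_diagonal([0.0, 0.0], [1.0, 4.0], [1.0, 2.0], [4.0, 1.0]))  # 7.0
print(fid_diagonal([0.0, 0.0], [1.0, 4.0], [0.0, 0.0], [1.0, 4.0]))  # 0.0
```

Identical statistics give a distance of 0, and the distance grows as either the means or the variances of the two distributions move apart.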

4.4 Results

4.4.1 CCG-taming transformers generated results

We used data from each type of cervical cancer cell in the three datasets (liquid-based cytology Pap smear, SIPaKMeD, and Herlev) to generate more high-quality cervical cell images and included them in the classification experiments. In this paper, the widely used inception score (IS) is used to evaluate the quality of the generated cervical cancer cell images. IS is one of the most commonly used metrics for evaluating images generated by GANs: the higher the value, the more realistic the images and the higher their quality. IS focuses on two aspects of the generated images: their clarity and their diversity.

We also use the Fréchet inception distance (FID) as a second evaluation metric for cervical cell generation; the FID score ranges from 0 to positive infinity, and a smaller score indicates a better model. The evaluation metrics of cervical cell generation are shown in Table 5. The comparison results with other generation models are shown in Table 11.

Table 5 The average generation metrics on three datasets: (a) liquid-based cytology Pap smear dataset, (b) SIPaKMeD and (c) Herlev

Figure 8a shows the images generated by depth-guided neural rendering on RIN with f = 16 using the sliding attention window. Figure 8b shows the images generated from semantic layouts on S-FLCKR with f = 16 using the sliding attention window. The generated images based on the three cervical cancer cell datasets (liquid-based cytology Pap smear dataset, SIPaKMeD, Herlev) are shown in Figs. 9, 10, 11. We invited professional inspection experts to distinguish the images generated by the CCG-taming transformers from real images, and the results showed that the generated cervical cancer cell images differ little from the real images and can thus be used as training data.

Fig. 9
figure 9

Cervical cancer cell images generated by CCG-taming transformers based on the liquid-based cytology Pap smear dataset

Fig. 10
figure 10

Cervical cancer cell images generated by CCG-taming transformers based on the SIPaKMeD dataset

Fig. 11
figure 11

Cervical cancer cell images generated by CCG-taming transformers based on the Herlev dataset

This paper balances the weights and quantities of the different classes and expands the number of samples in the three cervical cancer cell datasets (liquid-based cytology Pap smear dataset, SIPaKMeD, and Herlev) using SMOTE-Tomek Links. We named the three new datasets CCG1, CCG2, and CCG3. The distribution of the different classes of cervical cells is shown in Tables 6, 7, 8.

Table 6 Data distribution of the CCG1 dataset
Table 7 Data distribution of the CCG2 dataset
Table 8 Data distribution of the CCG3 dataset
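
The two ingredients of SMOTE-Tomek Links can be sketched on small feature vectors. This is an illustrative sketch, not the pipeline used in this paper: it assumes that samples are represented as numeric feature vectors, uses squared Euclidean distance, and the function names `smote_sample` and `tomek_links` are hypothetical.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic minority sample by interpolating between a
    minority point and one of its k nearest minority neighbours.
    Requires at least two minority points."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    others = [m for m in minority if m is not base]
    neighbours = sorted(others, key=lambda m: dist2(base, m))[:k]
    other = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    return [b + lam * (o - b) for b, o in zip(base, other)]


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and belong to different classes."""
    def nearest(i):
        candidates = (j for j in range(len(points)) if j != i)
        return min(candidates, key=lambda j: dist2(points[i], points[j]))

    links = []
    for i in range(len(points)):
        j = nearest(i)
        if i < j and labels[i] != labels[j] and nearest(j) == i:
            links.append((i, j))
    return links


# Toy data (hypothetical): one majority point sits next to a minority point.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.0]]
labels = [0, 1, 1, 1]
print(tomek_links(points, labels))  # [(0, 1)]
print(smote_sample([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]))
```

SMOTE adds synthetic points on the segments between minority neighbours, while removing samples that take part in Tomek links cleans up the class boundary; the combination therefore both enlarges and tidies the minority classes.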

4.4.2 Cervical cell classification results

In this section, we describe the classification results on the three cervical cancer cell datasets: (1) the liquid-based cytology Pap smear dataset, (2) SIPaKMeD and (3) Herlev. After 150 epochs of training on dataset (1), the model yielded good results: the test loss reached a minimum, and the corresponding test accuracy was 98.79%. After 100 epochs of training on datasets (2) and (3), the model's curves stabilized, the test loss reached a minimum, and the corresponding test accuracies were 99.58% and 99.98%, respectively. Tables 9, 10 show the classification ACC, SE, SP, H-mean, and F1 score for datasets (1–3) and CCG1–3. The classification results of T2T-ViT on the three datasets are significantly better than those of the other classification models. Although ViT also achieved good classification accuracy, it was not superior to all CNN-based classification models.

Table 9 The classification results on three datasets: (1) liquid-based cytology Pap smear dataset, (2) SIPaKMeD and (3) Herlev
Table 10 The classification results on CCG1–3

Figures 12, 13, 14a show the training and test accuracy versus the number of training epochs on datasets (1–3), and Figs. 12, 13, 14b show the training and test loss versus the number of training epochs on the CCG1–3 datasets.

Fig. 12
figure 12

Accuracy and loss diagram of liquid-based cytology Pap smear dataset classification. a Training and test loss versus the number of training epochs, b training and test BMA versus the number of training epochs

Fig. 13
figure 13

Accuracy and loss diagram of SIPaKMeD. a Training and test loss versus the number of training epochs, b training and test BMA versus the number of training epochs

Fig. 14
figure 14

Accuracy and loss diagram of Herlev. a Training and test loss versus the number of training epochs, b training and test BMA versus the number of training epochs

A confusion matrix, also known as an error matrix, is a standard format for accuracy evaluation. The output of the cervical cancer cell classification model is divided into n categories, so the confusion matrix is an n × n matrix. For datasets (1–3), n is equal to 4, 5, and 7, respectively. For the CCG1–3 datasets, n is likewise equal to 4, 5, and 7 (Fig. 15).

Fig. 15
figure 15

Confusion matrix of T2T-ViT-24 classification results on different datasets. a Liquid-based cytology Pap smear dataset, b SIPaKMeD, c Herlev, d CCG1, e CCG2, and f CCG3
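
An n × n confusion matrix of this kind can be built in a few lines. The sketch below assumes that rows index the true class and columns the predicted class (the text does not state the orientation used in Fig. 15); the function names and the toy labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Build an n x n matrix with rows = true class, columns = predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_counts(matrix, cls):
    """One-vs-rest (TP, TN, FP, FN) for class `cls`."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[cls][cls]
    fn = sum(matrix[cls]) - tp                # true cls, predicted otherwise
    fp = sum(row[cls] for row in matrix) - tp  # predicted cls, true otherwise
    tn = total - tp - fn - fp
    return tp, tn, fp, fn


# Toy labels for a 3-class problem (hypothetical data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
for row in cm:
    print(row)
print(per_class_counts(cm, cls=1))  # (2, 3, 1, 0)
```

The diagonal holds the correctly classified samples, and the one-vs-rest counts extracted per class are the inputs to the sensitivity, specificity and F1 formulas of Sect. 4.3.1.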

5 Discussion

Cervical cell classification is a challenging issue for automatic screening of cervical cytology. The performance of the classification method determines whether it can bring convenience to cytoscreeners and women. As the current mainstream classification framework, a convolutional neural network (CNN) uses a hierarchical deep architecture to automatically learn high-level features.

For cervical cancer Pap cell smears, previous research has relied on only two public datasets, SIPaKMeD and Herlev, both of which have small sample sizes. The number and weight of the various samples are unbalanced, and the sample image quality is uneven. This limits future research on cervical cancer and the timeliness of clinical analysis of the patient’s condition. To solve these problems, this paper proposes CCG-taming transformers and generates sample images based on the existing published cervical cancer Pap cell smear datasets.

We combine CNN with the transformer: a CNN-based VQGAN efficiently learns a codebook of context-rich visual parts, and an autoregressive transformer then models the global composition of the codebook components. The discrete codebook provides the interface between these architectures, and the patch-based discriminator achieves powerful compression while maintaining high perceptual quality.

To better verify the experimental results of this paper, we used CCG-taming transformers, StyleGAN [31], PGAN [30], DC-GAN [45], LAPGAN [9] and GAN [16] to generate 1000 cervical cell images, and the samples were evaluated 20 times. The experimental results are shown in Table 11 and Figs. 16, 17, 18.

Table 11 The average generation metrics on three datasets: (a) liquid-based cytology Pap smear dataset, (b) SIPaKMeD and (c) Herlev
Fig. 16
figure 16

Abnormal images generated by various types of GANs on the SIPaKMeD dataset

Fig. 17
figure 17

Abnormal images generated by various types of GANs on the Herlev dataset

Fig. 18
figure 18

Abnormal images generated by various types of GANs on the liquid-based cytology Pap smear dataset

According to Table 11, the IS value (the higher the better) of the proposed CCG-taming transformers is the highest, reaching 3.759; the IS values of the other five methods are lower by 0.705, 1.273, 1.235, 1.431 and 1.791. The FID value (the lower the better) of the proposed CCG-taming transformers is the lowest, reaching 0.714; the FID values of the other five methods are higher by 1.414, 0.854, 0.643, 1.303 and 2.533. Precision and recall both range from 0 to 1, and values close to 1 indicate good results. The CCG-taming transformers score best on all four metrics compared with the other five models, which indicates that the images they generate are of higher quality.

It can be seen in Figs. 16, 17, 18 that the images of cervical cancer cells generated by the various types of GANs have many problems. For each type of GAN, the superposition of different image styles may cause image distortion or repeated image components of the same style (Figs. 16, 17). This is because a CNN-based GAN processes images pixel by pixel, so different styles tend to be fused too rigidly or simply repeated. The generated sample images may also appear blurred (Fig. 18, row 2) because the GAN’s CNN has a limited structural depth. When the original dataset images have low resolution and mediocre quality, the GAN’s discriminator applies a low standard when judging the generated samples. These low-quality generated cervical cell images will undoubtedly affect the classification model and make clinical diagnosis by gynecologists more difficult.

The CCG-taming transformers serialize the image to analyze the image information. Through the information interaction between VQGAN and the codebook, deeper details of the image can be learned, which makes the transitions between cervical cancer cells in the generated image more natural and avoids the problems described above. As a result, the cervical cancer sample images generated by CCG-taming transformers are closer to the real images and provide a more objective reference and valuable sample images for cervical cancer research, which should assist gynecologists in clinical diagnosis and improve the survival rate of cervical cancer patients.

We conducted ablation experiments on the novel components of this paper (SMOTE-Tomek Links, MultiRes-block, SE-block and Layer normalization) to verify that each of them improves performance. The results, averaged over the three datasets, are shown in Table 12.

Table 12 Ablation study on the influence of SMOTE-Tomek Links, MultiRes-block, SE block and Layer normalization on SE-MultiResBlock

We first consider the use of SMOTE-Tomek Links. The BMA with SMOTE-Tomek Links is 2.9% higher. Subsequently, we consider the introduction of MultiRes-block when SMOTE-Tomek Links is used. The BMA with MultiRes-block is 2.4% higher. Then, with SMOTE-Tomek Links and MultiRes-block in place, we consider the introduction of the SE block. With the SE block, BMA is 2.2% higher. Finally, we consider the introduction of Layer normalization. With Layer normalization, BMA is 1.7% higher. These results demonstrate that the introduction of SMOTE-Tomek Links, MultiRes-block, SE-block and Layer normalization can significantly improve the BMA of the entire model.

Many CNN-based classification studies have been carried out on the public cervical cancer Pap cell datasets. Tables 13, 14 compare the results of this study with other CNN-based research results.

Table 13 Different models’ classification results in SIPaKMeD
Table 14 Different models’ classification results in Herlev

Tables 13, 14 show that CNN-based classification research on the public cervical cancer Pap smear datasets has achieved high classification accuracy, but this also suggests that classification research on these public datasets is approaching a state of overfitting. On the published cervical cancer Pap cell smear datasets, the classification accuracy of the T2T-ViT-based model in this paper is higher than that of the CNN-based models.

Win et al. [66] developed a computer-assisted screening system for cervical cancer using digital image processing of Pap smear images; Haryanto et al. [21] aimed to create a classification model for cervical cell images using a convolutional neural network (CNN); Talo et al. [53] used a deep learning approach to classify cervical cell images obtained from Pap smear slides; Mamunur et al. [38] proposed DeepCervix, a hybrid deep feature fusion (HDFF) technique based on DL, to classify cervical cells accurately; Dounias et al. [12] compared the performance of various intelligent methodologies in the task of Pap smear diagnosis; Marinakis et al. [39] proposed an effective genetic algorithm scheme combined with a number of nearest-neighbor-based classifiers; Dong et al. [10] proposed a machine learning method based on a feature selection algorithm for cervical cell classification.

The above methods have achieved good results in the classification task of cervical cancer. However, these studies focus mainly on improving classification accuracy and do not consider the impact of limited datasets and sample numbers on the classification results; this is one of the biggest differences between this paper and the above studies. In addition, there are too few public sample datasets for cervical cancer, which leads to serious overfitting in current cervical cancer classification studies and seriously hinders cervical cancer research. First, this paper improves classification accuracy by balancing the weight and number of samples between different categories, and it can generate more samples of cervical cancer cells in different categories, providing a more objective reference for cervical cancer research. Second, most of the above studies use CNN-based networks to classify cervical cancer, whereas this paper combines CNN and Transformer to further improve the classification model’s ability to capture global contextual information in the feature maps of the classified samples; this compensates for the deficiencies of CNN-based classification models and improves classification performance.

To verify that balancing the sample size and weight of the various classes and using high-quality samples improves classification accuracy, we conducted experiments on the influence of different sample compositions from the three cervical cancer cell datasets, (1) the liquid-based cytology Pap smear dataset, (2) SIPaKMeD and (3) Herlev, on the T2T-ViT in this paper. First, we prepared five datasets of different samples that were altered by matrix transformation: Dataset I: simply upweight the classes with few samples; Dataset II: equalize the number for each class by showing minority examples more often; Dataset III: only samples generated by CCG-taming transformers are shown, to increase the relative weight of these classes; Dataset IV: the samples generated by CCG-taming transformers were added to the cervical cell image classes with fewer samples and then balanced with the weight of the different kinds of cervical cell images; Dataset V: a combination of Dataset I and Dataset II.

Then, we used the classification framework of this paper to train on these five datasets and analyzed the results.
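
The two baseline strategies (Dataset I, upweighting rare classes; Dataset II, showing minority examples more often) can be sketched as follows. The exact weighting and resampling schemes used in this paper are not specified, so the inverse-frequency weights and the repeat-until-balanced oversampling below are assumptions, as are the function names and the toy labels.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def oversample_indices(labels):
    """Repeat the indices of each class until it matches the largest class."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    target = max(len(idxs) for idxs in by_class.values())
    out = []
    for idxs in by_class.values():
        out.extend(idxs[i % len(idxs)] for i in range(target))
    return out


# Toy labels (hypothetical): six samples of class "a", two of class "b".
labels = ["a"] * 6 + ["b"] * 2
print(balanced_class_weights(labels))
resampled = oversample_indices(labels)
print(Counter(labels[i] for i in resampled))
```

Both strategies only reuse the existing minority images, either through the loss weights or through repetition, whereas Datasets III and IV add new images produced by the generator.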

Tables 15, 16, 17 report the results for simply upweighting the classes with few samples (Dataset I) and for equalizing the number for each class by showing minority examples more often (Dataset II). Compared with using CCG-taming transformers to expand the classes with fewer samples, Dataset II has a lower ACC and lower values of the other evaluation metrics, while Dataset V, formed by combining Dataset I and Dataset II, does not obtain better results. Dataset III and Dataset IV both expand the classes with fewer samples through the CCG-taming transformers, and Dataset IV additionally balances the weight of each sample class on top of the Dataset III method. Finally, we find that Dataset IV requires less training time than Dataset III and obtains higher evaluation metrics.

Table 15 Classification results of Datasets I-V composed of different samples in the liquid-based cytology Pap smear dataset
Table 16 Classification results of Datasets I-V composed of different samples in the SIPaKMeD dataset
Table 17 Classification results of Datasets I-V composed of different samples in the Herlev dataset

In addition, we also conducted an image generation experiment on the ISIC2019 skin cancer dataset to verify the generalization ability of CCG-Taming Transformers. The results are shown in Table 18.

Table 18 Image generation results of different models in the ISIC2019 dataset

The CCG-taming transformers score best on the four metrics compared with the other four models, which indicates that they are the better image generation model overall.

Considering that the potential relationship between cervical cell images is ignored in the process of CNN feature learning, our goal is to learn the relationship between cell images through the CBAM module in T2T-ViT to enhance the discrimination ability of T2T-ViT features. This is also the reason why transformer-based classification models currently outperform CNNs in image classification, segmentation and pattern recognition. For the generation of medical images, the various GANs built on CNNs exhibit not only a strong locality bias but also a bias towards spatial invariance, because they use shared weights at all locations. These biases make them ineffective when a more comprehensive understanding of the input is required. In this paper, CCG-taming transformers do not represent the image with pixels but as a composition of perceptually rich image components from the codebook.

6 Applied technical analysis

When the model was tested and simulated, we found that it is necessary to combine the artificial intelligence system with the clinical experience of doctors when indistinguishable cervical cancer cell images are encountered. Computer-aided diagnosis is valuable for curative effect and prognosis. It can also assist primary-care clinicians with pathological judgment and diagnosis. Our work can not only help gynecologists in diagnosis but also raise patients’ awareness of cervical cancer and provide a valuable reference for cervical cancer cell image analysis based on artificial intelligence. The research in this paper alleviates, to a certain extent, the lack of public cervical cancer datasets and the imbalance in the weight and quantity of the various samples. The generated images are close in quality to the samples of the source datasets. The images of cervical cancer cell samples we generated can spare obstetricians and gynecologists the trouble caused by low-quality samples and improve the accuracy and efficiency of disease analysis.

7 Limitations

Our model also has some shortcomings. There are still some technical challenges in cervical cancer cell image classification under microscopy. Two kinds of cervical cancer cells overlapping in the same image, as well as cells at different stages of cervical cancer, make the task harder for the classification model and may lead to misclassification. Although the model shows high classification accuracy, these problems have not been well solved, so it is necessary to further improve the classification model and expand the dataset samples. Future research will focus on sample data expansion to provide better support for the clinical diagnosis of cervical cancer.

8 Conclusion

In this paper, a cervical cancer cell sample image generation model based on taming transformers (CCG-taming transformers) was proposed to balance the unbalanced weights and numbers of the various classes in the cervical cancer datasets, and T2T-ViT combined with transfer learning was used to classify the samples.

Our CCG-taming transformers improve the encoder structure by introducing and improving MultiRes-block, SE-block and Layer normalization, which strengthens the model’s ability to capture the subtle differences in the feature maps of cervical cancer cells and the feature information of the global context; Layer normalization is used to standardize the data. We also introduce SMOTE-Tomek Links to balance the number of samples and the weights of the images in the source datasets. The quality of the images generated on the three cervical cancer datasets is very close to that of the source datasets: the final inception score (IS), Fréchet inception distance (FID), recall and precision are 3.75, 0.71, 0.32 and 0.65, respectively. The proposed model is superior to other GAN-based models on the different quantitative evaluation metrics.

The T2T-ViT was constructed in combination with transfer learning. The classification accuracies on the liquid-based cytology Pap smear dataset (4-class), SIPaKMeD (5-class), and Herlev (7-class) are 98.79%, 99.58%, and 99.88%, respectively. This also indirectly indicates that the cervical cancer sample images generated by the CCG-taming transformers are of good quality, which helps improve the accuracy of the classification model.

The whole process of this paper, including cervical cancer cell image synthesis, classification model construction, the balancing of sample sizes and weights, and image classification, is applicable to the analysis of other medical images, especially datasets with intraclass-imbalanced data or insufficient labeled data. This work provides a useful reference for medical image analysis based on deep learning and artificial intelligence.

In future work, we will work to expand the applicability of the model in this paper, so that it can adapt to more different types of medical image data sets with a small number of samples and unbalanced weights.