Review

Part-Prototype Models in Medical Imaging: Applications and Current Challenges

by Lisa Anita De Santi 1,2, Franco Italo Piparo 1, Filippo Bargagna 1,2, Maria Filomena Santarelli 3, Simona Celi 2 and Vincenzo Positano 2,*

1 Department of Information Engineering, University of Pisa, 56122 Pisa, Italy
2 Bioengineering Unit, Fondazione Toscana G. Monasterio, 56124 Pisa, Italy
3 CNR Institute of Clinical Physiology, 56124 Pisa, Italy
* Author to whom correspondence should be addressed.
BioMedInformatics 2024, 4(4), 2149-2172; https://doi.org/10.3390/biomedinformatics4040115
Submission received: 6 September 2024 / Revised: 21 October 2024 / Accepted: 23 October 2024 / Published: 28 October 2024
(This article belongs to the Special Issue Advances in Quantitative Imaging Analysis: From Theory to Practice)

Abstract: Recent developments in Artificial Intelligence have increasingly focused on explainability research. The potential of Explainable Artificial Intelligence (XAI) in producing trustworthy computer-aided diagnosis systems and its usage for knowledge discovery are gaining interest in the medical imaging (MI) community to support the diagnostic process and the discovery of image biomarkers. Most of the existing XAI applications in MI are focused on interpreting the predictions made using deep neural networks, typically including attribution techniques with saliency map approaches and other feature visualization methods. However, these are often criticized for providing incorrect and incomplete representations of the black-box models’ behaviour. This highlights the importance of proposing models intentionally designed to be self-explanatory. In particular, part-prototype (PP) models are interpretable-by-design computer vision (CV) models that base their decision process on learning and identifying representative prototypical parts from input images, and they are gaining increasing interest and showing promising results in MI applications. However, the medical field has unique characteristics that could benefit from more advanced implementations of these types of architectures. This narrative review summarizes existing PP networks, their application in MI analysis, and current challenges.

1. Introduction

Medical imaging (MI) represents the main tool in the diagnosis, treatment planning, and monitoring of almost all medical diseases. In clinical practice, different imaging modalities capture different aspects of anatomical structures and physiological functions. Structural imaging, mainly performed through radiography, computed tomography (CT), structural magnetic resonance imaging (sMRI), and ultrasound (US), focuses on visualizing the anatomical structures of the body, providing detailed pictures of organs, bones, and tissues. Functional imaging measures changes in metabolism, blood flow, tissue composition, and tracer absorption. Positron emission tomography (PET) and single-photon emission computed tomography (SPECT) use isotopes as probes to reflect the spatial distribution of chemical compounds within the body. Functional magnetic resonance imaging (fMRI) can monitor physiological processes using several techniques, such as blood-oxygen-level-dependent imaging (BOLD), diffusion, and assessment of blood perfusion. For almost all image modalities, diagnoses are supported by image analysis procedures carried out by software tools that automatically or semi-automatically extract information patterns from images. More recently, Artificial Intelligence (AI) algorithms were developed to analyze medical images to identify patterns that could suggest the presence of a specific disease, helping in early disease diagnosis and treatment planning. Machine and Deep Learning (ML and DL) models have demonstrated interesting performances in supporting MI analysis, but their black-box nature raises technical, ethical, and legal concerns in this high-stakes domain. In this scenario, Explainable Artificial Intelligence (XAI) offers many opportunities to foster the development of responsible and trustworthy AI systems for healthcare applications [1,2,3,4,5].
The last few decades of ML explainability research have mainly been focused on developing methods for explaining complex black-box models (post hoc interpretability), which typically include saliency-based visual methods [2,3,4]. Saliency approaches exploit the spatial information, preserved through the convolutional layers of DL models, to analyze which parts of an image lead to a resulting decision. However, these methods are often criticized for providing incorrect and incomplete representations of the black-box models’ behaviour, and explanations that are not fully interpretable for a human user [5,6,7,8].
To address the limitations of post hoc XAI, a recent trend has emerged, proposing models intentionally designed to be self-explanatory, known as ante hoc interpretability or interpretability-by-design. The interpretability-by-design approach focuses on developing and training transparent predictors: models that are simple enough to be easily understood, yet sophisticated enough to accurately capture relationships between inputs and outputs. This is achieved by ensuring that the AI model itself is interpretable, either by limiting the number of input features (referred to as sparsity) or by designing decision-making processes that can be replicated by human users [9]. This includes the use of low-complexity models, such as linear models, decision trees, rule-based learners, generalized additive models, and Bayesian models. The primary advantage of simple models is that they are fully understandable without needing additional interpretability techniques. However, the trade-off between transparency and performance is well documented, making it limiting to rely solely on interpretable models for AI applications [9,10,11]. Recently, authors have started proposing black-box models augmented with explainability mechanisms as alternatives to low-complexity ML models, overcoming the well-known trade-off between performance and interpretability [1,4,7]. Part-prototype (PP) neural networks (also known as self-explaining neural networks) constitute a promising research direction in computer vision (CV). PP networks perform image classification based on the identification of class-representative prototypical parts in the input image, a reasoning process based on the recognition-by-components theory [8,12]. These self-explaining models reproduce the human case-based reasoning process, and their explanations are faithful representations of the computations performed by the model to make its decisions [6,8,13].
A PP neural network is generally constituted by a deep neural network backbone trained to learn prototypical parts. This feature extractor is trained only using image-level labels (no additional image sub-annotations are required). Once trained, the PP model classifies an input image by detecting the image patches that are most similar to the learned prototypical parts. The model’s interpretability is provided by the transparent reasoning process implemented in the form of “this looks like that”, and the direct relation between the prototypes and the classification prevents unfaithful explanations [8,13,14,15]. A representative example of the case-based reasoning process implemented by PP network architectures is reported in Figure 1.
Following the XAI taxonomy [5], a PP network can provide both global (the model’s overall behaviour) and local (relative to a certain prediction) explanations by, respectively, showing all the learned prototypes and the detected prototypes for the given prediction, as shown in Figure 2.
In addition to designing explainable models, properly evaluating XAI is crucial for promoting their application in real-world scenarios [5]. The evaluation process of XAI methods is currently an open challenge, and several studies have started proposing systematic evaluation frameworks to objectively assess the system developed [1,16,17]. Aspects such as the multidisciplinarity of XAI and the absence of a gold standard for a “good explanation” make the interpretability assessment a non-trivial task, and it remains unclear how XAI methods, including PP networks, should be evaluated [5].
This narrative review summarizes existing PP networks, their application in MI analysis, and their current open challenges.

2. Foundational Part-Prototype Models

To date, there have been different PP model variants proposed in the literature [8,18], including ProtoPNet [13], XProtoNet [19], ProtoTree [20], ProtoPShare [21], ProtoPool [22], and PIPNet [8]. This section aims to summarize the main theoretical characteristics of the existing PP nets. For each network, we will describe the architecture, training process, and visualization of the prototypical parts as patches of the input images.

2.1. Part-Prototype Network (ProtoPNet)

ProtoPNet [13] was the first PP network proposed in the CV domain. The ProtoPNet architecture is composed of (1) a CNN backbone $f$ with parameters $w_f$; (2) a prototype layer $g_p$ of $M$ prototypes $P = \{p_j\}_{j=1}^{M}$; and (3) a final fully connected layer $h$ with weights $w_h$.
We show the ProtoPNet architecture in Figure 3.
The CNN backbone takes an input image x and returns as output the extracted features z = f ( x | w f ) with dimensions H × W × D . This could be any available CNN’s feature extractor, such as pre-trained architectures like VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, or DenseNet-161.
The network learns $M$ prototypes $P = \{p_j\}_{j=1}^{M}$ of dimensions $H_1 \times W_1 \times D$, where $H_1 \le H$ and $W_1 \le W$. Every prototype $p_j$ represents an activation pattern of the convolutional output, corresponding to a certain prototypical image patch in the original pixel space. In this way, a prototype $p_j$ is the latent representation of a certain image prototypical part. The prototype layer $g_p$ performs the computation reported in Equation (1). It evaluates the squared $L_2$ distances between the $j$-th prototype $p_j$ and patches of $z$ with the same shape as $p_j$. These distances are inverted to produce an activation map of similarity scores, and global max-pooling of every prototype’s activation map returns a single similarity score, indicating how strongly the prototypical part is present in some patch of the input image (Equation (1)). The model requires the allocation of a pre-determined number of prototypes $M_k$ for each class $k \in \{1, \dots, K\}$, and the subset of prototypes associated with class $k$, $P_k \subseteq P$, should capture the most relevant parts for identifying images of that class.
$$g_{p_j}(z) = \max_{\bar{z} \in \mathrm{patches}(z)} \log\left(\frac{\|\bar{z} - p_j\|_2^2 + 1}{\|\bar{z} - p_j\|_2^2 + \epsilon}\right) \qquad (1)$$
The last fully connected layer $h$ multiplies the $M$ similarity scores by the weight matrix $w_h$ to produce the output logits, which are finally normalized with a softmax function to obtain the class probabilities. The weight $w_h^{(j,k)}$ corresponds to the connection between the output of the prototype unit $g_{p_j}$ and the logit of class $k$. Here, a positive (respectively, negative) connection between prototype $j$ and class $k$ means that similarity to prototype $j$ increases (decreases) the probability that the image belongs to class $k$.
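As an illustration of this reasoning process, the following sketch (a simplified, hypothetical PyTorch implementation, not the authors’ code; it assumes $1 \times 1$ prototypes and omits the class-wise prototype allocation) shows how the similarity activation maps, the global max-pooling, and the final fully connected layer fit together.

```python
# Minimal sketch of a ProtoPNet-style prototype layer and classification head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeLayer(nn.Module):
    def __init__(self, num_prototypes: int, depth: int, num_classes: int, eps: float = 1e-4):
        super().__init__()
        # M prototypes of shape H1 x W1 x D; here H1 = W1 = 1 for simplicity (an assumption).
        self.prototypes = nn.Parameter(torch.rand(num_prototypes, depth, 1, 1))
        self.eps = eps
        # Final fully connected layer h mapping similarity scores to class logits.
        self.fc = nn.Linear(num_prototypes, num_classes, bias=False)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: backbone features of shape (B, D, H, W).
        # Squared L2 distance between every latent patch and every prototype,
        # computed via ||z - p||^2 = ||z||^2 - 2 z.p + ||p||^2.
        z_sq = (z ** 2).sum(dim=1, keepdim=True)                    # (B, 1, H, W)
        p = self.prototypes                                         # (M, D, 1, 1)
        p_sq = (p ** 2).sum(dim=(1, 2, 3)).view(1, -1, 1, 1)        # (1, M, 1, 1)
        cross = F.conv2d(z, p)                                      # (B, M, H, W)
        dist = torch.clamp(z_sq - 2 * cross + p_sq, min=0)          # (B, M, H, W)
        # Similarity activation map, Equation (1): log((d + 1) / (d + eps)).
        sim = torch.log((dist + 1) / (dist + self.eps))
        # Global max-pooling: one similarity score per prototype.
        scores = sim.amax(dim=(2, 3))                               # (B, M)
        return self.fc(scores)                                      # class logits
```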
The ProtoPNet training process requires three main steps: (1) optimization of the convolutional and prototype layers; (2) prototypes’ projection; (3) convex optimization of the fully connected layer.
The backbone and prototype layers’ optimization (1) aims to learn a meaningful latent space, where the most important patches for classifying images are clustered (in $L_2$ distance) around semantically similar prototypes of the images’ true classes, and the clusters of different classes are well separated. The weights of the CNN backbone $w_f$ and the prototype layer $P = \{p_j\}_{j=1}^{M}$ are jointly optimized using stochastic gradient descent (SGD), while the weights of the fully connected layer $w_h$ are fixed and set to
$$w_h^{(j,k)} = \begin{cases} 1 & \text{if } p_j \in P_k \\ -0.5 & \text{if } p_j \notin P_k \end{cases}$$
In this way, similarity to a prototype of class $k$ increases the predicted probability that the image belongs to class $k$, while similarity to a prototype not of class $k$ decreases it. SGD optimizes the following loss function:
$$\min_{P, w_f}\ \frac{1}{n}\sum_{i=1}^{n} \mathrm{CrsEnt}\left(h \circ g_p \circ f(x_i),\, y_i\right) + \lambda_1\, \mathrm{Clst} + \lambda_2\, \mathrm{Sep}$$
where the elements are defined as follows:
  • $\mathrm{CrsEnt}(\cdot)$: Cross-entropy loss, which penalizes misclassification on the training data.
  • $\mathrm{Clst}$: Cluster cost, which promotes every training image to have some latent patch that is close to at least one prototype of its own class:
    $$\mathrm{Clst} = \frac{1}{n}\sum_{i=1}^{n}\ \min_{j : p_j \in P_{y_i}}\ \min_{z \in \mathrm{patches}(f(x_i))} \|z - p_j\|_2^2$$
  • $\mathrm{Sep}$: Separation cost, which promotes every latent patch of a training image to stay away from the prototypes not of its class:
    $$\mathrm{Sep} = -\frac{1}{n}\sum_{i=1}^{n}\ \min_{j : p_j \notin P_{y_i}}\ \min_{z \in \mathrm{patches}(f(x_i))} \|z - p_j\|_2^2$$
With the prototypes’ projection (2), every prototype $p_j$ is projected onto the nearest latent patch of a training image of the same class. After projection, a prototype $p_j$ corresponds to some patch of the latent representation $f(x)$ of an image $x$ in the training set.
$$p_j \leftarrow \arg\min_{\bar{z} \in Z_j} \|\bar{z} - p_j\|_2$$
where $Z_j = \{\bar{z} : \bar{z} \in \mathrm{patches}(f(x_i))\ \forall i \text{ s.t. } y_i = k\}$, with $k$ the class of prototype $p_j$.
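A minimal sketch of the projection step is given below. It is an assumption-laden illustration, not the reference implementation: it assumes $1 \times 1$ prototypes, a `proto_classes` list mapping each prototype to its class, and a standard (image, label) dataloader.

```python
# Sketch: replace each prototype with the nearest latent patch from a same-class training image.
import torch

@torch.no_grad()
def project_prototypes(backbone, proto_layer, loader, proto_classes, device="cpu"):
    # proto_classes[j] = class k such that prototype j belongs to P_k (assumed given).
    M, D = proto_layer.prototypes.shape[:2]
    best_dist = torch.full((M,), float("inf"), device=device)
    best_patch = torch.zeros(M, D, device=device)
    for images, labels in loader:
        z = backbone(images.to(device))                       # (B, D, H, W)
        patches = z.permute(0, 2, 3, 1).reshape(-1, D)        # all latent patches
        patch_labels = labels.to(device).repeat_interleave(z.shape[2] * z.shape[3])
        for j in range(M):
            mask = patch_labels == proto_classes[j]           # same-class patches only
            if not mask.any():
                continue
            cand = patches[mask]
            d = ((cand - proto_layer.prototypes[j].view(1, D)) ** 2).sum(dim=1)
            val, idx = d.min(dim=0)
            if val < best_dist[j]:
                best_dist[j] = val
                best_patch[j] = cand[idx]
    # Overwrite the learned prototypes with their nearest same-class latent patches.
    proto_layer.prototypes.copy_(best_patch.view(M, D, 1, 1))
```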
The fully connected layer optimization (3) adjusts the connections between the similarity scores and the class logits to enhance sparsity:
$$\min_{w_h}\ \frac{1}{n}\sum_{i=1}^{n} \mathrm{CrsEnt}\left(h \circ g_p \circ f(x_i),\, y_i\right) + \lambda \sum_{k=1}^{K}\ \sum_{j : p_j \notin P_k} \left|w_h^{(k,j)}\right|$$
that is, for $k$ and $j$ with $p_j \notin P_k$, we have $w_h^{(j,k)} \approx 0$. In doing so, we encourage the model to implement a positive reasoning process (using prototypes which add evidence for a class, rather than prototypes whose presence decreases the evidence for the class), and further improve the accuracy without changing the learned latent space or prototypes.
Once the ProtoPNet is trained, we can visualize every prototype $p_j$ as a patch of an image $x$ in the training set. This visualization process involves the following steps:
  • Forwarding $x$ through ProtoPNet to produce the activation map associated with the prototype $p_j$.
  • Upsampling the activation map to the dimensions of the input image.
  • Localizing the smallest rectangular patch whose corresponding activation is at least as large as the 95th percentile of all the activation values in that map.
We schematically show the visualization process in Figure 4.
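The visualization step can be sketched as follows; this is illustrative only, and the function name, the bilinear upsampling mode, and the return format are assumptions.

```python
# Sketch: upsample one prototype's activation map and find the smallest rectangle
# covering all activations above the 95th percentile.
import numpy as np
import torch
import torch.nn.functional as F

def prototype_bounding_box(activation_map: torch.Tensor, img_h: int, img_w: int,
                           percentile: float = 95.0):
    # activation_map: (H, W) similarity map of a single prototype for one image.
    upsampled = F.interpolate(activation_map[None, None], size=(img_h, img_w),
                              mode="bilinear", align_corners=False)[0, 0]
    arr = upsampled.detach().cpu().numpy()
    threshold = np.percentile(arr, percentile)
    rows, cols = np.where(arr >= threshold)
    # Smallest rectangular patch containing all highly activated pixels.
    return rows.min(), rows.max(), cols.min(), cols.max()
```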

2.2. XProtoNet

XProtoNet [19] differs from ProtoPNet due to its ability to learn representative features within a dynamic area.
XProtoNet is composed of (1) a feature extractor with two main modules: (1.a) a feature module $F$ and (1.b) an occurrence module $M_{p_k^c}$; (2) a similarity score measurement; and (3) a final fully connected layer.
XProtoNet takes an input image $x$ and extracts a feature vector $f_{p_k^c}(x)$ for each of the learned prototypes of class $c$, $P_c = \{p_k^c\}_{k=1}^{M_c}$, with $c \in \{1, \dots, K\}$.
$$f_{p_k^c}(x) = \sum_{u} M_{p_k^c}(x)_u \odot F(x)_u$$
where $u \in [H \times W]$ denotes a spatial location of $F(x)$ and $M_{p_k^c}(x)$. The feature module extracts the image latent representation $F(x) \in \mathbb{R}^{H \times W \times D}$, while the occurrence module predicts an occurrence map for each prototype, $M_{p_k^c}(x) \in \mathbb{R}^{H \times W}$, which indicates where every prototype is likely to appear. The feature vector $f_{p_k^c}(x)$ then represents a certain feature in the highly activated area of the occurrence map.
Then, XProtoNet uses the cosine similarity to compute a similarity score between the feature of the input image and every prototype:
$$s(x, p_k^c) = \frac{f_{p_k^c}(x) \cdot p_k^c}{\|f_{p_k^c}(x)\|\,\|p_k^c\|}$$
Finally, every prototype $p_k^c$ contributes to the prediction score of the related class with an importance determined by the weights of the linear layer $w_{p_k^c}$.
$$p(y^c \mid x) = \sigma\left(\sum_{p_k^c \in P_c} w_{p_k^c}\, s(x, p_k^c)\right)$$
where σ is a sigmoid activation function.
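A compact sketch of this scoring mechanism is shown below; module and parameter names are hypothetical, and a single prototype vector with a $1 \times 1$ convolutional occurrence head is a simplifying assumption.

```python
# Sketch: occurrence-map-weighted feature pooling followed by cosine similarity to a prototype.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OccurrenceScoring(nn.Module):
    def __init__(self, depth: int):
        super().__init__()
        # One prototype vector p in R^D (one slot of the full model).
        self.prototype = nn.Parameter(torch.rand(depth))
        # Occurrence module: predicts where the prototype is likely to appear.
        self.occurrence = nn.Sequential(nn.Conv2d(depth, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: F(x) of shape (B, D, H, W).
        occ = self.occurrence(feat)                       # (B, 1, H, W) occurrence map
        # f_p(x) = sum_u M_p(x)_u * F(x)_u  -> one pooled feature vector per image.
        f_p = (occ * feat).sum(dim=(2, 3))                # (B, D)
        # Cosine similarity between the pooled feature and the prototype.
        return F.cosine_similarity(f_p, self.prototype.expand_as(f_p), dim=1)  # (B,)
```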
XProtoNet follows the ProtoPNet training scheme (backbone and prototype optimization; prototype projection; fully connected layer optimization). Its cost function is composed of three different terms:
$$L_{total} = L_{cls} + \lambda_{clst} L_{clst} + \lambda_{sep} L_{sep} + \lambda_{occur} L_{occur}$$
where the terms are defined as follows:
  • Classification weighted loss, $L_{cls}^c$, is used to address the imbalance in the dataset:
    $$L_{cls}^c = -\sum_i \frac{1}{|N_{pos}^c|}\,(1 - p_i^c)^{\gamma}\, y_i^c \log(p_i^c) \;-\; \sum_i \frac{1}{|N_{neg}^c|}\,(p_i^c)^{\gamma}\, (1 - y_i^c) \log(1 - p_i^c)$$
    where $p_i^c = p(y^c \mid x_i)$ is the prediction score of the $i$-th sample $x_i$, $\gamma$ is a parameter for class balance, $|N_{neg}^c|$ and $|N_{pos}^c|$ are the numbers of negative and positive labels for disease $c$, and $y_i^c \in \{0, 1\}$ is the target label of $x_i$ for disease $c$.
  • Regularization for interpretability, which, similarly to [13], includes two different terms, $L_{clst}^c$ and $L_{sep}^c$, to, respectively, maximize the similarity between $x$ and $p_k^c$ for positive samples and minimize it for negative samples:
    $$L_{clst}^c = -y^c \max_{p_k^c \in P_c} s(x, p_k^c)$$
    $$L_{sep}^c = (1 - y^c) \max_{p_k^c \in P_c} s(x, p_k^c)$$
  • Regularization for the occurrence map:
    $$L_{occur}^c = L_{trans}^c + \sum_{p_k^c \in P_c} \|M_{p_k^c}(x)\|_1$$
    where the two terms act as follows (a short code sketch of this regularization is given after this list):
    - The term $L_{trans}^c$ accounts for the fact that an affine transformation $A(\cdot)$ of an image does not change the relative location of its content, so it should not affect the occurrence map either:
    $$L_{trans}^c = \sum_{p_k^c \in P_c} \left\| A\!\left(M_{p_k^c}(x)\right) - M_{p_k^c}\!\left(A(x)\right) \right\|_1$$
    - The term $\sum_{p_k^c \in P_c} \|M_{p_k^c}(x)\|_1$ regularizes the occurrence area to be as small as possible so that it does not include unnecessary regions.
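The sketch below illustrates the two occurrence-map regularization terms; the tensor shapes and the use of a horizontal flip as the example affine transform are assumptions, and this is not the paper’s implementation.

```python
# Sketch: affine-consistency term plus L1 sparsity penalty on the occurrence maps.
import torch

def occurrence_regularization(occurrence_module, x, affine_fn):
    # affine_fn: an affine transform (e.g. a flip) applied identically to images and maps.
    occ = occurrence_module(x)                  # (B, P, H, W): one map per prototype
    occ_of_transformed = occurrence_module(affine_fn(x))
    transformed_occ = affine_fn(occ)
    # L_trans: the occurrence map of A(x) should match A applied to the map of x.
    l_trans = (transformed_occ - occ_of_transformed).abs().sum(dim=(1, 2, 3)).mean()
    # L1 penalty keeps the activated area small so it excludes unnecessary regions.
    l_small = occ.abs().sum(dim=(1, 2, 3)).mean()
    return l_trans + l_small

# Example affine transform: horizontal flip, which applies to both images and maps.
flip = lambda t: torch.flip(t, dims=[-1])
```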
Once the model is trained, the prototype $p_k^c$ is replaced with the most similar feature vector $f_{p_k^c}$ from the training images.
The learned prototypes are visualized as follows:
  • Upsampling the occurrence maps to the input image size.
  • Normalizing with the maximum value of the upsampled mask.
  • Marking with a contour the occurrence values that are greater than 0.3 times the maximum intensity.

2.3. Neural Prototype Tree (ProtoTree)

Nauta et al. [20] developed the Neural Prototype Tree (ProtoTree), which integrates the prototype learning approach into a hierarchical decision tree structure. The model includes a CNN followed by a binary tree structure, where each node corresponds to a prototypical part. The prototypical parts are tensors that can be visualized as patches of training samples and are learned with backpropagation during the training process without requiring additional annotations. Similarly to the PP networks above, the ProtoTree architecture itself explains the global reasoning process implemented (here, as a hierarchical sequence of steps), and the local explanations are constituted by the route through the tree for every single prediction.
ProtoTree consists of a (1) CNN backbone f with parameters w f ; a (2) binary soft decision tree constituted of N nodes, L leaves, and E edges, which takes the image latent representation and returns the class probability distribution over the K classes.
The CNN backbone takes the input image x and extracts the latent features z = f ( x ; w f ) consisting of D 2-dim ( H × W ) feature maps, ( H × W × D ) .
The binary soft decision tree takes as input the image latent representation $z$ and returns the class probability distribution $\hat{y}$ over the $K$ classes. Each internal node $n \in N$ has two children, $n.\mathrm{right}$ and $n.\mathrm{left}$, connected, respectively, through the edges $e(n, n.\mathrm{right}) \in E$ and $e(n, n.\mathrm{left}) \in E$, and corresponds to a prototype $p_n \in P$. Here, a prototype is a trainable tensor of dimensions $H_1 \times W_1 \times D$, with the same depth as the convolutional output $z$ and $H_1 \le H$, $W_1 \le W$. The prototype $p_n$ acts as a kernel, which slides over $z$, computes the Euclidean distance between $p_n$ and its receptive field $\tilde{z}$, and applies a minimum pooling to select the $H_1 \times W_1 \times D$ patch closest to $p_n$:
$$\tilde{z}^* = \arg\min_{\tilde{z}} \|\tilde{z} - p_n\|$$
The selected closest patch is then routed through both edges (soft routing) of the child nodes with
$$p_{e(n,\, n.\mathrm{right})}(z) = \exp\left(-\|\tilde{z}^* - p_n\|\right)$$
$$p_{e(n,\, n.\mathrm{left})}(z) = 1 - p_{e(n,\, n.\mathrm{right})}(z)$$
So, the similarity between $p_n$ and $\tilde{z}^*$ determines to which extent $z$ is routed to the right child node $n.\mathrm{right}$. Each leaf node $l \in L$ receives the convolutional output $z$ with a probability given by the product of the edge probabilities $p_e$ along the path $\mathcal{P}_l$ followed by $z$ from the root node to leaf $l$:
$$\pi_l(z) = \prod_{e \in \mathcal{P}_l} p_e(z)$$
Each leaf has a trainable parameter $c_l$; the class probability distribution of leaf $l$ over the $K$ classes is obtained by applying the softmax normalization $\sigma(c_l)$.
The class probability distribution over the $K$ classes, $\hat{y}$, is given by the weighted contribution of all the leaves:
$$\hat{y}(x) = \sum_{l \in L} \sigma(c_l) \cdot \pi_l\left(f(x; w_f)\right)$$
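The following sketch illustrates the soft routing and the leaf aggregation for a toy tree; the height-2 topology, the $1 \times 1$ prototypes, and all module names are assumptions made for brevity.

```python
# Sketch: soft routing through a tiny prototype tree and mixture of leaf distributions.
import torch
import torch.nn as nn

class TinyProtoTree(nn.Module):
    def __init__(self, depth_feat: int, num_classes: int, height: int = 2):
        super().__init__()
        assert height == 2, "this sketch hard-codes a height-2 tree"
        self.num_nodes = 2 ** height - 1                   # internal nodes / prototypes
        self.num_leaves = 2 ** height
        self.prototypes = nn.Parameter(torch.rand(self.num_nodes, depth_feat))
        self.leaf_logits = nn.Parameter(torch.zeros(self.num_leaves, num_classes))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, D, H, W) latent features from the CNN backbone.
        B, D, H, W = z.shape
        patches = z.permute(0, 2, 3, 1).reshape(B, H * W, D)
        # Distance of every patch to every node prototype; min-pool over patches.
        d = torch.cdist(patches, self.prototypes.unsqueeze(0).expand(B, -1, -1))
        d_min = d.min(dim=1).values                        # (B, num_nodes)
        p_right = torch.exp(-d_min)                        # soft routing probability
        p_left = 1.0 - p_right
        # Path probabilities: node 0 is the root, nodes 1 and 2 are its children.
        pi = torch.stack([
            p_left[:, 0] * p_left[:, 1],    # leaf 0
            p_left[:, 0] * p_right[:, 1],   # leaf 1
            p_right[:, 0] * p_left[:, 2],   # leaf 2
            p_right[:, 0] * p_right[:, 2],  # leaf 3
        ], dim=1)                                          # (B, num_leaves)
        leaf_dist = torch.softmax(self.leaf_logits, dim=1) # sigma(c_l)
        return pi @ leaf_dist                              # (B, num_classes)
```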
The ProtoTree architecture is initialized by selecting a pre-trained CNN backbone $f(x; w_f)$ and defining the maximum height $h$ of the binary tree, which determines the number of prototypes $|P| = 2^h - 1$. The training process then consists of the (1) optimization of the convolutional and prototype layers; (2) convex optimization of the decision tree’s leaves; and (3) prototypes’ projection.
The CNN backbone $w_f$ and the prototypes $P$ are optimized (1) with backpropagation and gradient descent by minimizing the cross-entropy loss between the predicted class probability distribution $\hat{y}$ and the ground truth $y$.
The convex optimization of c (2) learns the leaves’ distribution using a derivative-free approach:
$$c_l^{(t+1)} = \left(1 - \frac{1}{B}\right) \cdot c_l^{(t)} + \sum_{(x, y) \in \mathcal{T}} \frac{\sigma(c_l^{(t)}) \odot y \odot \pi_l}{\hat{y}}$$
where t indexes a training epoch.
Finally, with the prototypes’ projection (3), $p_n \leftarrow \tilde{z}_n^*$, every prototype $p_n \in P$ is replaced with its nearest latent patch present in the training data, $\tilde{z}_n^*$:
$$\tilde{z}_n^* = \arg\min_{\tilde{z} \in \mathrm{patches}(f(x)),\ x \in \mathcal{T}} \|\tilde{z} - p_n\|$$
Once trained, ProtoTree includes a pruning step to remove ineffective prototypes; this reduces the explanation size, improving the model’s interpretability. Pruning consists of removing leaves with nearly uniform distributions (a sign of little discriminative power between classes) by defining a threshold $\tau$ slightly greater than $\frac{1}{K}$ and pruning the leaves where $\max(\sigma(c_l)) \le \tau$.
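A minimal sketch of this pruning rule is given below; the function and argument names are illustrative.

```python
# Sketch: identify leaves whose class distribution is nearly uniform (max prob <= tau).
import torch

def leaves_to_prune(leaf_logits: torch.Tensor, num_classes: int, margin: float = 0.01):
    tau = 1.0 / num_classes + margin                  # threshold slightly above 1/K
    leaf_dist = torch.softmax(leaf_logits, dim=1)     # sigma(c_l) for every leaf
    return (leaf_dist.max(dim=1).values <= tau).nonzero(as_tuple=True)[0]
```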
Considering $x_n^*$ as the training image corresponding to the latent patch $\tilde{z}_n^*$, the prototype $p_n$ is visualized as a patch of $x_n^*$ as follows:
  • Forwarding $x_n^*$ through $f$: $z = f(x_n^*)$.
  • Creating a two-dimensional similarity map:
    $$S_n(i, j) = \exp\left(-\|\tilde{z}_{(i,j)} - p_n\|\right)$$
    where $(i, j)$ denotes the location of patch $\tilde{z}$ in the patches of $z$.
  • Upsampling $S_n$ with bicubic interpolation to the shape of $x_n^*$.
  • Visualizing $p_n$ as a rectangular patch of $x_n^*$ at the location of the nearest latent patch $\tilde{z}_n^*$.

2.4. Prototypical Part Shared Network (ProtoPShare)

Rymarczyk et al. [21] developed ProtoPShare, a part-prototype network that extends ProtoPNet by addressing two of its limitations through (1) the ability to share prototypes between classes and (2) the ability to identify semantically similar prototypes, even when their representations are distant in the latent space. The authors achieved this by implementing a data-dependent pruning algorithm based on the feature maps. This results in a model with a reduced explanation size, fostering its interpretability.
ProtoPShare shares the same architecture as ProtoPNet, constituted of a CNN backbone $f$ extracting the latent representation of the image $z = f(x)$; a prototype layer $g$ of $m_k$ prototypes per class $P_k = \{p_j\}_{j=1}^{m_k}$; and a fully connected layer with weights $w_h$, which predicts the output class.
ProtoPShare adopts the ProtoPNet training scheme followed by data-dependent merge-pruning to obtain a network with a smaller number of shared prototypes. The pruning process consists of the following steps (sketched in code after the list):
  • Computing the data-dependent similarity for each pair of prototypes $(p, \tilde{p}) \in P^2$, given by the agreement of their similarity scores over all the training input images $x \in X$. This considers two prototypes similar if they activate alike on the training images, even if they are far apart in the latent space:
    $$d_{DD}(p, \tilde{p}) = \frac{1}{\sum_{x \in X} \left(g_p(z) - g_{\tilde{p}}(z)\right)^2}, \qquad z = f(x)$$
  • Selecting a percentage $\zeta$ of the most similar pairs of prototypes $(p, \tilde{p})$ to merge at each step.
  • For each pair, removing prototype $p$ and its weights $w_h^{(p)}$ and reusing prototype $\tilde{p}$, aggregating the weights $w_h^{(\tilde{p})}$ and $w_h^{(p)}$.
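The sketch below illustrates the merge-pruning idea; the precomputed `scores` matrix, the greedy pair selection, and the in-place weight aggregation are simplifying assumptions, not the reference implementation.

```python
# Sketch: merge prototypes whose similarity scores agree on the training images.
import torch

def merge_prototypes(scores: torch.Tensor, fc_weight: torch.Tensor, n_merge: int):
    # scores: (N_images, M) similarity score of every prototype on every training image.
    # fc_weight: (K, M) weights of the last layer, one column per prototype.
    M = scores.shape[1]
    diff = scores.t().unsqueeze(1) - scores.t().unsqueeze(0)      # (M, M, N)
    dist = (diff ** 2).sum(dim=-1)                                # activation disagreement
    dist.fill_diagonal_(float("inf"))
    keep = torch.ones(M, dtype=torch.bool)
    for _ in range(n_merge):
        idx = torch.argmin(dist).item()
        p, p_tilde = divmod(idx, M)                               # most similar pair
        # Reuse p_tilde: aggregate the classification weights of p into it, drop p.
        fc_weight[:, p_tilde] += fc_weight[:, p]
        keep[p] = False
        dist[p, :] = float("inf")
        dist[:, p] = float("inf")
    return keep, fc_weight
```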
The results obtained, compared to other self-explained prototypical part models, suggested an increase in interpretability and a novel ability to discover semantic similarities while maintaining high accuracy.
As with ProtoPNet [13], the visualization process of the prototypes p j is based on the localization of the smallest and most activated rectangular patch in the upsampled image activation map.

2.5. ProtoPool

Rymarczyk et al. [22] developed ProtoPool, integrating into the PP network (1) the concept of prototype soft assignment, which optimizes the model’s compactness without requiring a pruning stage in the training process, and (2) the definition of a focal similarity function to focus the model on salient features. As in ProtoPShare [21], the prototypes can be shared between classes.
ProtoPool shares the same architecture as ProtoPNet, constituted of a CNN backbone $f$; a prototype layer $g$ of $M$ prototypes $P = \{p_j\}_{j=1}^{M} \subset \mathbb{R}^D$; and a fully connected layer with weights $w_h$, which predicts the output class.
The image latent representation $z = f(x)$, with dimensions $H \times W \times D$, is considered a set of $H \cdot W$ vectors, each corresponding to a location in the image and with dimension $D$: $Z_x = \{z_i \in f(x) : z_i \in \mathbb{R}^D,\ i = 1, \dots, H \cdot W\}$.
The prototype layer contains $K$ slots for each class, and each slot is implemented as a distribution $q_k \in \mathbb{R}^M$ over the prototypes in the pool, where $q_k^i$ is the probability of assigning the $i$-th prototype to the slot. The layer computes, for each slot $k$, the aggregated similarity between $Z_x$ and all the prototypes, weighted by the slot distribution $q_k$:
$$g_k = \sum_{i=1}^{M} q_k^i\, g_{p_i}$$
Here, ProtoPool computes the activation $g_p$ of prototype $p$ with respect to image $x$ using the newly introduced focal similarity function:
$$g_p = \max_{z \in Z_x} g_p(z) - \operatorname*{mean}_{z \in Z_x} g_p(z)$$
where
$$g_p(z) = \log\left(1 + \frac{1}{\|z - p\|^2}\right)$$
The focal similarity has the advantage of preventing high activation when all the elements in $Z_x$ are similar to a prototype, which could yield prototypes focused on background regions; moreover, the gradient is passed only through the most activated part of the image. ProtoPool also uses the concept of soft assignment on the prototype distributions, applying the Gumbel-softmax estimator to avoid many prototypes being assigned to one slot (which might decrease the interpretability):
$$\mathrm{Gumbel\text{-}softmax}(q, \tau) = (y_1, \dots, y_M) \in \mathbb{R}^M$$
where
$$y_i = \frac{\exp\left((q_i + \eta_i)/\tau\right)}{\sum_{m=1}^{M} \exp\left((q_m + \eta_m)/\tau\right)}$$
where $\eta_m$, for $m \in \{1, \dots, M\}$, are sampled from the standard Gumbel distribution and $\tau$ is the temperature parameter.
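The focal similarity and the slot assignment can be sketched as follows; the shapes, the small numerical guard on the distance, and the use of `torch.nn.functional.gumbel_softmax` as a stand-in for the estimator described above are assumptions.

```python
# Sketch: focal similarity of prototypes and Gumbel-softmax slot distributions.
import torch
import torch.nn.functional as F

def focal_similarity(z_x: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    # z_x: (B, HW, D) latent vectors of an image batch; prototypes: (M, D).
    dist = torch.cdist(z_x, prototypes.unsqueeze(0).expand(z_x.shape[0], -1, -1)) ** 2
    g = torch.log(1 + 1 / (dist + 1e-6))              # similarity of every patch/prototype
    # Focal similarity: max activation minus mean activation over the image patches,
    # which penalizes prototypes that activate uniformly (e.g. on the background).
    return g.max(dim=1).values - g.mean(dim=1)        # (B, M)

def slot_distributions(q: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    # q: (num_slots, M) learnable slot logits; one Gumbel-softmax distribution per slot.
    return F.gumbel_softmax(q, tau=tau, hard=False, dim=-1)

# Aggregated per-slot similarity g_k = sum_i q_k^i * g_{p_i}:
# slot_scores = slot_distributions(q) @ focal_similarity(z_x, prototypes).T
```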
ProtoPool adopted the ProtoPNet training scheme (backbone and prototype training; prototype projection; backbone, prototype, and fully connected layer fine-tuning), extending the loss function with an orthogonality term that avoids the same prototype being assigned to many slots of one class:
$$L_{orth} = \sum_{i < j \le K} \frac{\langle q_i, q_j \rangle}{\|q_i\|_2 \cdot \|q_j\|_2}$$
As in ProtoPNet, the prototypes are projected to replace every learned abstract prototype with the representation of the nearest training patch:
$$p \leftarrow \arg\min_{z \in Z_C} \|z - p\|_2$$
where $Z_C = \{z : z \in Z_x\ \forall (x, y) : y \in C\}$; here, differently from ProtoPNet, $C$ is the set of all the classes assigned to prototype $p$.
Even for ProtoPool, the visualization process of the prototypes is based on the upsampling of the activation maps, as in ProtoPNet [13].

2.6. Patch-Based Intuitive Prototype Network (PIPNet)

Nauta et al. [8] developed a part-prototype architecture that presents different advantages compared to the existing models: (1) the absence of a semantic gap between the learned prototypes and human vision; (2) compactness; and (3) the ability to handle Out-of-Distribution (OoD) data.
The lack of a semantic gap means that the learned prototypes correlate with human concepts, ground-truth object parts, and human visual perception. This was achieved by adding an extra regularization of the prototypes using self-supervised representation learning. PIPNet also relaxes the assumption that all images from the same class contain the same prototypes (i.e., interpretability is no longer regularized at the class level).
The weights in the fully connected layer connecting the prototypes to classes are trained with a sparsity regularization, which promotes the model’s compactness. PIPNet only needs an upper bound on the number of prototypes, selecting as few prototypes as possible for good classification accuracy and allowing class-sharable prototypes. This regularization was performed by introducing a novel function that optimizes the classification performance and compactness.
Finally, PIPNet manages OoD data by abstaining from a decision when no relevant prototypes are present in the image. This was implemented by not applying the usual softmax normalization to the output logits (softmax is scale-invariant and would prevent near-zero scores for all classes), so that a score of zero stays zero and the prototype presence scores behave independently of each other.
PIPNet consists of (1) a CNN backbone $f$ with parameters $w_f$ followed by a softmax activation; (2) a global max-pooling operation extracting the prototype presence scores $\mathbf{p}$; and (3) a final positive, sparse fully connected layer $w_c$, which acts as a scoring sheet to predict the output classes. PIPNet classifies images based on the presence of prototypical parts in the input image. The relevantly present prototypical parts add up scores (evidence) for the model’s classes, with a proportional contribution determined by the linear weights, and the model can abstain from making a decision when there is not enough evidence for any class.
The CNN backbone takes the input image $x$ and extracts the latent features $z = f(x; w_f)$, consisting of $D$ two-dimensional ($H \times W$) feature maps ($H \times W \times D$). The softmax activation normalizes over the prototype dimension so that $\sum_{d=1}^{D} z_{h,w,d} = 1$. Here, $z_{h,w,d}$ represents the probability that the patch at position $(h, w) \in H \times W$ corresponds to prototype $d$, i.e., a (soft) one-hot encoding of patch $(h, w)$ over the prototypes.
The global max-pooling 2D operation extracts the $D$ prototype presence scores $\mathbf{p} \in [0, 1]^D$, where $p_d$ measures the presence of prototype $d$ in the input image.
Finally, a linear sparse classification layer with positive weights $w_c \in \mathbb{R}_{\ge 0}^{D \times K}$ connects the prototypes to the classes, acting as a scoring sheet: $o = \mathbf{p} \cdot w_c$, where $o$ is $1 \times K$ and $o_k = \sum_{d=1}^{D} p_d\, w_c^{(d,k)}$, with $w_c^{(d,k)}$ representing the contribution of prototype $d$ to class $k$.
The PIPNet training process is composed of two main steps: the (1) Self-Supervised Pre-Training of Prototypes and the (2) PIPNet training.
The prototypes’ pre-training aims to learn an image encoding $\mathbf{p}$ with semantic similarity, independently of the classification task (the last linear layer is kept frozen). PIPNet is optimized for patch alignment with a contrastive-learning-style approach, training it to assign the same prototypes to two views of the same augmented image patch.
This involves a positive pair creation step, $(x', x'')$, obtained by applying data augmentation transformations to the input image $x$, selected so that humans would consider the two views similar. The Adam optimizer minimizes the following loss function:
$$\lambda_A L_A + \lambda_T L_T$$
where
  • Alignment Loss, $L_A$, optimizes for near-binary encodings, where an image patch corresponds to exactly one prototype:
    $$L_A = -\frac{1}{HW} \sum_{(h,w) \in H \times W} \log\left(z'_{h,w,:} \cdot z''_{h,w,:}\right)$$
    where the dot product $z'_{h,w,:} \cdot z''_{h,w,:}$ assesses the similarity between the latent encodings of the two views of an image patch; if $z'_{h,w,:} = z''_{h,w,:}$ and the encodings are one-hot, then $L_A = 0$.
  • Tanh Loss, $L_T$, prevents the trivial solution in which a single prototype node is activated on all image patches of every image in the dataset, by forcing every prototype to be present at least once in a mini-batch:
    $$L_T = -\frac{1}{D} \sum_{d=1}^{D} \log\left(\tanh\left(\sum_{b=1}^{B} p_{b,d}\right) + \epsilon\right)$$
    where $B$ is the mini-batch size and $p_{b,d}$ is the presence score of prototype $d$ for the $b$-th image in the batch.
The PIPNet training step optimizes the classification performance and fine-tunes the prototypes for the downstream classification task by minimizing
$$\lambda_A L_A + \lambda_T L_T + \lambda_C L_C$$
where the Classification Loss, $L_C$, is the negative log-likelihood loss between the ground-truth labels and the predictions. During training, the output scores are computed as $o = \log\left((\mathbf{p}\, w_c)^2 + 1\right)$, which acts as a regularization for sparsity.
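A compact sketch of the PIPNet losses and of the sparse scoring layer follows; the shapes, epsilon values, and the handling of the two augmented views are assumptions, and this is not the released PIP-Net code.

```python
# Sketch: alignment loss, tanh loss, and non-negative sparse class scoring.
import torch

def alignment_loss(z1: torch.Tensor, z2: torch.Tensor, eps: float = 1e-8):
    # z1, z2: (B, D, H, W) softmaxed (over D) encodings of two views of the same patches.
    dots = (z1 * z2).sum(dim=1)                       # (B, H, W) patch-wise dot products
    return -torch.log(dots + eps).mean()              # 0 for equal, one-hot encodings

def tanh_loss(p: torch.Tensor, eps: float = 1e-8):
    # p: (B, D) pooled prototype presence scores for a mini-batch.
    return -torch.log(torch.tanh(p.sum(dim=0)) + eps).mean()  # every prototype present somewhere

def class_scores(p: torch.Tensor, w_c: torch.Tensor):
    # w_c: (D, K) non-negative sparse weights; the log((p w_c)^2 + 1) form used during
    # training acts as a sparsity regularizer on the accumulated evidence.
    return torch.log((p @ w_c.clamp(min=0)) ** 2 + 1)          # (B, K)
```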
The visualization process of a prototype is based on upsampling the latent output around its single most activated latent patch.

3. Application and Advances in Medical Imaging

There is increasing interest in applying PP models in medical imaging.
Most of the existing PP nets were originally designed for general CV purposes, so they typically work by taking RGB images as the input. However, medical imaging diagnosis is often performed using the entire volumetric human anatomy, so being able to efficiently process 3D data volumes might be particularly relevant in this context. In addition, clinicians often use multiple data sources in their decision-making process [23,24], and some authors have started developing explainable models that use multiple data sources to predict outcomes.
We searched major databases (Google Scholar, Scopus, Web of Science) for articles, including original research papers and preprints, on PP networks applied in medical image classification published from 2019 to August 2024. We describe all the collected works, grouped according to the input data type (2D images, 3D images, multimodal data), in the following sections and summarize them in Table 1.

3.1. Two-Dimensional Image Models

Singh et al. proposed three ProtoPNet variants for classifying chest images as COVID-19 positive, pneumonia-positive, or normal: Negative-Positive ProtoPNet (NP-ProtoPNet) [25], Generalized ProtoPNet (Gen-ProtoPNet) [26], and Pseudo-ProtoPNet (Ps-ProtoPNet) [27].
In NP-ProtoPNet [25], the authors modified the original fully connected layer by connecting the similarity scores to correct/incorrect classes’ logits with weights fixed, respectively, to 1 or −1. This allows the network to reason both in a “positive” way (“this looks like that”) and in a “negative” way (“this does not look like those”), i.e., by exclusion. In Gen-ProtoPNet [26], the authors introduced a generalized distance function to select prototypes of varying dimensions. They achieved accuracies of 88.99% in the first work and 87.27% in the second. They also trained black-box CNNs for the same task, showing that their explainable model performances are comparable with state-of-the-art (SOTA) black-box models. For both studies, they used a combination of two datasets, one containing frontal chest radiographs of healthy people and those of pneumonia patients (Chest X-ray [28]), and the COVID-19 Image Data Collection [29], from which they took radiographs of COVID-19 patients. The Ps-ProtoPNet [27] integrates NP-ProtoPNet and Gen-ProtoPNet. For this implementation, they used CT images from the COVIDx CT-2 Dataset available on Kaggle, achieving an accuracy of 99.24%.
Kim et al. [19] implemented XProtoNet, which introduces the prediction of occurrence maps indicating the area where a sign of the disease (i.e., a prototype) is likely to appear, and then compares the features in the predicted area with the prototypes. This novelty can help to identify whether the model is focusing on the correct parts of the image or whether it is being misled by irrelevant features. They used the NIH chest X-ray dataset [30] for a multilabel classification task (atelectasis, cardiomegaly, effusion, infiltration, mass, nodule, pneumonia, pneumothorax, consolidation, edema, emphysema, fibrosis, pleural thickening, and hernia), achieving a mean Area Under the Receiver Operating Characteristic curve (AUROC) of 0.822.
Mohammadjafari et al. [31] applied ProtoPNet to brain MRI scans to detect Alzheimer’s disease, achieving an accuracy of 87.17% in the OASIS dataset [32] and of 91.02% in the ADNI dataset. They used different CNN architectures as feature extractors, and ProtoPNet’s performances were demonstrated to be comparable to or slightly less than those of its black-box counterparts.
Barnett et al. [33] designed a novel PP net by modifying the training process of ProtoPNet with a loss function that includes fine-grained expert image annotations and by using top-k average pooling instead of max-pooling. They trained their model on an internal dataset of digital mammograms for mass margin classification and malignancy prediction and reported an equal or higher accuracy compared to that of ProtoPNet and a black-box model.
Carloni et al. [34] used ProtoPNet to classify benign/malignant breast masses in mammogram images from a publicly available dataset (CBIS-DDSM [35]), achieving an accuracy of 68.5%. While this performance may not yet be ideal for clinical practice, the authors suggest three tasks for a qualitative evaluation of the explanations by a radiologist.
Amorim et al. [36] implemented ProtoPNet to classify histologic patches from the PatchCamelyon dataset [37] into benign and malignant using top-k average pooling instead of max-pooling to extract similarity scores from the activation maps. The authors achieved an accuracy of 98.14% by using Densenet-121 as a feature extractor. This performance was comparable with the one obtained with the black-box CNN.
Flores-Araiza et al. [38] used ProtoPNet to identify types of kidney stones (whewellite, weddellite, anhydrous uric acid, struvite, brushite, and cystine) using a simulated in vivo dataset of endoscopic images. They achieved an accuracy of 88.21% and further evaluated the explanations by perturbing the global visual characteristics of the images (hue, texture, contrast, shape, brightness, and saturation) to describe their relevance to the prototypes and the sensitivity of the model.
Kong et al. [39] designed a dual-path PP network, DP-ProtoNet, to increase the generalization performances of single networks. They applied their model to the public ISIC-HAM10000 skin disease dermatoscopic dataset [40]. Compared to ProtoPNet, DP-ProtoNet achieved a better performance while maintaining the model’s interpretability.
Santiago et al. [41] integrated ProtoPNet with content-based image retrieval to provide explanations in terms of image-level prototypes and patch-level prototypes. They applied their approach to skin lesion diagnoses in the public ISIC dermoscopic images [40], outperforming both black-box models and SOTA explainable approaches.
Cui et al. [42] proposed MBC-ProtoTree, an interpretable fine-grained image classification network based on ProtoTree. They improved ProtoTree by designing a multi-grained feature extraction network, a new background prototype removal mechanism, and a novel loss function. The improved model achieved a higher classification accuracy on the chest X-ray dataset.
Nauta et al. [14] applied their PIPNet to two open benchmark datasets, for skin cancer diagnosis (ISIC) and bone X-ray abnormality detection (MURA), and to two real-world datasets for hip and ankle fracture detection, respectively. From this study, the authors obtained prototypes that were generally in line with medical knowledge and demonstrated the possibility of correcting an undesired reasoning process of the model in a human-in-the-loop configuration.
Santos et al. [43] integrated ProtoPNet into a Deep Active Learning framework to predict diabetic retinopathy on the Messidor dataset [44]. The framework allows ProtoPNet to be trained on a training set of instances selected using a search strategy, and this may offer benefits in scenarios where the datasets are expensive to label. The reported performance demonstrated the success of applying interpretable models with reduced training data.
Wang et al. [45] proposed InterNRL, integrating ProtoPNet into a student–teacher framework together with an accurate global image classifier. They applied their model to breast cancer and retinal disease diagnosis and obtained SOTA performances, demonstrating the success of the reciprocal learning training paradigm in the medical imaging domain.
Xu et al. [46] proposed a prototype-based vision transformer applied to COVID-19 classification. They replaced the last two layers of a transformer encoder with a prototype block similar to ProtoPNet and obtained good performances on three different public datasets.
Sinhamahapatra et al. [47] proposed ProtoVerse, a PP model with a novel objective function applied to vertebral compression fracture classification. Expert radiologists evaluated the model interpretability, and the predictive performances outperformed those of ProtoPNet and other SOTA PP architectures.
Pathak et al. [18] further applied PIPNet to three different public datasets for breast cancer classification, obtaining competitive performances with respect to other SOTA black-box and prototype-based models and assessing the coherence of the model with quantitative metrics.
Gallée et al. [24] applied the Proto-Caps PP net to lung nodule malignancy prediction in chest CT and performed a human evaluation of the model with a user study.
In summary, PP networks are gaining increasing interest in the MI domain. Existing applications are based on the original or implementation variants of the identified PP foundational models (see Section 2). Most of the studies obtained comparable or higher performances than their black-box counterparts in a large variety of public and private real-world hospital datasets. These interesting results strongly support their usage to automatically classify medical images in an interpretable way.

3.2. Three-Dimensional and Multimodal Models

We observed increasing interest in extending PP model applications from general computer vision tasks to 3D medical images. Wei et al. [48] proposed MProtoNet as the first medical prototype network extending PP models to 3D brain tumour classification from multiparametric MRI (mpMRI). It is based on an implementation of XProtoNet with a 3D ResNet backbone and a novel soft masking and online-CAM loss function to enhance the localization of attention regions. The authors further proposed novel metrics to assess the correctness and localization coherence of the prototypes (following the Co-12 taxonomy [49]). MProtoNet obtained a classification accuracy in line with that of the baseline CNN counterpart and ProtoPNet (re-implemented for 3D images).
Vaseli et al. [50] proposed ProtoASNet, a PP model that extracts spatio-temporal feature vectors to detect aortic stenosis from B-mode echocardiography videos. They assessed their model on both a private and a public TMED-2 dataset [51], and it outperformed the baseline ProtoPNet and XProtoNet. De Santi et al. [52] extended PIPNet to classify 3D MRI for Alzheimer’s disease. They tested the model using different CNN backbones and obtained the best result with ResNet-18 3D, and comparable results were obtained with the corresponding black-box counterpart. They further proposed two novel quantitative metrics of explainability: the prototype brain entropy to assess the covariate complexity and localization consistency to assess the consistency of prototype localization (under the Co-12 properties). They also evaluated prototypes with domain experts through a survey and assessed the coherency of prototypes in terms of the localization, pattern, and classification.
Physicians perform medical diagnoses by considering a wide range of patient information, not just image data. This includes clinical, instrumental, and other relevant examinations [23,24]. In this context, developing multimodal PP models could help create systems that implement a decision-making process closer to a real-life application scenario [23,24].
In the context of multimodality-based prototype learning, we can distinguish two main approaches in producing the prototype representation according to the way of introducing different modalities: (1) deterministic prototypes and (2) shifted prototypes [53]. The first approach typically trains different encoders for all the different modalities, and then a connection layer (cross-modal encoder) concatenates the output feature vector extracted, generating a unimodal vector into a joint space used by the PP net to perform a similarity calculation. Alternatively, some studies have introduced multimodal feature vectors to shift the prototype representations without prior fusion. Here, existing approaches leverage auxiliary modalities to enrich the embedding representation of a main modality. The shifted prototype can be explicitly modeled, or their influence can be indirectly added to the objective function used to train an embedded network [53]. Most of the multimodal PP networks proposed in general CV applications integrate visual information with text data [53]. Despite still being considered a relatively unexplored field, we can also find some applications in medical imaging.
Table 1. Part-prototype networks applied in medical imaging. Here, we report predictive accuracy (Acc), balanced accuracy (Bal Acc), and Area Under the Curve (AUC) according to the provided predictive performances.
Paper | Modality | Dataset | Classes | Results
Two-dimensional image models
Singh et al. [25] | X-ray | Chest X-ray, COVID-19 Image | Normal, pneumonia, COVID-19 | Acc = 88.99%
Singh et al. [26] | X-ray | Chest X-ray, COVID-19 Image | Normal, pneumonia, COVID-19 | Acc = 87.27%
Singh et al. [27] | CT | COVIDx CT-2 | Normal, pneumonia, COVID-19 | Acc = 99.24%
Kim et al. [19] | X-ray | NIH chest X-ray | Atelectasis, cardiomegaly, effusion, infiltration, mass, nodule, pneumonia, pneumothorax, consolidation, edema, emphysema, fibrosis, pleural thickening, hernia | Mean AUC = 0.822
Mohammadjafari et al. [31] | MRI | OASIS, ADNI | Normal vs. Alzheimer’s disease | Acc: 87.17% (OASIS), 91.02% (ADNI)
Barnett et al. [33] | Mammography | Internal dataset | Mass margin classification, malignancy prediction | AUC: 0.951 (mass margin), 0.84 (malignancy)
Carloni et al. [34] | Mammography | CBIS-DDSM | Benign, malignant | Acc = 68.5%
Amorim et al. [36] | Histology | PatchCamelyon | Benign, malignant | Acc = 98.14%
Flores-Araiza et al. [38] | Endoscopy | Simulated in vivo dataset | Whewellite, weddellite, anhydrous uric acid, struvite, brushite, cystine | Acc = 88.21%
Kong et al. [39] | Dermatoscopic images | ISIC-HAM10000 | Actinic keratosis, intraepithelial carcinoma, nevi, basal cell carcinoma, benign keratosis-like lesions, dermatofibroma, melanoma, vascular lesions | F1 = 74.6
Santiago et al. [41] | Dermatoscopic images | ISIC-HAM10000 | Actinic keratosis, intraepithelial carcinoma, nevi, basal cell carcinoma, benign keratosis-like lesions, dermatofibroma, melanoma, vascular lesions | Bal Acc = 75.0% (highest, achieved with DenseNet)
Cui et al. [42] | X-ray | Chest X-ray | Normal, pneumonia | Acc = 91.4%
Nauta et al. [14] | Dermoscopic images, X-ray | ISIC, MURA, hip and ankle fracture internal datasets | Benign vs. malignant; normal vs. abnormal; fracture vs. no fracture | Acc: 94.1% (ISIC), 82.1% (MURA), 94.0% (hip), 77.3% (ankle)
Santos et al. [43] | Retinography | Messidor | Healthy vs. diseased retinopathy | AUC = 0.79
Wang et al. [45] | Mammography, retinal OCT | Mammography internal dataset, CMMD, NEH OCT | Cancer vs. non-cancer; benign vs. malignant; normal, drusen, choroidal neovascularization | AUC = 91.49 (internal), AUC = 89.02 (CMMD), Acc = 91.9 (NEH OCT)
Xu et al. [46] | X-ray, lung CT | COVIDx CXR-3, COVID-QU-Ex, Lung CT scan | COVID-19, normal, pneumonia | F1: 99.2, 96.8, 98.5
Sinhamahapatra et al. [47] | CT | VerSe’19 dataset | Fracture vs. healthy | F1 = 75.97
Pathak et al. [18] | Mammography | CBIS, VinDir, CMMD | Benign vs. malignant | F1 (PIPNet model): 63 ± 3% (CBIS), 63 ± 3% (VinDir), 70 ± 1% (CMMD)
Gallée et al. [24] | Thorax CT | LIDC-IDRI | Benign vs. malignant | Acc = 93.0%
Three-dimensional image models
Wei et al. [48] | 3D mpMRI: T1, T1CE, T2, FLAIR | BraTS 2020 | High-grade vs. low-grade glioma | Bal Acc = 85.8%
Vaseli et al. [50] | Echocardiography | Private dataset, TMED-2 | Normal vs. mild vs. severe aortic stenosis | Acc: 80.0% (private), 79.7% (TMED-2)
De Santi et al. [52] | 3D MRI | ADNI | Normal vs. Alzheimer’s disease | Bal Acc = 82.02%
Multimodal models
Wolf et al. [54] | 3D 18F-FDG PET and tabular data | ADNI | Normal vs. Alzheimer’s disease | Bal Acc = 60.7%
Wang et al. [55] | Chest X-ray and reports | MIMIC-CXR | Atelectasis, cardiomegaly, consolidation, edema, enlarged cardiomediastinum, fracture, lung lesion, lung opacity, pleural effusion, pleural other, pneumonia, pneumothorax, support device | Mean AUC = 0.828
De Santi et al. [56] | 3D MRI and age | ADNI | Normal vs. Alzheimer’s disease | Bal Acc = 83.04%
Wolf et al. [54] developed PANIC, a Prototypical Additive Neural Network for Interpretable Classification of Alzheimer’s Disease, which classifies 3D 18F-FDG PET images integrated with tabular data as Alzheimer’s disease or healthy. PANIC integrates the image prototypes extracted by a 3D ProtoPNet with N tabular features using an interpretable Generalized Additive Model. PANIC obtained a higher balanced accuracy compared to the performances obtained with a black-box model for heterogeneous data. Wang et al. [55] proposed MProtoNet, an interpretable multimodal network that performs diagnosis using images and textual information from medical reports. The architecture combines a position embedding and a multimodal attention module, was applied to the chest MIMIC-CXR dataset [57], and reported improvements in predictive performances with respect to ProtoPNet without sacrificing its interpretable nature. De Santi et al. [56] proposed the Patch-based Intuitive Multimodal Prototype Network (PIMPNet), a multimodal prototype classifier that learns 3D image part-prototypes and prototypical values from structured data, to predict a patient’s cognitive level in AD from sMRI and age values. They introduced the concept of an age-prototypical layer to directly learn the relevant age values. PIMPNet performs classification by concatenating the image prototypes extracted by a 3D PIPNet with age prototypes using an interpretable scoring sheet system; however, the age prototypes did not improve the predictive performance with respect to the model trained with images only.
The unique features of the medical domain may have benefits for extending general-purpose PP models so that they can process volumetric data and data coming from different modalities. Although there are emerging proposals of this kind, the literature remains limited compared to that related to 2D image classification. In particular, it seems that an optimal solution for effectively managing data from various modalities has not yet been identified.

4. Evaluation of Prototypes

Evaluating the XAI method is crucial to verify that the explanations produced are robust, sensitive to the model and data, and consistent, but this is still considered an open and developing field [5,58].
Assessing the XAI outputs is considered challenging: there is still no agreement on what interpretability and explainability really are, and a general ground truth to compare explanations is missing. Researchers have proposed different evaluation frameworks to assess XAI methods, and despite there not being a unique solution yet, most approaches agree concerning the need to consider the multidisciplinary and multidomain nature of the explanation [1]. In this scenario, including quantitative metrics to assess the technical appropriateness and reliability of explanations is crucial. To effectively capture the whole impact of XAI on supporting decision systems in real-world scenarios, it is also fundamental to include the human aspect of explanations in the evaluation process [17,59]. The human evaluation should assess how explanations are perceived, their relationship with any expert domain background knowledge, how they affect the classification performances, and how all those aspects vary according to the presentation format of the explanations.
However, while the effectiveness of post hoc explainability methods has been investigated extensively, research on the quantitative analysis of self-explainable approaches is still in its early stages [60]. Particularly relevant is the general evaluation framework proposed by Nauta et al. [49], the so-called “Co-12 properties”, which has been further applied to PP models, providing a method for systematically evaluating this family of interpretable models [15]. This framework is particularly relevant as most XAI evaluation methods were designed for post hoc XAI [15,60]. Here, “explainability” is considered a non-binary characteristic, and the authors developed a categorization scheme based on three main dimensions further divided into 12 conceptual properties. This “evaluation cheat-sheet” suggests the implementation of content-, presentation-, and user-related experiments, involving experts from different relevant disciplines, including computer science, human–computer interaction (HCI), and user experience design. We summarize the evaluation framework, with the properties grouped by their most prominent dimension and contextualized to the family of PP networks, in Table 2.
The availability of synthetic quantitative metrics is fundamental for objectively assessing the explainability quality offered by XAI systems, and authors have started proposing different solutions in this direction.
Wei et al. [48] proposed two novel interpretability metrics to assess the correctness and localization coherence of prototypes (following the Co-12 taxonomy [49]). The correctness was evaluated by proposing the incremental deletion score (IDS), which is based on an incremental deletion process. The localization coherence was evaluated with activation precision (AP) metrics, which assessed the intersection between the activation map and the human-annotated label.
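As an example of how such a localization metric can be operationalized, the sketch below computes a simple activation-precision-style score: the fraction of a prototype’s highly activated region that falls inside a human-annotated mask. This is our interpretation of the description above, not the metric’s reference implementation.

```python
# Sketch: overlap between a prototype's highly activated region and an expert annotation.
import numpy as np

def activation_precision(activation_map: np.ndarray, annotation_mask: np.ndarray,
                         percentile: float = 95.0) -> float:
    # activation_map: upsampled prototype activation, same shape as annotation_mask (H, W).
    # annotation_mask: binary mask of the expert-annotated region of interest.
    threshold = np.percentile(activation_map, percentile)
    activated = activation_map >= threshold
    if activated.sum() == 0:
        return 0.0
    return float((activated & (annotation_mask > 0)).sum() / activated.sum())
```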
Pathak et al. [18] further developed a general framework for coherence (one of the Co-12 properties), the so-called PEF-Coh, to quantitatively measure the quality of prototypes with respect to the domain knowledge for breast cancer prediction from mammography. The PEF-Coh presents six metrics whose assessment requires a dataset annotated with regions of interest: Relevance, Specialization, Uniqueness, Coverage, Class-specific, and localization.
Gallée et al. [24] evaluated the Proto-Caps PP net, which predicts malignancy scores of lung nodules in chest CT and provides explanations in terms of predefined image feature attributes and image prototypes. They evaluated their model by performing a user study with six radiologists, based on a survey testing how explanations affect the (1) user’s performance and (2) trust in the model, and whether they are (3) helpful or not. They structured a questionnaire differentiated by the radiologists’ experience, asking for the malignancy assessment, the confidence in the prediction, and an evaluation of the model’s output; test cases were provided to the users with different levels of explainability. Finally, the questionnaire asked for an overall assessment of the explanations’ helpfulness. The results confirmed the findings in the existing literature, observing that (1) explanations improve performances when the model is correct, but might also convince the user of incorrect predictions; (2) trust in the model is influenced both by the model’s accuracy and by the extent of the model’s reasoning; and (3) explanations are generally perceived as useful for the user.
Recent work aims to evaluate explanations using synthetic datasets. In particular, the FunnyBirds framework introduces a synthetically generated dataset of birds with concept annotation, and a set of metrics to evaluate explanations following the Co-12 taxonomy.
Opłatek et al. [61] applied the FunnyBirds framework to evaluate the prototypes produced by ProtoPNet. Their results suggested that explanations provided as similarity maps, rather than bounding boxes, represent the model more faithfully, resulting in higher-quality explanations. This study further highlights the importance of choosing proper visualization techniques for PP models.
The prototype evaluation process has also recently raised concerns about the correctness of the spatial localization of prototypes [60]. The visualization of the prototypical parts extracted by PP models is generally based on upsampling the extracted similarity maps. Although PP nets are claimed to implement an inherently self-explainable, case-based reasoning process, recent studies have questioned the faithfulness of this model-agnostic upsampling for part-prototype visualization [60,62]. Gautam et al. [60] proposed Prototype Relevance Propagation (PRP) as a model-aware visualization strategy for prototypical parts that addresses the main drawbacks of upsampling-based visualization: low-resolution activation maps and spatially imprecise prototype explanations. Xu-Darme et al. [62] observed that the upsampling process may incorrectly locate parts of the images, suggesting the use of what they report as more faithful saliency methods, such as SmoothGrad or PRP.
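To make the criticized visualization step concrete, the following Python sketch reproduces the typical model-agnostic procedure: a coarse prototype similarity map is bicubically upsampled to the input resolution and thresholded to obtain a bounding box. The map size (7 × 7), the keep_fraction parameter, and the use of OpenCV are illustrative assumptions on our part rather than details of any specific PP implementation; the coarse upsampling shown here is exactly the step whose spatial faithfulness has been questioned.

import cv2
import numpy as np

def visualize_prototype(similarity_map: np.ndarray,
                        input_size: tuple = (224, 224),
                        keep_fraction: float = 0.05):
    """Upsample a coarse similarity map to the input size and return the
    resulting heatmap together with a bounding box around the most
    activated pixels."""
    h, w = input_size
    heatmap = cv2.resize(similarity_map, (w, h),
                         interpolation=cv2.INTER_CUBIC)      # bicubic upsampling
    threshold = np.quantile(heatmap, 1.0 - keep_fraction)    # top-activated pixels
    ys, xs = np.where(heatmap >= threshold)
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return heatmap, box

# Example: a 7x7 map from a CNN backbone mapped onto a 224x224 chest X-ray.
coarse = np.random.rand(7, 7).astype(np.float32)
heatmap, box = visualize_prototype(coarse)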

5. Discussion

Part-prototype networks are self-explainable methods that bring the computational power of DL into an interpretable-by-design decision-making process. PP nets constitute a prominent solution for medical imaging analysis, where explainability is crucial to ensure the trustworthiness of the AI system and where post hoc explainability methods have shown several drawbacks in terms of reliability and completeness [5,18].
ProtoPNet was the first part-prototype network proposed to perform interpretable image classification, and several architectural variants have followed it [13]. These still perform classification through a case-based reasoning process, but differ from the original in how they combine prototypical parts to reach a decision [20], introduce the concept of class-shared prototypes [21], and address some of the reported ProtoPNet limitations, such as the semantic gap between latent and image-space prototype representations [8].
Most PP nets were originally designed for general CV applications and later adapted to the medical imaging domain. Integrating a deep feature extractor into a PP network often leads to performance competitive with fully black-box counterparts [26,31,36], with the additional possibility of correcting undesired reasoning by suppressing individual prototypes [14]. Benefits have already been reported for medical images using general CV PP nets; however, this domain may also require specific adaptations, such as the ability to process 3D image scans and multimodal data [48,54,56]. In particular, an optimal solution for handling multiple data types still seems to be missing.
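As an illustration of prototype suppression, the following sketch assumes a PP head implemented as a linear layer mapping prototype similarity scores to class logits (as in ProtoPNet- or PIP-Net-style models); zeroing the corresponding column removes that prototype's contribution to every class. This is a minimal sketch of the idea, not the exact procedure of the cited work [14].

import torch

@torch.no_grad()
def suppress_prototype(classification_layer: torch.nn.Linear,
                       prototype_idx: int) -> None:
    """Remove one prototype's contribution to all class logits by zeroing
    its column in the classification weight matrix (out_features x prototypes)."""
    classification_layer.weight[:, prototype_idx] = 0.0

# Example: a head with 20 prototypes and 2 classes; prototype 7 is suppressed.
head = torch.nn.Linear(20, 2, bias=False)
suppress_prototype(head, prototype_idx=7)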
Evaluating XAI methods, including PP nets, with standardized quantitative frameworks is still considered an open challenge, but it is crucial for developing reliable and trustworthy systems. Authors generally agree that the evaluation process should be multidisciplinary and span multiple domains in order to capture the many aspects that characterize explanations, as in the evaluation method proposed by Nauta et al. [15]. Most MI applications of PP nets involve an evaluation process that includes domain-expert user studies and quantitative metrics. Although PP nets can be trained using only class-level labels, without further annotations, most of the proposed prototype evaluation metrics rely on additional image-level annotations, such as segmentations of regions of interest. Therefore, the availability of image datasets with region-level annotations would also benefit the development of PP models. Finally, despite the general belief that PP nets are reliable XAI methods due to their self-explainable nature, recent studies have highlighted drawbacks in prototype visualization based on model-agnostic upsampling and have proposed alternative solutions to produce more spatially accurate prototypes [60,62].

Author Contributions

Writing—review and editing, L.A.D.S., F.I.P., F.B., M.F.S. and V.P.; supervision, S.C. and V.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing is not applicable.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AP: activation precision
AUC: Area Under the Curve
AUROC: Area Under the Receiver Operating Characteristic
CNN: Convolutional Neural Network
CT: Computed Tomography
CV: Computer Vision
DL: Deep Learning
IDS: incremental deletion score
MI: Medical Imaging
ML: Machine Learning
MR: magnetic resonance
MRI: magnetic resonance imaging
OoD: Out-of-Distribution
PIPNet: Patch-based Intuitive Prototype Network
PP: part-prototype
RX: Radiography
SGD: stochastic gradient descent
SOTA: State-of-the-Art
XAI: Explainable Artificial Intelligence

References

  1. Salahuddin, Z.; Woodruff, H.C.; Chatterjee, A.; Lambin, P. Transparency of deep neural networks for medical image analysis: A review of interpretability methods. Comput. Biol. Med. 2022, 140, 105111. [Google Scholar] [CrossRef] [PubMed]
  2. Borys, K.; Schmitt, Y.A.; Nauta, M.; Seifert, C.; Krämer, N.; Friedrich, C.M.; Nensa, F. Explainable AI in medical imaging: An overview for clinical practitioners—Saliency-based XAI approaches. Eur. J. Radiol. 2023, 162, 110787. [Google Scholar] [CrossRef] [PubMed]
  3. Borys, K.; Schmitt, Y.A.; Nauta, M.; Seifert, C.; Krämer, N.; Friedrich, C.M.; Nensa, F. Explainable AI in medical imaging: An overview for clinical practitioners—Beyond saliency-based XAI approaches. Eur. J. Radiol. 2023, 162, 110786. [Google Scholar] [CrossRef]
  4. Allgaier, J.; Mulansky, L.; Draelos, R.L.; Pryss, R. How does the model make predictions? A systematic literature review on the explainability power of machine learning in healthcare. Artif. Intell. Med. 2023, 143, 102616. [Google Scholar] [CrossRef]
  5. Longo, L.; Brcic, M.; Cabitza, F.; Choi, J.; Confalonieri, R.; Ser, J.D.; Guidotti, R.; Hayashi, Y.; Herrera, F.; Holzinger, A.; et al. Explainable Artificial Intelligence (XAI) 2.0: A manifesto of open challenges and interdisciplinary research directions. Inf. Fusion 2024, 106, 102301. [Google Scholar] [CrossRef]
  6. Li, O.; Liu, H.; Chen, C.; Rudin, C. Deep Learning for Case-Based Reasoning through Prototypes: A Neural Network that Explains Its Predictions. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  7. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef]
  8. Nauta, M.; Schlötterer, J.; van Keulen, M.; Seifert, C. PIP-Net: Patch-Based Intuitive Prototypes for Interpretable Image Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 2744–2753. [Google Scholar]
  9. Barredo Arrieta, A.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef]
  10. Montavon, G.; Samek, W.; Müller, K.R. Methods for interpreting and understanding deep neural networks. Digit. Signal Process. 2018, 73, 1–15. [Google Scholar] [CrossRef]
  11. David, G.; Eric, V.; Yunyan, W.J.; Matt, T. DARPA’s explainable AI (XAI) program: A retrospective. Appl. Lett. 2021, 2, e61. [Google Scholar] [CrossRef]
  12. Biederman, I. Recognition-by-Components: A Theory of Human Image Understanding. Psychol. Rev. 1987, 94, 115–147. [Google Scholar] [CrossRef]
  13. Chen, C.; Li, O.; Tao, C.; Barnett, A.J.; Su, J.; Rudin, C. This Looks Like That: Deep Learning for Interpretable Image Recognition. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  14. Nauta, M.; Hegeman, J.H.; Geerdink, J.; Schlötterer, J.; Keulen, M.v.; Seifert, C. Interpreting and Correcting Medical Image Classification with PIP-Net. In Artificial Intelligence, ECAI 2023 International Workshops, Proceedings of the XAI^3, TACTIFUL, XI-ML, SEDAMI, RAAIT, AI4S, HYDRA, AI4AI, Kraków, Poland, 30 September–4 October 2023; Springer: Cham, Switzerland, 2024; pp. 198–215. [Google Scholar]
  15. Nauta, M.; Seifert, C. The Co-12 Recipe for Evaluating Interpretable Part-Prototype Image Classifiers. In Explainable Artificial Intelligence, Proceedings of the First World Conference, xAI 2023, Lisbon, Portugal, 26–28 July 2023; Longo, L., Ed.; Springer: Cham, Switzerland, 2023; pp. 397–420. [Google Scholar]
  16. Doshi-Velez, F.; Kim, B. Towards A Rigorous Science of Interpretable Machine Learning. arXiv 2017, arXiv:1702.08608. [Google Scholar]
  17. Jin, W.; Li, X.; Fatehi, M.; Hamarneh, G. Guidelines and evaluation of clinical explainable AI in medical image analysis. Med. Image Anal. 2023, 84, 102684. [Google Scholar] [CrossRef] [PubMed]
  18. Pathak, S.; Schlötterer, J.; Veltman, J.; Geerdink, J.; Keulen, M.V.; Seifert, C.; Pathak, S. Prototype-Based Interpretable Breast Cancer Prediction Models: Analysis and Challenges. In Explainable Artificial Intelligence, Proceedings of the Second World Conference, xAI 2024, Valletta, Malta, 17–19 July 2024; Springer: Cham, Switzerland, 2024; pp. 21–42. [Google Scholar] [CrossRef]
  19. Kim, E.; Kim, S.; Seo, M.; Yoon, S. XProtoNet: Diagnosis in chest radiography with global and local explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15719–15728. [Google Scholar]
  20. Nauta, M.; van Bree, R.; Seifert, C. Neural Prototype Trees for Interpretable Fine-Grained Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14933–14943. [Google Scholar]
  21. Rymarczyk, D.; Struski, Ł.; Tabor, J.; Zieliński, B. ProtoPShare: Prototypical Parts Sharing for Similarity Discovery in Interpretable Image Classification. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; Volume 11. [Google Scholar] [CrossRef]
  22. Rymarczyk, D.; Struski, Ł.; Górszczak, M.; Lewandowska, K.; Tabor, J.; Zieliński, B. Interpretable Image Classification with Differentiable Prototypes Assignment. In Computer Vision—ECCV 2022, Proceedings of the 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2022; Volume 13672, pp. 351–368. [Google Scholar] [CrossRef]
  23. van de Beld, J.J.; Pathak, S.; Geerdink, J.; Hegeman, J.H.; Seifert, C. Feature Importance to Explain Multimodal Prediction Models. a Clinical Use Case. In Explainable Artificial Intelligence, Proceedings of the Second World Conference, xAI 2024, Valletta, Malta, 17–19 July 2024; Springer: Cham, Switzerland, 2024; pp. 84–101. [Google Scholar] [CrossRef]
  24. Gallée, L.; Lisson, C.S.; Lisson, C.G.; Drees, D.; Weig, F.; Vogele, D.; Beer, M.; Götz, M. Evaluating the Explainability of Attributes and Prototypes for a Medical Classification Model. In Explainable Artificial Intelligence, Proceedings of the Second World Conference, xAI 2024, Valletta, Malta, 17–19 July 2024; Springer: Cham, Switzerland, 2024; pp. 43–56. [Google Scholar] [CrossRef]
  25. Singh, G.; Yow, K.C. These do not look like those: An interpretable deep learning model for image recognition. IEEE Access 2021, 9, 41482–41493. [Google Scholar] [CrossRef]
  26. Singh, G.; Yow, K.C. An Interpretable Deep Learning Model for COVID-19 Detection with Chest X-Ray Images. IEEE Access 2021, 9, 85198–85208. [Google Scholar] [CrossRef]
  27. Singh, G.; Yow, K.C. Object or background: An interpretable deep learning model for COVID-19 detection from CT-scan images. Diagnostics 2021, 11, 1732. [Google Scholar] [CrossRef]
  28. Kermany, D.; Zhang, K.; Goldbaum, M. Large dataset of labeled optical coherence tomography (oct) and chest X-ray images. Mendeley Data 2018, 3. [Google Scholar] [CrossRef]
  29. Cohen, J.P.; Morrison, P.; Dao, L. COVID-19 image data collection. arXiv 2020, arXiv:2003.11597. [Google Scholar]
  30. Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R.M. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2097–2106. [Google Scholar]
  31. Mohammadjafari, S.; Cevik, M.; Thanabalasingam, M.; Basar, A.; Initiative, A.D.N. Using ProtoPNet for Interpretable Alzheimer’s Disease Classification. In Proceedings of the Canadian AI 2021, Canadian Artificial Intelligence Association (CAIAC), Vancouver, BC, Canada, 25–28 May 2021; Available online: https://caiac.pubpub.org/pub/klwhoig4 (accessed on 5 September 2024). [CrossRef]
  32. Marcus, D.S.; Wang, T.H.; Parker, J.; Csernansky, J.G.; Morris, J.C.; Buckner, R.L. Open Access Series of Imaging Studies (OASIS): Cross-sectional MRI data in young, middle aged, nondemented, and demented older adults. J. Cogn. Neurosci. 2007, 19, 1498–1507. [Google Scholar] [CrossRef]
  33. Barnett, A.J.; Schwartz, F.R.; Tao, C.; Chen, C.; Ren, Y.; Lo, J.Y.; Rudin, C. A case-based interpretable deep learning model for classification of mass lesions in digital mammography. Nat. Mach. Intell. 2021, 3, 1061–1070. [Google Scholar] [CrossRef]
  34. Carloni, G.; Berti, A.; Iacconi, C.; Pascali, M.A.; Colantonio, S. On the applicability of prototypical part learning in medical images: Breast masses classification using ProtoPNet. In Pattern Recognition, Computer Vision, and Image Processing, Proceedings of the ICPR 2022 International Workshops and Challenges, Montreal, QC, Canada, 21–25 August 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 539–557. [Google Scholar]
  35. Lee, R.S.; Gimenez, F.; Hoogi, A.; Miyake, K.K.; Gorovoy, M.; Rubin, D.L. A curated mammography data set for use in computer-aided detection and diagnosis research. Sci. Data 2017, 4, 170177. [Google Scholar] [CrossRef]
  36. Amorim, J.P.; Abreu, P.H.; Santos, J.; Müller, H. Evaluating Post-hoc Interpretability with Intrinsic Interpretability. arXiv 2023, arXiv:2305.03002. [Google Scholar]
  37. Bejnordi, B.E.; Veta, M.; Van Diest, P.J.; Van Ginneken, B.; Karssemeijer, N.; Litjens, G.; Van Der Laak, J.A.; Hermsen, M.; Manson, Q.F.; Balkenhol, M.; et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. Jama 2017, 318, 2199–2210. [Google Scholar] [CrossRef] [PubMed]
  38. Flores-Araiza, D.; Lopez-Tiro, F.; El-Beze, J.; Hubert, J.; Gonzalez-Mendoza, M.; Ochoa-Ruiz, G.; Daul, C. Deep prototypical-parts ease morphological kidney stone identification and are competitively robust to photometric perturbations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 295–304. [Google Scholar]
  39. Kong, L.; Gong, L.; Wang, G.; Liu, S. DP-ProtoNet: An interpretable dual path prototype network for medical image diagnosis. In Proceedings of the 2023 IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom/BigDataSE/CSE/EUC/iSCI 2023, Exeter, UK, 1–3 November 2023; pp. 2797–2804. [Google Scholar] [CrossRef]
  40. Tschandl, P.; Rosendahl, C.; Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 2018, 5, 180161. [Google Scholar] [CrossRef] [PubMed]
  41. Santiago, C.; Correia, M.; Verdelho, M.R.; Bissoto, A.; Barata, C. Global and Local Explanations for Skin Cancer Diagnosis Using Prototypes. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2023 Workshops, Proceedings of the ISIC 2023, Care-AI 2023, MedAGI 2023, DeCaF 2023, Held in Conjunction with MICCAI 2023, Vancouver, BC, Canada, 8–12 October 2023; Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2023; Volume 14393, pp. 47–56. [Google Scholar] [CrossRef]
  42. Cui, J.; Gong, J.; Wang, G.; Li, J.; Liu, X.; Liu, S. An Novel Interpretable Fine-grained Image Classification Model Based on Improved Neural Prototype Tree. In Proceedings of the IEEE International Symposium on Circuits and Systems, Monterey, CA, USA, 21–25 May 2023. [Google Scholar] [CrossRef]
  43. de A. Santos, I.B.; de Carvalho, A.C.P.L.F. ProtoAL: Interpretable Deep Active Learning with prototypes for medical imaging. arXiv 2024, arXiv:cs.CV/2404.04736. [Google Scholar]
  44. Gu, Z.; Cheng, J.; Fu, H.; Zhou, K.; Hao, H.; Zhao, Y.; Zhang, T.; Gao, S.; Liu, J. CE-Net: Context Encoder Network for 2D Medical Image Segmentation. IEEE Trans. Med. Imaging 2019, 38, 2281–2292. [Google Scholar] [CrossRef] [PubMed]
  45. Wang, C.; Chen, Y.; Liu, F.; Elliott, M.; Kwok, C.F.; Pena-Solorzano, C.; Frazer, H.; Mccarthy, D.J.; Carneiro, G. An Interpretable and Accurate Deep-Learning Diagnosis Framework Modeled With Fully and Semi-Supervised Reciprocal Learning. IEEE Trans. Med. Imaging 2024, 43, 392–404. [Google Scholar] [CrossRef]
  46. Xu, Y.; Meng, Z. Interpretable vision transformer based on prototype parts for COVID-19 detection. IET Image Process. 2024, 18, 1927–1937. [Google Scholar] [CrossRef]
  47. Sinhamahapatra, P.; Shit, S.; Sekuboyina, A.; Husseini, M.; Schinz, D.; Lenhart, N.; Menze, J.; Kirschke, J.; Roscher, K.; Guennemann, S. Enhancing Interpretability of Vertebrae Fracture Grading using Human-interpretable Prototypes. J. Mach. Learn. Biomed. Imaging 2024, 2024, 977–1002. [Google Scholar] [CrossRef]
  48. Wei, Y.; Tam, R.; Tang, X. MProtoNet: A Case-Based Interpretable Model for Brain Tumor Classification with 3D Multi-parametric Magnetic Resonance Imaging. In Proceedings of the Medical Imaging with Deep Learning, Nashville, TN, USA, 10–12 July 2023. [Google Scholar]
  49. Nauta, M.; Trienes, J.; Pathak, S.; Nguyen, E.; Peters, M.; Schmitt, Y.; Schlötterer, J.; van Keulen, M.; Seifert, C. From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI. ACM Comput. Surv. 2023, 55, 1–42. [Google Scholar] [CrossRef]
  50. Vaseli, H.; Gu, A.N.; Amiri, S.N.A.; Tsang, M.Y.; Fung, A.; Kondori, N.; Saadat, A.; Abolmaesumi, P.; Tsang, T.S. ProtoASNet: Dynamic Prototypes for Inherently Interpretable and Uncertainty-Aware Aortic Stenosis Classification in Echocardiography. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2023, Proceedings of the 26th International Conference, Vancouver, BC, Canada, 8–12 October 2023; Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2023; Volume 14225, pp. 368–378. [Google Scholar] [CrossRef]
  51. Huang, Z.; Long, G.; Wessler, B.; Hughes, M. TMED 2: A Dataset for Semi-Supervised Classification of Echocardiograms. 2022. Available online: https://www.michaelchughes.com/papers/HuangEtAl_TMED2_DataPerf_2022.pdf (accessed on 5 September 2024).
  52. De Santi, L.A.; Schlötterer, J.; Scheschenja, M.; Wessendorf, J.; Nauta, M.; Positano, V.; Seifert, C. PIPNet3D: Interpretable Detection of Alzheimer in MRI Scans. arXiv 2024, arXiv:2403.18328. [Google Scholar]
  53. Ma, Y.; Zhao, S.; Wang, W.; Li, Y.; King, I. Multimodality in meta-learning: A comprehensive survey. Know.-Based Syst. 2022, 250, 108976. [Google Scholar] [CrossRef]
  54. Wolf, T.N.; Pölsterl, S.; Wachinger, C. Don’t PANIC: Prototypical Additive Neural Network for Interpretable Classification of Alzheimer’s Disease. In Proceedings of the Information Processing in Medical Imaging: 28th International Conference, IPMI 2023, San Carlos de Bariloche, Argentina, 18–23 June 2023; Proceedings. Springer: Berlin/Heidelberg, Germany, 2023; pp. 82–94. [Google Scholar] [CrossRef]
  55. Wang, G.; Li, J.; Tian, C.; Ma, X.; Liu, S. A Novel Multimodal Prototype Network for Interpretable Medical Image Classification. In Proceedings of the Conference Proceedings—IEEE International Conference on Systems, Man and Cybernetics, Honolulu, HI, USA, 1–4 October 2023; pp. 2577–2583. [Google Scholar] [CrossRef]
  56. De Santi, L.A.; Schlötterer, J.; Nauta, M.; Positano, V.; Seifert, C. Patch-based Intuitive Multimodal Prototypes Network (PIMPNet) for Alzheimer’s Disease classification. In Proceedings of the xAI 2024 Late-breaking Work, Demos and Doctoral Consortium co-located with the 2nd World Conference on eXplainable Artificial Intelligence (xAI 2024), Valletta, Malta, 17–19 July 2024; pp. 73–80. Available online: https://ceur-ws.org/Vol-3793/paper_10.pdf (accessed on 5 September 2024).
  57. Johnson, A.E.W.; Pollard, T.J.; Greenbaum, N.R.; Lungren, M.P.; ying Deng, C.; Peng, Y.; Lu, Z.; Mark, R.G.; Berkowitz, S.J.; Horng, S. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv 2019, arXiv:1901.07042. [Google Scholar]
  58. van der Velden, B.H.; Kuijf, H.J.; Gilhuijs, K.G.; Viergever, M.A. Explainable artificial intelligence (XAI) in deep learning-based medical image analysis. Med. Image Anal. 2022, 79, 102470. [Google Scholar] [CrossRef] [PubMed]
  59. Cabitza, F.; Campagner, A.; Ronzio, L.; Cameli, M.; Mandoli, G.E.; Pastore, M.C.; Sconfienza, L.M.; Folgado, D.; Barandas, M.; Gamboa, H. Rams, hounds and white boxes: Investigating human–AI collaboration protocols in medical diagnosis. Artif. Intell. Med. 2023, 138, 102506. [Google Scholar] [CrossRef]
  60. Gautam, S.; Höhne, M.M.C.; Hansen, S.; Jenssen, R.; Kampffmeyer, M. This looks More Like that: Enhancing Self-Explaining Models by Prototypical Relevance Propagation. Pattern Recogn. 2023, 136, 109172. [Google Scholar] [CrossRef]
  61. Opłatek, S.; Rymarczyk, D.; Zieliński, B. Revisiting FunnyBirds Evaluation Framework for Prototypical Parts Networks. In Explainable Artificial Intelligence, Proceedings of the Second World Conference, xAI 2024, Valletta, Malta, 17–19 July 2024; Springer: Cham, Switzerland, 2024; pp. 57–68. [Google Scholar] [CrossRef]
  62. Xu-Darme, R.; Quénot, G.; Chihani, Z.; Rousset, M.C. Sanity checks for patch visualisation in prototype-based image classification. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 3691–3696. [Google Scholar] [CrossRef]
Figure 1. Part-prototype network reasoning process during prediction (normal vs. pneumonia classification task from RX images). These models learn prototypes in terms of representative image regions for the predicted class from the training set and perform the classification based on their detection in new images (prototypical regions marked with yellow boxes).
Figure 2. Global and local explanations of a part-prototype network (classification of Alzheimer’s disease from MR images). The global explanation shows all the learned prototypes. The local explanation shows the model’s reasoning for a specific instance.
Figure 3. ProtoPNet architecture.
Figure 4. Prototypical part visualization in a normal vs. pneumonia classification task for a normal test image (marked with a yellow box). Radiological images are displayed in standard grayscale (windowed over the entire signal range), while activation maps and heatmaps are visualized using the same color map.
Table 2. Conceptual properties (Co-12 properties) for quantitatively assessing the quality of part-prototype networks, extracted from Nauta et al. [15].
Co-12 Property: Description
Content
Correctness: Since PP models are interpretable by design, the explanations are generated together with the prediction, and the reasoning process is correctly represented by design. However, the faithfulness of the prototype visualization (from the latent representation to the input image patches), originally performed through bicubic upsampling, is not guaranteed by design and should be evaluated.
Completeness: The relation between the prototypes and the classes is shown transparently, so output-completeness is fulfilled by design, but the computation performed by the CNN backbone is not taken into consideration.
Consistency: PP models should not have random components in their design, but nondeterminism may arise from the backbone’s initialization and random seeds. Consistency can be assessed by comparing explanations from models trained with different initializations or with different shuffling of the training data.
Continuity: It should be evaluated whether slightly perturbed inputs lead to the same explanation, given that the model makes the same classification.
Contrastivity: The built-in interpretability of PP models provides contrastivity by design, such that a different classification corresponds to a different reasoning and, hence, to a different explanation. This evaluation might also include a target sensitivity analysis inspecting where prototypes are detected in the test image.
Covariate complexity: The complexity of the features captured by the prototypes is assessed against a ground truth, such as predefined concepts provided by human judgement (perceived homogeneity) or object part annotations.
Presentation
Compactness: The number of prototypes constituting the full classification model (global explanation size), the number detected in each input image (local explanation size), and the redundancy of the information presented by different prototypes should be evaluated. The size of the explanation should be appropriate so as not to overwhelm the user.
Composition: It should be assessed how prototypes can best be presented to the user and how they can best be structured and included in the reasoning process, e.g., by comparing different explanation formats or by asking users about their preferences regarding the presentation and structure of the explanation.
Confidence: The confidence of the explanation generation method should be estimated, including measurements such as the prototype similarity scores.
User
Context: PP models should be evaluated with application-grounded user studies, similarly to evaluations with heatmaps, to understand users’ needs.
Coherence: Prototypes are often evaluated based on anecdotal evidence, through automated evaluation with an annotated dataset, or through manual evaluation. User studies might include the assessment of satisfaction, preference, and trust for part-prototypes.
Controllability: The ability to directly manipulate the explanation and the model’s reasoning, e.g., enabling users to suppress or modify learned prototypes, possibly with the aid of a graphical user interface.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
