Article

Self-Supervised Autoencoders for Visual Anomaly Detection

by Alexander Bauer 1,2,*, Shinichi Nakajima 1,2,3 and Klaus-Robert Müller 1,2,4,5,*

1 Berlin Institute for the Foundations of Learning and Data, 10587 Berlin, Germany
2 Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany
3 RIKEN Center for AIP, Tokyo 103-0027, Japan
4 Department of Artificial Intelligence, Korea University, Anam-dong, Seongbuk-gu, Seoul 02841, Republic of Korea
5 Max-Planck-Institut für Informatik, Saarland Informatics Campus E1 4, 66123 Saarbrücken, Germany
* Authors to whom correspondence should be addressed.
Mathematics 2024, 12(24), 3988; https://doi.org/10.3390/math12243988
Submission received: 18 November 2024 / Revised: 13 December 2024 / Accepted: 15 December 2024 / Published: 18 December 2024
Figure 1. Anomaly detection results of our approach on a few images from the MVTec AD dataset. The first row shows the input images and the second row an overlay with the predicted anomaly heatmap.

Figure 2. Illustration of the reconstruction effect of our model trained either on the wood, carpet, or grid images (without defects) from the MVTec AD dataset.

Figure 3. Illustration of our anomaly detection process after training. Given input $\hat{x}$, we first, see (1), compute an output $f_\theta(\hat{x})$ by replicating normal regions and replacing irregularities with locally consistent patterns. Then, see (2), we compute a pixel-wise squared difference $(\hat{x} - f_\theta(\hat{x}))^2$, which is subsequently averaged over the color channels to produce the difference map $\mathrm{Diff}[\hat{x}, f_\theta(\hat{x})] \in \mathbb{R}^{h \times w}$. In the last step, see (3), we apply a series of averaging convolutions $G_k$ to the difference map to produce our final anomaly heatmap $\mathrm{anomap}_{f_\theta}^{n,k}(\hat{x})$.

Figure 4. Illustration of data generation for training. After randomly choosing the number and locations of the patches to be modified, we create new content by gluing the extracted patches with the corresponding replacements. Given a real-valued mask $M \in [0,1]^{\tilde{h} \times \tilde{w} \times 3}$ marking corrupted regions within a patch, an original image patch $x$, and a corresponding replacement $y$, we create the next corrupted patch by merging the two patches together according to the formula $\hat{x} := M \odot y + \bar{M} \odot x$. All mask shapes $M$ are created by applying Gaussian distortion to the same (static) mask, representing a filled disk at the center of the patch with a smoothly fading boundary toward the exterior of the disk.

Figure 5. Illustration of our network architecture SDC-CCM, including the convex combination module (CCM) marked in brown and the skip connections represented by the horizontal arrows. Without these additional elements, we obtain our baseline architecture SDC-AE.

Figure 6. Illustration of the CCM module. The module receives two inputs: $x_s$ along the skip connection and $x_c$ from the current layer below. In the first step (image on the left), we compute the squared difference of the two and stack it together with the original values $[x_s, x_c, (x_s - x_c)^2]$. This combined feature map is processed by two convolutional layers. The first layer uses batch normalization with ReLU activation. The second layer uses batch normalization and a sigmoid activation function to produce a coefficient matrix $\beta$. In the second step (image on the right), we compute the output of the module as a (component-wise) convex combination $x_o = \beta \cdot x_s + (\mathbf{1} - \beta) \cdot x_c$, where $\mathbf{1}$ is a tensor of ones.

Figure 7. Illustration of the general concept of the orthogonal projection $f$ onto a data manifold $\mathcal{D}$. Here, anomalous samples $\hat{x} \in \mathbb{R}^n$ (red dots) are projected to points $x := f(\hat{x}) \in \mathcal{D}$ (blue dots) in a way that minimizes the distance $d(\hat{x}, x) = \inf_{y \in \mathcal{D}} d(\hat{x}, y)$.

Figure 8. Illustration of the connections between the different types of regularized autoencoders. For a small variance of the corruption noise, the DAE becomes similar to the CAE. This, in turn, gives rise to the RCAE, where the contraction is imposed explicitly on the whole reconstruction mapping. A special instance of PAE given by the orthogonal projection yields an optimal solution for the optimization problem of the RCAE. On the other hand, the training objective for PAE can be seen as an extension of DAE to more complex input modifications beyond additive noise. Finally, a common variant of the sparse autoencoder (SAE) applies an $l^1$ penalty on the hidden units, resulting in saturation toward zero similar to the CAE.

Figure 9. Illustration of the conservation effect of the orthogonal projections with respect to different $l^p$-norms. Here, the anomalous sample $\hat{x}$ is orthogonally projected onto the manifold $\mathcal{D}$ (depicted by a red ellipsoid) according to $\|\hat{x} - y_p^*\|_p = \inf_{y \in \mathcal{D}} \|\hat{x} - y\|_p$ for $p \in \{1, 2, \infty\}$. The remaining three colors (green, blue, and yellow) represent rescaled unit circles around $\hat{x}$ with respect to the $l^1$, $l^2$, and $l^\infty$-norms. The intersection points of each circle with $\mathcal{D}$ mark the orthogonal projection of $\hat{x}$ onto $\mathcal{D}$ for the corresponding norm. We can see that projections $y_p^*$ for lower $p$-values better preserve the content in $\hat{x}$ according to the higher sparsity of the difference $\hat{x} - y_p^*$, which results in smaller modified regions $S(\hat{x}, y_p^*)$.

Figure 10. Illustration of the concept of a transition set. Consider a 2D image tensor identified with a column vector $x \in \mathbb{R}^n$, $n = 20^2$, which is partitioned according to $S \subseteq \{1, \dots, n\}$ (gray area) and $\bar{S} := \{1, \dots, n\} \setminus S$ (union of light blue and dark blue areas). The transition set $B$ (dark blue area) glues the two disconnected sets $S$ and $\bar{S} \setminus B$ together such that $x \in \mathcal{D}$ is feasible.

Figure 11. Illustration of our anomaly segmentation results (with SDC-CCM) as an overlay of the original image and the anomaly heatmap. Each row shows three random examples from a category (carpet, grid, leather, transistor, and cable) in the MVTec AD dataset. In each pair, the first image represents the input to the model and the second image a corresponding anomaly heatmap.

Figure A1. Illustration of a counterexample for the claim that orthogonal projections maximally preserve normal regions in the inputs. Here, $\hat{x} \in \mathbb{Z}^5$ is the modified version of the original input $x \in \mathcal{D}$ according to the partition $S, \bar{S}$, and $f(\hat{x})$ denotes the orthogonal projection of $\hat{x}$ onto $\mathcal{D}$ with respect to the $l^2$-norm. This example also shows that the orthogonality property depends on our choice of the distance metric.

Figure A2. Illustration of the concept of a transition set on two examples with different shapes. Each of the two images represents an MRF $x = (x_1, \dots, x_n)$, $n \in \mathbb{N}$, of Markov order $K \in \mathbb{N}$ with nodes corresponding to the individual pixels with values from a finite set of states $x_i \in I$, $|I| < \infty$. The grey area marks the corrupted region $S \subseteq \{1, \dots, n\}$, where the union of the dark blue and light blue areas is the complement $\bar{S} := \{1, \dots, n\} \setminus S$ marking the normal region. The dark blue part of $\bar{S}$ corresponds to the transition set $B \subseteq \bar{S}$. $W \leqslant |I|^K$ denotes (loosely) the width at the thickest part of the tube $B$ around $S$.

Figure A3. Illustration of the importance of modeling long-range dependencies facilitated by dilated convolutions for achieving accurate reconstruction. We can observe how the reconstruction of the model without the SDC modules (middle image) suffers from a blind spot effect toward the center of the corrupted region. This happens due to the insufficient context provided by the normal areas, forcing the model to predict an average of all possibilities.

Figure A4. Illustration of the qualitative improvement when using SDC-CCM over SDC-AE. We show six examples: three from the "cable" category and three from the "transistor" category of the MVTec AD dataset. Each row displays the original image, the reconstruction produced by SDC-CCM (reconstruction II), the reconstruction produced by SDC-AE (reconstruction I), the anomaly heatmap from SDC-CCM (anomaly heatmap II), and the anomaly heatmap from SDC-AE (anomaly heatmap I). Note the significant improvement in the quality of the heatmaps.

Figure A5. Illustration of the qualitative improvement when using SDC-CCM over SDC-AE on texture categories from the MVTec AD dataset. We show five examples, one from each of the following categories: "carpet", "grid", "leather", "tile", and "wood". Each row displays the original image, the reconstruction produced by SDC-CCM (reconstruction II), the reconstruction produced by SDC-AE (reconstruction I), the anomaly heatmap from SDC-CCM (anomaly heatmap II), and the anomaly heatmap from SDC-AE (anomaly heatmap I).

Abstract

We focus on detecting anomalies in images where the data distribution is supported by a lower-dimensional embedded manifold. Approaches based on autoencoders have aimed to control their capacity either by reducing the size of the bottleneck layer or by imposing sparsity constraints on their activations. However, none of these techniques explicitly penalize the reconstruction of anomalous regions, often resulting in poor detection. We tackle this problem by adapting a self-supervised learning regime that essentially implements a denoising autoencoder with structured non-i.i.d. noise. Informally, our objective is to regularize the model to produce locally consistent reconstructions while replacing irregularities by acting as a filter that removes anomalous patterns. Formally, we show that the resulting model resembles a nonlinear orthogonal projection of partially corrupted images onto the submanifold of uncorrupted examples. Furthermore, we identify the orthogonal projection as an optimal solution for a specific regularized autoencoder related to contractive and denoising variants. In addition, orthogonal projection provides a conservation effect by largely preserving the original content of its arguments. Together, these properties facilitate an accurate detection and localization of anomalous regions by means of the reconstruction error. We support our theoretical analysis by achieving state-of-the-art results (image/pixel-level AUROC of 99.8/99.2%) on the MVTec AD dataset—a challenging benchmark for anomaly detection in the manufacturing domain.

1. Introduction

The task of anomaly detection (AD) in a broad sense corresponds to searching for patterns that considerably deviate from some concept of normality. The criteria for what is normal and what is an anomaly can be very subtle and depend heavily on the application. Visual AD specifically aims to detect and locate anomalous regions in imagery data, with practical applications in the industrial, medical, and other domains [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17]. The continuous research in this area has produced a variety of methods ranging from classical unsupervised approaches like PCA [18,19,20], one-class SVM [21], SVDD [22], nearest neighbor algorithms [23,24], and KDE [25] to more recent methods including various types of autoencoders [5,12,26,27,28,29,30,31,32], deep one-class classification [33,34,35], generative models [6,13], self-supervised approaches [1,3,36,37,38,39,40,41], and others [3,7,8,9,10,11,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79].
The one-class classifiers [21,22], for example, aim at learning a tight decision boundary around the normal examples in the feature space and define a distance-based anomaly score relative to the center of the training data. The success of this approach strongly depends on the availability of suitable features. Therefore, it usually allows only for the detection of outliers that greatly deviate from the normal structure. In practice, however, we are often interested in more subtle deviations, which require a good representation of the data manifold. The same applies to combined approaches based on training deep neural networks (DNNs) with a one-class objective [35,80,81]. Although this objective encourages the network to concentrate the training data in a small region in the feature space, there is no explicit motivation for anomalous examples to be mapped outside of the decision boundary. In fact, the one-class objective gives preference to models that map their input domain to a narrower region and does not explicitly focus on separating anomalous examples from the normal data.
Recently, deep autoencoders (AEs) have been used for the task of anomaly detection in the visual domain [1,40,41]. Unlike the one-class approaches, they additionally enable the localization of anomalous regions by exploiting the pixel-wise nature of the corresponding training objective. A common belief is that, by optimizing for the reconstruction error on anomaly-free examples, the corresponding network should fail to reconstruct anomalous regions in the application phase. Typically, this goal is pursued by controlling the capacity of the model either directly by reducing the size of the bottleneck layer or implicitly by imposing sparsity constraints on parts of the corresponding network [32,82,83,84,85]. However, neither of these techniques explicitly penalizes the reconstruction of anomalous regions, often resulting in poor detection. This is similar to the problem of training with the one-class objective, where no explicit mechanism exists for preventing anomalous examples from being mapped to the normal region. In fact, AEs trained in an unsupervised manner aim to compress and accurately reconstruct the input images and do not care much about the actual distinction between normal and anomalous samples. As a result, the reconstruction errors for the anomalous and normal regions can be very similar, preventing reliable detection and localization.
In this paper, we propose a self-supervised learning framework that introduces discriminative information during training to prevent the good reconstruction of anomalous patterns in the testing phase. We begin with the observation that the abnormality of a region in an image is partially characterized by how well it can be reconstructed from the context of the surrounding pixels. Imagine an image where a small patch has been cut out by setting the corresponding pixel values to zeros. We can try to reconstruct this patch by interpolating the surrounding pixel values according to our knowledge about the distribution of the training data. If the reconstruction significantly deviates from the original content, we consider the corresponding patch anomalous, and normal otherwise. Following this idea, we feed partially distorted images to the network during training while forcing it to recreate the original content, similar to the task of neural image completion [86,87,88,89,90,91]. However, instead of setting the individual pixel values to zeros, as for the completion task, we apply a patch transformation that avoids introducing easily detectable artifacts. To succeed, our model must accomplish two different tasks: (a) detection of regions deviating from the expected pattern and (b) recreation of the original content from the uncorrupted areas. The second task, in particular, imposes an important regularization effect on the model. Together, they provide a powerful training signal for accurate AD.
Technically, our approach can be seen as training a denoising autoencoder (DAE) on artificially corrupted inputs following a specific noise model. Although the general objective of DAEs allows for arbitrary stochastic corruptions, previous works [85,92,93] only considered simple unstructured noise, which can be characterized by the i.i.d. assumption on the distribution of the individual pixels. In contrast, we consider structured noise with spatial dependencies between the corruption variables, implemented by a specific form of partial occlusions. Altogether, the resulting training effect regularizes the model to produce locally consistent reconstructions while replacing irregularities, therefore acting as a filter that removes anomalous patterns. Figure 1 illustrates a few examples to give a sense of the visual quality of the resulting heatmaps when training according to our approach.
To support our idea, we performed a theoretical analysis of the proposed method, which provided a number of interesting insights. Specifically, we show that the resulting model approximates a mapping which, with an increasing number of image pixels, converges (stochastically) to the orthogonal projection of partially corrupted images onto the submanifold of normal examples. This is consistent with the findings in [85] showing that, in a non-degenerate case (i.e., for distributions $p(x)$ with support of nonzero volume) with i.i.d. pixel noise, the optimal reconstruction $r^*$ is given by a mapping that corrects the noisy inputs by pushing them toward the region of high probability density according to $r^*(x) = x + \sigma^2 \nabla \log p(x) + o(\sigma^2)$ as $\sigma^2 \to 0$. Our results, in contrast, concern the degenerate case of a point mass distribution collapsing onto a lower-dimensional manifold, where we talk about projections instead, and we establish here the following complementary connection. While the gradient $\nabla \log p(x)$ points in the direction of the maximal increase in probability at point $x$, the orthogonal projection onto the data manifold is characterized by the shortest distance between $x$ and its projected image. Therefore, we can naturally measure the abnormality of anomalous samples according to their distance from the manifold of normal examples.
Additionally, we investigate the effect of projection mappings on the segmentation accuracy of anomalous regions. Specifically, we analyze the conservation property of the orthogonal projection, which removes anomalous patterns in a way that largely preserves the original content. Furthermore, we establish a close connection between our approach and previous autoencoding models, including the contractive and denoising variants. In particular, we show that the orthogonal projection provides an optimal solution for the reconstruction contractive autoencoder.
The rest of this paper is organized as follows: In Section 2, we outline the main differences between our approach and related methods, concluding with a concise summary of our contributions. In Section 3, we formally introduce the proposed framework, including the training objective, data generation, and model architecture, followed by a theoretical analysis in Section 4. In Section 5, we evaluate the performance of our approach, and we conclude in Section 6.

2. Related Works

The plethora of existing AD methods (see [78] for an overview) can be roughly divided into three groups: probabilistic models, one-class classification methods, and reconstruction models; our approach is a member of the latter. Without going deeper into the details, we note that the first two groups appear less suitable in our case, where the data distribution is supported by a lower-dimensional embedded manifold. Due to this fact, the Lebesgue measure of the data manifold is zero, excluding the existence of a density function over the input space that takes positive values only on the manifold. On the other hand, one-class approaches implicitly assume a nonzero volume of the corresponding data in order to provide accurate detection results, which again violates our assumption about the data distribution.
Here, we focus on the reconstruction methods represented by the regularized AEs, including the contractive [84] and denoising [92] variants, and a number of self-supervised methods inspired by the task of neural image completion [1,40,41]. The central idea behind the contractive autoencoder (CAE), for example, is to learn a reconstruction mapping $r(x) = d(e(x))$, composed of encoder and decoder parts, by adding a regularization term $\|D_x e\|_F^2$ to the objective, which implicitly promotes a reduction in the magnitude of the directional derivatives $\nabla_v e(x)$ for directions $v$ pointing outside the tangent plane of the data manifold. In contrast, the denoising autoencoder (DAE) aims to minimize the reconstruction error without regularization terms but augments the training procedure by adding stochastic noise to the inputs.
Previous work [85] demonstrated that a DAE with small noise corruption of variance $\sigma^2$ is similar to a CAE with penalty coefficient $\lambda = \sigma^2$ but where the contraction is imposed explicitly on the whole reconstruction function rather than on the encoder part alone. Our approach is also based on training a model to reconstruct original content from modified inputs but differs from the previous works on DAEs by using more involved input modifications beyond i.i.d. corruption noise. Specifically, the corruptions introduced during training may represent additive noise, stochastic occlusions, and even deterministic modifications (e.g., geometric transformations), allowing for a wider range of anomalous patterns to be detectable during the application phase.
Another group of related methods has been inspired by the task of neural image completion [86,87,88,89,90,91]. This group includes a number of self-supervised methods like LSR [3], RIAD [39], CutPaste [40], InTra [41], DRAEM [38], and SimpleNet [94], which (similar to our approach) aim at training a model to reconstruct the original content from corrupted inputs in either the latent or the input space. The main differences between these methods and our approach lie in the choice of data augmentation during training and the specific network architecture of our model, which in summary led to higher performance in our experiments on the MVTec AD dataset [95]. Two other recent studies, PatchCore [10] and PNI [68], reported impressive results on this benchmark. Both are based on the usage of memory banks comprising locally aware nominal patch-level feature representations extracted from pretrained networks. In contrast, our approach neither requires pretraining nor additional storage for a memory bank and performs on par with the two methods. In the following, we briefly summarize our contributions.
As our first contribution, we formulated an effective AD framework for a special use case (e.g., natural images), where the normal examples live on a lower-dimensional submanifold embedded in the input space and satisfy the additional assumption that the covariance between spatially close components vanishes with increasing distance. While similar ideas have been used in different domains, our approach distinguishes itself by the specific form of the noise model (implemented by partial occlusions with complex shapes) and the network architecture (SDC-AE and SDC-CCM), which together produce a solution that consistently outperforms previous methods. In particular, we achieved state-of-the-art results for both detection (AUROC of 99.8%) and localization (AUROC of 99.2%) tasks on the MVTec AD dataset, a challenging benchmark for visual anomaly detection in the manufacturing domain.
As our second contribution, we performed a rigorous theoretical analysis of our method, which provided a number of interesting insights. Specifically, we show that, with an increasing number of input pixels, the corresponding model resembles the orthogonal projection of partially corrupted images onto the submanifold of uncorrupted examples. This covers various forms of input modifications, including partial occlusions and additive noise (see Theorems 1 and 2). At the same time, the orthogonal projection acts as a filter for irregularities by removing anomalous patterns from the inputs in a way that largely preserves the original content (see Propositions 2 and 3). Together, these properties facilitate the accurate detection and localization of anomalous regions by means of the reconstruction error, supporting our hypothesis about the training procedure.
As our third contribution, we improved upon the previous understanding of the connections between contractive and denoising autoencoders. Specifically, we identified the nonlinear orthogonal projection as an optimal solution (see Proposition 1) minimizing the training objective of the reconstruction contractive autoencoder (RCAE). On the other hand, our training objective essentially corresponds to training a DAE with a specific noise model imposing strong spatial correlation on the corruption variables. The corresponding model, in turn, approximates a mapping that (with an increasing number of input pixels) converges stochastically to the orthogonal projection, the optimal solution of the RCAE. In contrast to the previous results [85], our findings extend beyond i.i.d. noise and are not limited by the assumption of small variance.

3. Methodology

In the following, we formally introduce our self-supervised framework for training a model to detect and localize anomalous regions in images. We provide a detailed description of the objective function, the structure of artificially generated anomalies used during training, and the rationale behind our choice of model architecture.

3.1. Training Objective

We identify an autoencoder with input and output tensors corresponding to color images with a parameterized map $f_\theta : [0,1]^{h \times w \times 3} \to [0,1]^{h \times w \times 3}$, $h, w \in \mathbb{N}$. Furthermore, $x \in [0,1]^{h \times w \times 3}$ denotes an original (anomaly-free) image, and $\hat{x}$ denotes a copy of $x$ that has been partially modified. The modified regions within $\hat{x}$ are encoded through a real-valued mask $M \in [0,1]^{h \times w \times 3}$, while $\bar{M} := \mathbf{1} - M$ denotes the corresponding complement, with $\mathbf{1} \in \{1\}^{h \times w \times 3}$ being a tensor of ones. Our goal is to train a model $f_\theta$ that projects arbitrary inputs from the data space to the submanifold of normal images by minimizing, for each triple $(x, \hat{x}, M)$, the following objective:
$$\mathcal{L}(\hat{x}, x, M) = \frac{1 - \lambda}{\|\bar{M}\|_1} \cdot \left\| \bar{M} \odot \left( f_\theta(\hat{x}) - x \right) \right\|_2^2 + \frac{\lambda}{\|M\|_1} \cdot \left\| M \odot \left( f_\theta(\hat{x}) - x \right) \right\|_2^2, \tag{1}$$
where $\odot$ denotes element-wise tensor multiplication, and $\lambda \in [0,1]$ is a hyperparameter controlling the importance of the two terms during training. Here, $\|\cdot\|_p$ denotes the $l^p$-norm on the corresponding tensor space. In terms of supervised learning, we feed a partially corrupted image $\hat{x}$ as input to the network, which is trained to recreate the original image $x$ representing the ground truth label.
By minimizing the above objective, the corresponding autoencoder aims to interpolate between two different tasks. The first term steers the model to reproduce the uncorrupted image regions $\bar{M} \odot x$, while the second term requires the network to correct the corrupted regions $M \odot \hat{x}$ by recreating the original content. Altogether, the objective in (1) encourages the model to produce a locally consistent reconstruction of the input while replacing irregularities, acting as a filter for anomalous patterns. Figure 2 shows a few reconstruction examples produced by our model $f_\theta$. We can see how normal regions are accurately replicated, while irregularities (e.g., scratches or threads) are replaced by locally consistent patterns. During training, we use a specific procedure to generate corrupted images $\hat{x}$ from normal examples $x$ based on randomly generated masks $M$ marking the corrupted regions. We provide a detailed description of this process in the next subsection.
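For concreteness, the following is a minimal PyTorch sketch of the objective in (1); the function name and the stabilizing constant `eps` are our own additions, and `f_theta` is assumed to be a model mapping corrupted images to reconstructions of the same shape.

```python
import torch

def masked_reconstruction_loss(f_theta, x_hat, x, M, lam=0.5, eps=1e-8):
    """Sketch of Equation (1) for batched tensors of shape (B, 3, H, W).

    x_hat: partially corrupted input, x: original image,
    M: real-valued corruption mask in [0, 1], lam: the hyperparameter lambda.
    """
    residual = f_theta(x_hat) - x
    M_bar = 1.0 - M
    # First term: reproduce the uncorrupted regions (weighted by 1 - lambda).
    normal_term = (M_bar * residual).pow(2).sum() / (M_bar.sum() + eps)
    # Second term: recreate the original content under the corrupted regions.
    anomalous_term = (M * residual).pow(2).sum() / (M.sum() + eps)
    return (1.0 - lam) * normal_term + lam * anomalous_term
```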
Given a trained model $f_\theta$, we can perform AD on an input image $\hat{x}$ as follows: First, we compute the difference map $\mathrm{Diff}[\hat{x}, f_\theta(\hat{x})] \in \mathbb{R}^{h \times w}$ between the input $\hat{x}$ and its reconstruction $f_\theta(\hat{x})$ by averaging the pixel-wise squared difference $(\hat{x} - f_\theta(\hat{x}))^2$ over the color channels according to
$$\mathrm{Diff}[\hat{x}, f_\theta(\hat{x})] := \frac{1}{3} \sum_{c=1}^{3} \left( \hat{x}_{\cdot,\cdot,c} - f_\theta(\hat{x})_{\cdot,\cdot,c} \right)^2. \tag{2}$$
The binary segmentation mask can be computed by thresholding the difference map. To obtain a more robust result, we smooth the difference map before thresholding according to the following formula:
$$\mathrm{anomap}_{f_\theta}^{n,k}(\hat{x}) := G_k^n \left( \mathrm{Diff}[\hat{x}, f_\theta(\hat{x})] \right), \tag{3}$$
where $G_k^n = \underbrace{G_k \circ \cdots \circ G_k}_{n \text{ times}}$ denotes the repeated application of a convolution mapping $G_k$ defined by an averaging filter of size $k \times k$ with all entries set to $1/k^2$. We treat the numbers $n, k \in \mathbb{N}$, $k \geq 1$, as hyperparameters, where $G_k^0$ is the identity mapping. By thresholding $\mathrm{anomap}_{f_\theta}^{n,k}(\hat{x})$, we obtain a binary segmentation mask for the anomalous regions. We compute the anomaly score for the entire image $\hat{x}$ from $\mathrm{anomap}_{f_\theta}^{n,k}(\hat{x})$ by summing the scores for the individual pixels. Note that for $n = 0$, if we skip the averaging step over the color channels in (2), this reduces to $\|\hat{x} - f_\theta(\hat{x})\|_2^2$. Alternatively to the summation, we could take the maximum over the pixel scores, making the anomaly score potentially less sensitive to size variations in the anomalous regions. The complete detection procedure is summarized in Figure 3. In Section 4, we investigate which image corruptions approximately preserve the orthogonality of the corresponding transformations.
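A minimal sketch of the detection step in (2) and (3), assuming PyTorch tensors and an odd filter size $k$ so that padding preserves the spatial dimensions:

```python
import torch
import torch.nn.functional as F

def anomaly_heatmap(x_hat, recon, k=3, n=4):
    """Diff map (Eq. 2) followed by n applications of the averaging
    convolution G_k (Eq. 3). x_hat, recon: tensors of shape (B, 3, H, W)."""
    # Pixel-wise squared difference, averaged over the color channels.
    diff = (x_hat - recon).pow(2).mean(dim=1, keepdim=True)  # (B, 1, H, W)
    # Averaging filter of size k x k with all entries set to 1 / k^2.
    G_k = torch.full((1, 1, k, k), 1.0 / k**2, device=diff.device)
    for _ in range(n):  # G_k^n = G_k applied n times
        diff = F.conv2d(diff, G_k, padding=k // 2)
    return diff.squeeze(1)

# Image-level anomaly score: sum of the pixel scores of the heatmap, e.g.,
# score = anomaly_heatmap(x_hat, f_theta(x_hat)).flatten(1).sum(dim=1)
```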

3.2. Generating Artificial Anomalies for Training

We use a self-supervised approach representing the training data as input–output pairs $(\hat{x}, x)$. Here, the ground truth outputs are given by the original images $x$ corresponding to the normal examples. The inputs $\hat{x}$ are generated from these images by partially modifying some regions according to the procedure illustrated in Figure 4.
For each normal example $x$, we first randomly sample the number, the size, and the location of the patches to be modified. In the next step, each randomly selected patch is modified according to the following procedure: First, we create a real-valued mask $M \in [0,1]^{\tilde{h} \times \tilde{w} \times 3}$ based on the elastic deformation technique to mark the corrupted regions within the selected patch. Precisely, we start with a static mask shaped as a disk at the center of the patch with a smoothly fading boundary toward the exterior of the disk. We then apply Gaussian distortion (with random parameters) to this mask, resulting in the varying shapes illustrated on the right side of Figure 4. These masks are used to smoothly merge the original patches with the new content given by the replacement patches. Here, we consider two types of replacements. On the one hand, we can use any natural image (sufficiently different from the original patch) as a potential replacement. In our experiments, we used the publicly available Describable Textures Dataset (DTD) [96], consisting of images of varying backgrounds. On the other hand, we can use the extracted patches themselves as replacements after passing them through a Gaussian distortion process, similar to how we create the masking shapes. An important aspect here is that the corresponding corruptions approximately preserve the original color distribution.
Given an input $x$ and a replacement $y$, we create a corrupted image $\hat{x}$ by smoothly gluing the two images together according to the formula $\hat{x} := M \odot y + \bar{M} \odot x$. In the last step, the patches extracted at the beginning are replaced with their modified versions in the original image. The individual shape masks are embedded in a two-dimensional zero-array at the corresponding locations to create a global mask $M \in [0,1]^{h \times w \times 3}$. During the testing phase, there are no input modifications, and images are passed unchanged to the network. Anomalous regions are detected by thresholding the anomaly heatmap $\mathrm{anomap}_{f_\theta}^{n,k}(\hat{x})$ defined in Equation (3).
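The merging step can be sketched as follows; the soft disk mask below stands in for the elastically distorted masks of Figure 4 (the random Gaussian distortion of the shape is omitted), and all parameter values are illustrative assumptions:

```python
import numpy as np

def soft_disk_mask(h, w, radius_frac=0.3, fade_frac=0.1):
    """Static mask: a filled disk at the patch center with a smoothly
    fading boundary toward the exterior (cf. Figure 4; the random
    Gaussian distortion of the shape is omitted in this sketch)."""
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.hypot(yy - (h - 1) / 2.0, xx - (w - 1) / 2.0)
    radius, fade = radius_frac * min(h, w), fade_frac * min(h, w)
    mask = 1.0 / (1.0 + np.exp((dist - radius) / fade))  # ~1 inside, ~0 outside
    return np.repeat(mask[..., None], 3, axis=2)  # shape (h, w, 3)

def corrupt_patch(x, y):
    """Merge an original patch x with a replacement y of the same shape
    according to x_hat := M * y + (1 - M) * x."""
    M = soft_disk_mask(*x.shape[:2])
    return M * y + (1.0 - M) * x, M
```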

3.3. Model Architecture

We adopt a well-established encoder–decoder architecture commonly used in computer vision tasks such as semantic segmentation [97,98], image inpainting [87,88,89], representation learning [86], and anomaly detection [1,2,91]. Specifically, the encoder is composed of convolutional layers that progressively reduce spatial dimensions while increasing the number of feature maps. Conversely, the decoder gradually restores spatial dimensions while reducing the number of output channels. In contrast to common practices, we also incorporate dilated convolutions [99], which enable the model to efficiently capture long-range dependencies between individual pixels and their local neighborhoods. This modification has two key effects: it allows access to richer contextual information without increasing the network depth and helps address larger anomalous regions that would otherwise be affected by a blind spot phenomenon, as shown in Figure A3 in Appendix I. Due to the interplay between the training objective and the network architecture, the model naturally adopts a filtering behavior by replacing anomalous patterns in the input images with locally consistent reconstructions. We demonstrate the reconstruction effect with several examples in Figure 2.
The overall architecture is summarized in Table 1.
Here, after each convolution (Conv), except in the last layer, we use batch normalization and rectified linear units (ReLUs) as the activation function. Max-pooling (MaxPool) is applied to reduce the spatial dimension of the intermediate feature maps. TranConv denotes the transposed convolution operation and also uses batch normalization with ReLUs. In the last layer, we use a sigmoid activation function without batch normalization. $\mathrm{SDC}_{1,2,4,8,16,32}$ refers to a stacked dilated convolution block in which multiple dilated convolutions are stacked together. The corresponding subscript $\{1, 2, 4, 8, 16, 32\}$ denotes the dilation rates of the six individual convolutions in each stack. After each stack, we add an additional convolutional layer with kernel size $3 \times 3$ and the same number of feature maps as in the stack, followed by batch normalization and ReLU activation. We refer to this baseline architecture as SDC-AE, indicating an autoencoder (AE) that relies on stacked dilated convolutions (SDCs).
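A minimal PyTorch sketch of one such stack is given below. Reading "stacked" as a sequential application of the six dilated convolutions is our assumption, as is the constant channel width; Table 1 fixes the actual layout.

```python
import torch.nn as nn

class SDCBlock(nn.Module):
    """Sketch of an SDC_{1,2,4,8,16,32} block: six dilated 3x3 convolutions
    applied in sequence, each followed by BN + ReLU, with padding chosen to
    preserve the spatial size, plus the closing 3x3 convolution."""

    def __init__(self, channels, rates=(1, 2, 4, 8, 16, 32)):
        super().__init__()
        layers = []
        for r in rates:
            layers += [
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ]
        # Additional 3x3 convolution closing the stack (BN + ReLU).
        layers += [
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        ]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)
```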
In our experiments, we observed that SDC-AE struggled to reproduce finer visual patterns, sometimes resulting in false positive detections. In order to improve the reconstruction ability, we proposed the following adjustment illustrated in Figure 5. Similar to the approach commonly used in image segmentation, we introduce skip connections into the network. However, direct access to the feature maps from earlier stages appears counterproductive, as it may interfere with other parts of the computational flow that are essential for suppressing corrupted regions in intermediate representations. Instead, we combine information from different layers of the network using a specific type of attention mechanism. Specifically, we first compute a tensor of coefficients ranging from zero to one, which we then use to calculate a component-wise convex combination of the individual feature maps. The corresponding procedure is illustrated in Figure 6. The idea behind this approach is that it provides the model with a more explicit mechanism for reusing, refining, or replacing regions of the input based on its assessment of the abnormality of a given region. We consolidate the corresponding computations into a single convex combination module (CCM), highlighted in brown in Figure 5. We refer to this architecture as SDC-CCM.
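A sketch of the CCM in PyTorch follows; the kernel size and the hidden width are our assumptions, while the stacked input $[x_s, x_c, (x_s - x_c)^2]$, the two convolutional layers, and the convex combination follow Figure 6.

```python
import torch
import torch.nn as nn

class CCM(nn.Module):
    """Sketch of the convex combination module (CCM) from Figure 6."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        p = kernel_size // 2
        self.net = nn.Sequential(
            # Input: [x_s, x_c, (x_s - x_c)^2] stacked along the channels.
            nn.Conv2d(3 * channels, channels, kernel_size, padding=p),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size, padding=p),
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),  # coefficient tensor beta in (0, 1)
        )

    def forward(self, x_s, x_c):
        beta = self.net(torch.cat([x_s, x_c, (x_s - x_c).pow(2)], dim=1))
        # Component-wise convex combination of skip and current features.
        return beta * x_s + (1.0 - beta) * x_c
```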
In our experiments on the MVTec AD dataset, we observed a significant performance boost in some object categories when using SDC-CCM over SDC-AE. See Figure A4 in Appendix J for a comparison of the reconstruction quality between the models.

4. Theoretical Analysis

In this section, we provide a number of theoretical insights supporting our idea behind the proposed AD method. We begin by establishing a close connection to the related regularized AEs by identifying the orthogonal projection as an optimal solution for the optimization problem of the RCAE. We then investigate the conservation properties of the orthogonal projection with respect to the segmentation accuracy in the context of AD. As our main result, we show that under certain conditions the resulting model approximates a mapping that stochastically converges (with an increasing number of input pixels) to the orthogonal projection of partially corrupted images onto the submanifold of uncorrupted examples.
For the purpose of the following analysis, we identify each projecting autoencoder (PAE) with an idempotent mapping $f : U \to \mathbb{R}^n$ from an input space $U \subseteq \mathbb{R}^n$, $n \in \mathbb{N}$, to a differentiable manifold $\mathcal{D} := f(U) \subseteq U$, $\dim(\mathcal{D}) < n$. In the context of AD, we refer to the manifold $\mathcal{D}$ as the (nonlinear) subspace of normal examples, while each $\hat{x} \in \mathbb{R}^n \setminus \mathcal{D}$ is considered anomalous. In particular, we exploit a generalization of orthogonal projection to nonlinear mappings defined below and illustrated in Figure 7.
Definition 1.
For $n \in \mathbb{N}$ in a metric space $(\mathbb{R}^n, d(\cdot, \cdot))$, we call a (nonlinear) mapping $f : U \to \mathbb{R}^n$, $U \subseteq \mathbb{R}^n$, the orthogonal projection onto $\mathcal{D} \subseteq \mathbb{R}^n$ if it satisfies the equality
$$d(\hat{x}, f(\hat{x})) = \inf_{y \in \mathcal{D}} d(\hat{x}, y) \tag{4}$$
for all $\hat{x} \in U$.
Note that the expression in (4) can be written as a set of inequalities. Namely, for $d(\hat{x}, f(\hat{x})) := \|\hat{x} - f(\hat{x})\|_2$, we obtain the following alternative description:
$$\forall y \in \mathcal{D} : \quad \|\hat{x} - f(\hat{x})\|_2 \leq \|\hat{x} - y\|_2. \tag{5}$$
We use form (5) in Section 4.3.
In order to train a PAE, we minimize the following objective:
$$\mathcal{L}_{\mathrm{PAE}}[f] = \mathbb{E}_{x, T} \left[ \left\| f(T(x)) - x \right\|_p^p \right], \tag{6}$$
where $p \in \mathbb{N}_+$, and $T : \mathcal{D} \to \mathbb{R}^n$ denotes some random data transformation. That is, the expectation in (6) is taken with respect to $x \in \mathcal{D}$ and $T \sim \mathcal{T}$. Note how the above objective is related to our training objective in (1). There, we explicitly use a mask $M$ only to balance the training with respect to the reconstruction accuracy of normal and anomalous regions. If we fix the value of $\|M\|_1$ and set $\lambda = \|M\|_1 / (\|M\|_1 + \|\bar{M}\|_1)$, the objective in (1) reduces to minimizing the loss $\mathcal{L}(\hat{x}, x) = \|f_\theta(\hat{x}) - x\|_2^2$. Here, the shape of $\mathcal{T}$ mainly determines the properties of the projection map to be learned, which in turn simulates the inverse mapping of the corresponding input corruptions $T$. Depending on our goal, $T$ can be very specific, ranging from additive noise (e.g., Gaussian noise) to partial occlusions and elastic deformations. In Section 4.3, we identify a number of input transformations that (asymptotically) preserve the orthogonality of the corresponding projections.
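To make this reduction explicit, a short check (using, in the last step, the simplifying assumption of a binary mask $M \in \{0, 1\}^{h \times w \times 3}$, which idealizes the real-valued masks used in training):

```latex
% Substituting \lambda = \|M\|_1 / (\|M\|_1 + \|\bar{M}\|_1), and hence
% 1 - \lambda = \|\bar{M}\|_1 / (\|M\|_1 + \|\bar{M}\|_1), into (1) gives
\mathcal{L}(\hat{x}, x, M)
  = \frac{\|\bar{M} \odot (f_\theta(\hat{x}) - x)\|_2^2
        + \|M \odot (f_\theta(\hat{x}) - x)\|_2^2}{\|M\|_1 + \|\bar{M}\|_1}.
% For binary M, each entry satisfies m^2 + (1 - m)^2 = 1, so the numerator
% collapses to \|f_\theta(\hat{x}) - x\|_2^2, while the denominator equals
% the constant 3hw, leaving \|f_\theta(\hat{x}) - x\|_2^2 up to scaling.
```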

4.1. Connections to Regularized Autoencoders

Typically, an autoencoder is composed of two building blocks: the encoder $e(\cdot)$, which maps the input $x$ to some internal representation, and the decoder $d(\cdot)$, which maps $e(x)$ back to the input space. The composition $f(x) = d(e(x))$ is often referred to as the reconstruction mapping. Most of the regularized autoencoders aim to capture the structure of the training distribution based on an interplay between the reconstruction error and the regularization term. They are trained by minimizing the reconstruction loss on the training data, either by directly adding a regularization term to the objective or by introducing the regularization through some kind of data augmentation.
Specifically, the contractive autoencoder (CAE) [84] is trained to minimize the following regularized reconstruction loss:
$$\mathcal{L}_{\mathrm{CAE}}[f] = \mathbb{E}_x \left[ \|f(x) - x\|_2^2 + \lambda \|D_x e\|_F^2 \right], \tag{7}$$
where $\lambda \in \mathbb{R}_+$ is a weighting hyperparameter, and $\|D_x e\|_F$ is the Frobenius norm of the Jacobian of the encoder $e(\cdot)$.
The denoising autoencoder (DAE) [92], on the other hand, is trained to minimize the following denoising criterion:
$$\mathcal{L}_{\mathrm{DAE}}[f] = \mathbb{E}_{x, \varepsilon} \left[ \|f(x + \varepsilon) - x\|_2^2 \right], \tag{8}$$
where $\varepsilon$ represents some additive noise, and the expectation is taken over the training distribution and the corruption noise. Specifically, the term noise includes additive (e.g., isotropic Gaussian noise) and non-additive transformations (e.g., masking or pepper noise). However, the unifying feature of the considered corruptions is that the transformations of the individual entries in the input are statistically independent. In contrast, in (6), we consider more complex modifications, which result in a strong spatial correlation of the corruption variables. Note that the general objective of the DAE allows for arbitrary stochastic corruptions. However, previous theoretical investigations including [85] focused on (mostly additive) unstructured noise, which can be characterized by the i.i.d. assumption on the distribution of the individual pixels. The results in [85] demonstrate that there is a close connection between the CAE and DAE when the standard deviation of the corruption noise approaches zero, $\sigma \to 0$. More precisely, under some technical assumptions, the objective of the DAE can be written as
$$\mathcal{L}_{\mathrm{DAE}}[f] = \mathbb{E}_x \left[ \|f(x) - x\|_2^2 + \sigma^2 \|D_x f\|_F^2 \right] \tag{9}$$
as $\sigma \to 0$. That is, for a small variance $\sigma^2$ of the corruption noise, the DAE becomes similar to a CAE with penalty coefficient $\lambda = \sigma^2$, but where the contraction is imposed explicitly on the whole reconstruction mapping $f$. This connection motivated the authors to define the RCAE, a variation of the CAE, by minimizing the following objective:
$$\mathcal{L}_{\mathrm{RCAE}}[f] = \mathbb{E}_x \left[ \|f(x) - x\|_2^2 + \lambda \|D_x f\|_F^2 \right], \tag{10}$$
where the regularization term affects the whole reconstruction mapping. Finally, a common variant of the SAE applies an $l^1$ penalty on the hidden unit activations, effectively making them saturate toward zero, similar to the effect in the CAE.
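As an aside, the contraction penalty $\|D_x f\|_F^2$ in (10) can be estimated without forming the full Jacobian; the following Hutchinson-style Monte Carlo sketch (our own illustration, not the paper's method) uses the identity $\mathbb{E}_v \|J^\top v\|_2^2 = \|J\|_F^2$ for $v \sim \mathcal{N}(0, I)$.

```python
import torch

def rcae_penalty(f, x, num_samples=1):
    """Monte-Carlo estimate of ||D_x f||_F^2 from Equation (10)."""
    x = x.detach().requires_grad_(True)
    y = f(x)
    total = 0.0
    for _ in range(num_samples):
        v = torch.randn_like(y)
        # Vector-Jacobian product J^T v via reverse-mode autodiff;
        # E_v ||J^T v||^2 = trace(J J^T) = ||J||_F^2.
        (jtv,) = torch.autograd.grad(y, x, grad_outputs=v,
                                     create_graph=True, retain_graph=True)
        total = total + jtv.pow(2).sum()
    return total / num_samples
```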
We now show that a PAE, realized by the orthogonal projection onto the submanifold of normal examples, is an optimal solution that minimizes the objective of the RCAE in (10). We provide a formal proof in Appendix F.
Proposition 1.
Let $f^* : U \to \mathbb{R}^n$, $U \subseteq \mathbb{R}^n$, $n \in \mathbb{N}$, be an orthogonal projection onto $\mathcal{D} := f^*(U)$ with respect to an $l^p$-norm, $p \in \mathbb{N}_+$. Then, $f^*$ is an optimal solution to the following optimization problem:
$$\underset{f \in C^1}{\text{minimize}} \quad \mathbb{E}_{x \in \mathcal{D}} \left[ \|f(x) - x\|_2^2 + \lambda \|D_x f\|_q^2 \right], \tag{11}$$
where $\lambda \in \mathbb{R}_+$, and $q \in \{F, 2\}$ is a placeholder denoting either the Frobenius or the spectral norm.
Note that the objective in the above proposition is slightly more general than that in (10), allowing the use of the spectral norm. Although closely related to each other, the Frobenius and the spectral norm might have different effects on the optimization. Namely, the spectral norm measures the maximal scale by which a unit vector is stretched by the corresponding linear transformation and is determined by the maximal singular value, while the Frobenius norm measures the overall distortion of the unit circle, taking into account all the singular values. In a more general sense, Proposition 1 implies a close connection between the PAE and other types of regularized autoencoders (see Figure 8 for an overview).
Orthogonal projection, in particular, appears to be an appropriate choice with respect to the shared goals of the autoencoding models discussed in this section. In Section 4.3.2, we show that in the limit when the dimensionality of the inputs goes to infinity, the DAE converges stochastically to the orthogonal projection when restricted to the specific noise corruptions.

4.2. Conservation Effect of Orthogonal Projections

Here, we focus on the subtask of AD regarding the segmentation of anomalous regions in the inputs. In the following, we identify each image tensor with a column vector $x \in \mathbb{R}^n$, where $S \subseteq \{1, \dots, n\}$ and $\bar{S} := \{1, \dots, n\} \setminus S$ denote a set of pixel indices and its complement, respectively. We write $x_S$ to denote the restriction of a vector $x$ to the indices in $S$.
Consider an $\hat{x} \in \mathbb{R}^n$ that has been generated by a (partial) modification $\hat{x} := T(x)$, $T \sim \mathcal{T}$, from an $x \in \mathcal{D}$. We define the modified region as the set of disagreeing indices according to
$$S(\hat{x}, x) := \left\{ i \in \{1, \dots, n\} : \hat{x}_i \neq x_i \right\}. \tag{12}$$
There is some ambiguity about what might be considered an anomalous region, which motivates the following definition:
Definition 2.
Let $\mathcal{D} \subseteq \mathbb{R}^n$, $n \in \mathbb{N}$, be a data manifold. Given $\hat{x} \in \mathbb{R}^n \setminus \mathcal{D}$, we define the set of anomalous regions in $\hat{x}$ as the areas of smallest disagreement according to
$$\mathcal{A}(\hat{x}) := \left\{ S(\hat{x}, x) : x \in \underset{y \in \mathcal{D}}{\operatorname{argmin}} \, |S(\hat{x}, y)| \right\}. \tag{13}$$
We refer to each $S \in \mathcal{A}(\hat{x})$ as an anomalous region, to the pair $(S, \hat{x}_S)$ as an anomalous pattern, and to $\hat{x}$ as an anomalous sample.
We now provide some explanation of the motivation behind our definition of anomalous patterns. Consider an example of binary sequences $x \in \{0, 1\}^8$ restricted by the condition that the pattern "11" is forbidden. For example, the sequence $(0, 1, 0, 1, 1, 1, 1, 0)$ is invalid because it contains "11". If we define the anomalous region as the smallest subset of indices that need to be corrected in order for the point to be projected onto the data manifold, we encounter the following ambiguity problem. Namely, there are three different ways to correct the above example using a minimal number of changes: $(0, 1, 0, 1, 0, 1, 0, 0)$, $(0, 1, 0, 0, 1, 0, 1, 0)$, and $(0, 1, 0, 1, 0, 0, 1, 0)$, which we denote as $y_1$, $y_2$, and $y_3$, respectively. All three sequences correspond to orthogonal projections onto the feasible set, where $\|\hat{x} - y_1\|_2 = \|\hat{x} - y_2\|_2 = \|\hat{x} - y_3\|_2 = \sqrt{2}$. However,
$$\bigcap_{i=1}^{3} S(\hat{x}, y_i) \neq \bigcup_{i=1}^{3} S(\hat{x}, y_i). \tag{14}$$
That is, there are multiple anomalous patterns giving rise to the same corrupted point x ^ . In particular, the anomalous regions in x ^ are not uniquely determined and depend on the structure of D . Based on this observation, we highlight a special projection map below, which maximally preserves the content of its arguments.
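This ambiguity can be verified by brute force; the following minimal sketch (our own illustration) enumerates all minimal corrections of the invalid sequence above:

```python
from itertools import product

x_hat = (0, 1, 0, 1, 1, 1, 1, 0)

# Feasible set D: all binary sequences of length 8 without the pattern "11".
def is_valid(s):
    return all(not (a == 1 and b == 1) for a, b in zip(s, s[1:]))

def hamming(a, b):
    return sum(ai != bi for ai, bi in zip(a, b))

valid = [s for s in product((0, 1), repeat=8) if is_valid(s)]
d_min = min(hamming(x_hat, s) for s in valid)

# All minimal corrections and their anomalous regions (1-based indices).
for s in valid:
    if hamming(x_hat, s) == d_min:
        region = {i + 1 for i, (a, b) in enumerate(zip(x_hat, s)) if a != b}
        print(s, region)   # prints y1, y2, y3 with regions {5,7}, {4,6}, {5,6}
```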
Definition 3.
For $n \in \mathbb{N}$, $p \in \mathbb{N}_+$, we call an idempotent mapping $f \colon U \to \mathbb{R}^n$, $U \subseteq \mathbb{R}^n$, the conservative projection onto $\mathcal{D} := f(U) \subseteq U$ with respect to the $l_p$-norm if, for each pair $(\hat{\boldsymbol{x}}, \boldsymbol{x}) \in U \times \mathcal{D}$ with $f(\hat{\boldsymbol{x}}) = \boldsymbol{x}$ and $S := S(\hat{\boldsymbol{x}}, \boldsymbol{x})$, it satisfies the following properties:
$$(a)\ S \in \mathcal{A}(\hat{\boldsymbol{x}}), \qquad (b)\ \lVert \hat{\boldsymbol{x}}_S - f_S(\hat{\boldsymbol{x}}) \rVert_p \leq \inf_{\boldsymbol{x}' \in \mathcal{D} :\, \hat{\boldsymbol{x}}_{\bar{S}} = \boldsymbol{x}'_{\bar{S}}} \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{x}'_S \rVert_p.$$
The next proposition relates the conservation properties of orthogonal projections for different $l_p$-norms. See Figure 9 for an illustration. A corresponding proof is provided in Appendix G.
Proposition 2.
Let $\mathcal{D} \subseteq \mathbb{R}^n$, $n \in \mathbb{N}$, be a data manifold and $\hat{\boldsymbol{x}} \in \mathbb{R}^n \setminus \mathcal{D}$ be an anomalous sample. Furthermore, let $S \in \mathcal{A}(\hat{\boldsymbol{x}})$ and $S_p := S(\hat{\boldsymbol{x}}, \boldsymbol{y}_p^*)$, where $\boldsymbol{y}_p^* \in \operatorname*{arginf}_{\boldsymbol{y} \in \mathcal{D}} \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert_p$ corresponds to the orthogonal projection of $\hat{\boldsymbol{x}}$ onto $\mathcal{D}$ with respect to the $l_p$-norm. For all $p, q \in \mathbb{N}_+$, $p < q$, the following statements are true:
$$(a)\ S \subseteq S_p \subseteq S_q, \qquad (b)\ \exists\, \mathcal{D} \subseteq \mathbb{R}^n : S_1 \neq S_2.$$
The above proposition shows that the orthogonal projection with respect to the $l_p$-norm is more conservative for lower values $p \geq 1$ regarding the preservation of normal regions. However, the $l_2$-norm has practical advantages over the $l_1$-norm regarding the optimization process and is, therefore, often a better choice. Furthermore, the orthogonal projection with respect to the $l_2$-norm (unlike the conservative projection) is not maximally preserving in general. We show later, however, that in the limit of increasing input dimensionality, the conservative and the orthogonal projections (with respect to the $l_2$-norm) disagree only on a set of zero probability measure.
While Proposition 2 describes the conservation properties of orthogonal projections relative to each other, the following proposition specifies (asymptotically) how accurately the corresponding reconstruction error describes the anomalous region. As an auxiliary concept, we here introduce the notion of a transition set (illustrated in Figure 10), which glues the disconnected parts of an input together. We provide more details in Appendix H.
Proposition 3.
Let $f \colon [0,1]^n \to \mathcal{D} \subseteq [0,1]^n$, $n \in \mathbb{N}$, be the orthogonal projection with respect to an $l_p$-norm, $p \in \mathbb{N}_+$, and let $\boldsymbol{x} \in \mathcal{D}$ have a finite set of states $x_i \in I \subseteq [0,1]$, $|I| < \infty$. Consider an $\hat{\boldsymbol{x}} \in [0,1]^n$ that has been generated from $\boldsymbol{x}$ via partial modification with $\hat{\boldsymbol{x}}_{\bar{S}} = \boldsymbol{x}_{\bar{S}}$ for some $S \subseteq \{1, \ldots, n\}$, $\bar{S} := \{1, \ldots, n\} \setminus S$. Furthermore, let $B \subseteq \{1, \ldots, n\}$ denote a transition set from $S$ to $\bar{S} \setminus B$. For all $p \in \mathbb{N}_+$, the following holds true:
$$\lVert f_{\bar{S}}(\hat{\boldsymbol{x}}) - \boldsymbol{x}_{\bar{S}} \rVert_p \leq |B|^{\frac{1}{p}},$$
where $|B|$ grows asymptotically according to $O(\sqrt{|S|})$.
When projecting corrupted points onto the data manifold, we would like the transition set $B$ around the anomalous region $S$ to be as small as possible. The inequality in (15) implicitly upper-bounds the number of entries in the normal region $\bar{S}$ that are not preserved by the projection. This bound is most practical for lower values of $p$ and least informative in the case $p = \infty$. Namely, $\lim_{p \to \infty} |B|^{\frac{1}{p}} = 1$ provides only a trivial upper bound, since $\lVert \boldsymbol{x} \rVert_\infty \leq 1$ for $\boldsymbol{x} \in [0,1]^n$. For $p = 1$, in contrast, the interpretation is simplest in the case of finite sequences of binary values: the number of disagreeing components is then upper-bounded by $|B|$, the size of the transition set, which, in turn, is determined by the form of the underlying data distribution.
To summarize, we showed that the orthogonal projection with respect to the $l_p$-norm preserves the normal regions more accurately for smaller values of $p$. Furthermore, we identified the conservative projection as the one that is maximally preserving (unlike the orthogonal projection). As previously mentioned, we show in Section 4.3 that in the limit (when the dimensionality of the input vectors goes to infinity), the $l_2$-conservative and the $l_2$-orthogonal projection coincide up to a set of zero probability measure.

4.3. Convergence Guarantees for Input Corruptions

In the following, we specify a range of input modifications that (approximately) preserve the orthogonality of the corresponding projections. In the context of image processing, the plethora of existing data augmentation techniques can be roughly divided into five groups: affine transformations, color jittering, mixing strategies, elastic deformations, and additive noise. Here, we characterize each image transformation either as (a) a partial modification (of any type) or as (b) a modification affecting the entire image, represented by the additive noise methods. Image mixing strategies like MixUp can simply be seen as transformations with additive noise, while CutMix is an example of a partial modification. On the other hand, linear and affine data transformations like shift, rotation, or color-channel permutation, in general, do not preserve orthogonality.
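The dichotomy between (a) and (b) is easy to make concrete; a small NumPy sketch (our own illustration, with a CutMix-style patch and a MixUp-style blend):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((64, 64))      # stand-in for a normalized grayscale image
donor = rng.random((64, 64))  # a second image used for mixing

# (a) Partial modification: only the indices inside a region S disagree
# with x, as in a CutMix-style patch replacement.
x_partial = x.copy()
x_partial[20:36, 20:36] = donor[20:36, 20:36]
S = np.flatnonzero((x_partial != x).ravel())   # set of disagreeing indices
print("modified fraction:", S.size / x.size)   # ~6% of the pixels

# (b) Modification of the entire image: a MixUp-style blend can be written
# as x + eps with the additive term eps = lam * (donor - x).
lam = 0.3
x_noisy = x + lam * (donor - x)
print("disagreeing fraction:", np.mean(x_noisy != x))  # ~1.0 (all pixels)
```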
For the purpose of the subsequent analysis, we identify the image tensors in our input space with a multivariate random variable $\boldsymbol{x} \in \mathbb{R}^n$ corresponding to a Markov random field (MRF) [100,101,102,103,104,105,106,107,108]. In particular, we assume that the variables representing the individual pixels are organized in a two-dimensional grid. Based on this view, we consider sequences $\mathcal{D}(n) \subseteq [0,1]^n$ of spaces of increasing dimensionality $n \in \mathbb{N}$, representing images of gradually increasing size obtained by adding new nodes along the rows and columns of the grid. Furthermore, we use the notation $d_G(x_i, x_j)$ to denote the distance between the variables $x_i$ and $x_j$ in the corresponding MRF graph $G$, defined as the number of edges on a shortest path connecting the two nodes.

4.3.1. Partial Modification

The following theorem describes the technical conditions under which the correction of a partial modification corresponds to the orthogonal projection. We provide a formal proof in Appendix D.
Theorem 1.
Consider a pair $\boldsymbol{x}, \boldsymbol{y} \in \mathcal{D}(n)$ of independent MRFs over identically distributed variables $x_i, y_j$ with finite fourth moment $\mathbb{E}[x_i^4] < \infty$, variance $\sigma^2 := \operatorname{Var}[x_i]$, and vanishing covariance
$$\operatorname{Cov}[x_i^k, x_j^l] \to 0 \quad \text{for} \quad d_G(x_i, x_j) \to \infty$$
for all $k, l \in \{1, 2\}$. Furthermore, let $\hat{\boldsymbol{x}} \in [0,1]^n$ be a copy of $\boldsymbol{x}$ that has been partially modified, where $\hat{\boldsymbol{x}}_{\bar{S}} = \boldsymbol{x}_{\bar{S}}$ for some $S \subseteq \{1, \ldots, n\}$, $\bar{S} := \{1, \ldots, n\} \setminus S$. The following is true:
$$\lim_{n \to \infty} P_{\boldsymbol{x}, \boldsymbol{y}} \big( \lVert \hat{\boldsymbol{x}} - \boldsymbol{x} \rVert_2 \leq \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert_2 \big) = 1,$$
provided the inequality
$$|S| \leq 2 \sigma^2 |\bar{S}|$$
holds for all $n \in \mathbb{N}$.
Given two samples $\boldsymbol{x}, \boldsymbol{y} \in \mathcal{D}$ and a corrupted copy $\hat{\boldsymbol{x}} \in [0,1]^n$, Theorem 1 describes how the statement $\lVert \hat{\boldsymbol{x}} - \boldsymbol{x} \rVert_2 \leq \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert_2$ becomes true with probability approaching one when the dimensionality of the embedding space $n \in \mathbb{N}$ goes to infinity. In particular, $\boldsymbol{x}$ can be identified with the image of $\hat{\boldsymbol{x}}$ under the conservative projection $f(\hat{\boldsymbol{x}}) = \boldsymbol{x}$. That is,
$$\lim_{n \to \infty} P_{\boldsymbol{x}, \boldsymbol{y}} \big( \lVert \hat{\boldsymbol{x}} - f(\hat{\boldsymbol{x}}) \rVert_2 \leq \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert_2 \big) = 1.$$
Therefore, the conservative projection converges stochastically to the orthogonal projection (compare to the definition of the orthogonal projection in (5)), while the inequality in (18) controls the maximal size of the corrupted regions $|S|$ by taking into account the distribution of normal examples. On the other hand, Equation (19) suggests that (in the limit) the orthogonal and the conservative projection disagree at most on a subset of the data manifold $\mathcal{D}$ that has zero probability measure under the training distribution.
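The statement of Theorem 1 can also be checked empirically. Below is a minimal Monte Carlo sketch (our own illustration) that uses i.i.d. uniform pixels, the simplest special case of an MRF with vanishing covariance, and a corruption budget satisfying $|S| \leq 2\sigma^2 |\bar{S}|$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0 / 12.0  # variance of a Uniform[0, 1] pixel

def trial(n):
    x = rng.random(n)            # "normal" sample (i.i.d. pixels: a special
    y = rng.random(n)            # case of vanishing covariance)
    s = n // 8                   # |S| / |S_bar| = 1/7 < 2 * sigma2 ~ 0.167
    assert s <= 2 * sigma2 * (n - s)   # corruption budget from Theorem 1
    x_hat = x.copy()
    x_hat[:s] = rng.random(s)    # partial modification on the region S
    return np.linalg.norm(x_hat - x) <= np.linalg.norm(x_hat - y)

for n in (10, 100, 1000, 10000):
    hits = np.mean([trial(n) for _ in range(2000)])
    print(n, hits)   # the empirical probability approaches 1 as n grows
```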

4.3.2. Additive Noise

The following theorem describes the technical conditions under which the denoising process (with additive noise) corresponds to the orthogonal projection. We provide a formal proof in Appendix E.
Theorem 2.
Consider a pair $\boldsymbol{x}, \boldsymbol{y} \in \mathcal{D}(n)$ of independent MRFs over identically distributed variables $x_i, y_j$ with finite fourth moment $\mathbb{E}[x_i^4] < \infty$, variance $\sigma^2 := \operatorname{Var}[x_i]$, and vanishing covariance
$$\operatorname{Cov}[x_i^k, x_j^l] \to 0 \quad \text{for} \quad d_G(x_i, x_j) \to \infty$$
for all $k, l \in \{1, 2\}$. Furthermore, let $\hat{\boldsymbol{x}} := \boldsymbol{x} + \boldsymbol{\varepsilon}$ be another MRF, where $\boldsymbol{\varepsilon} = (\epsilon_1, \ldots, \epsilon_n)$ is an additive noise vector of i.i.d. variables with $\mu_\varepsilon := \mathbb{E}[\epsilon_i]$ and $\sigma_\varepsilon^2 := \operatorname{Var}[\epsilon_i]$. Then, the following holds true:
$$\lim_{n \to \infty} P_{\boldsymbol{x}, \boldsymbol{y}} \big( \lVert \hat{\boldsymbol{x}} - \boldsymbol{x} \rVert_2 \leq \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert_2 \big) = 1,$$
provided
$$\mu_\varepsilon^2 + \sigma_\varepsilon^2 \leq \tfrac{1}{2} \sigma^2.$$
In the special example of isotropic Gaussian noise $\boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0}, \sigma_\varepsilon^2 \cdot I)$, the condition in (22) reduces to $\sigma_\varepsilon^2 \leq \frac{1}{2}\sigma^2$. Here, again, $\boldsymbol{x}$ can be identified with the image of the conservative projection, which removes the specific type of noise from its arguments. The equation in (21) then implies its convergence to the orthogonal projection.
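Analogously, the Gaussian special case of Theorem 2 can be simulated directly; a minimal sketch (our own illustration) at the boundary $\sigma_\varepsilon^2 = \frac{1}{2}\sigma^2$ of the admissible noise level:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.0 / 12.0              # pixel variance for Uniform[0, 1]
sigma_eps = np.sqrt(sigma2 / 2)  # boundary case of the condition (22)

def trial(n):
    x = rng.random(n)
    y = rng.random(n)
    x_hat = x + rng.normal(0.0, sigma_eps, size=n)  # isotropic Gaussian noise
    return np.linalg.norm(x_hat - x) <= np.linalg.norm(x_hat - y)

for n in (10, 100, 1000, 10000):
    print(n, np.mean([trial(n) for _ in range(2000)]))  # -> 1 as n grows
```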

5. Experiments

In our experiments, we used the publicly available MVTec AD dataset [95,109], a popular benchmark for anomaly detection in the manufacturing domain. It offers significant variation in materials and anomaly types, organized into 5 texture categories and 10 object categories. Each category encompasses multiple classes of anomalies, with the number of training examples ranging up to several hundred images. For a detailed technical description of the dataset, we direct the reader to the original publication [109]. During training, we artificially increased the amount of data by applying simple augmentations, such as rotation and flipping, depending on the category, and used 5% of the data as a validation set. For the sake of reproducibility, we mention that for the transistor category, we additionally applied random image rotations by multiples of 90 degrees during training, on top of the artificial corruptions. This made the training procedure considerably more challenging, but it proved essential for detecting missing objects.
For each category, we trained a separate model based on the proposed architectures (SDC-AE and SDC-CCM). For both models, we used the same image resolution of $512 \times 512$ pixels for texture categories and $256 \times 256$ pixels for object categories, together with the Adam optimizer [110] with an initial learning rate of $10^{-4}$. Note that our model architecture makes no assumptions about the size of the input images. The SDC-AE architecture, in particular, is fully convolutional and can be applied, after training, to any input size that is a multiple of the network's stride. However, the input size is an important hyperparameter that influences the final anomaly detection performance. The resolution choices of $512 \times 512$ and $256 \times 256$ pixels proved to work well on the considered dataset.
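As a rough, self-contained sketch of this training setup (the architecture below is a heavily simplified stand-in, not the actual SDC-AE, and the occlusion routine is a simplistic placeholder for the stochastic corruptions used for self-supervision; only the optimizer choice and learning rate follow the text):

```python
import torch
import torch.nn as nn

# A minimal stand-in for the reconstruction model; the real SDC-AE uses
# stacked dilated convolutions, which are only hinted at here.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=2, dilation=2), nn.ReLU(),  # dilated layer
    nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # as in the paper

def corrupt(images, size=64):
    """Occlude one random square region per image (placeholder corruption)."""
    images = images.clone()
    b, _, h, w = images.shape
    for i in range(b):
        top = torch.randint(0, h - size, (1,)).item()
        left = torch.randint(0, w - size, (1,)).item()
        images[i, :, top:top + size, left:left + size] = torch.rand(3, size, size)
    return images

# One illustrative step on random tensors in place of defect-free images.
images = torch.rand(4, 3, 256, 256)       # object categories: 256 x 256
loss = ((model(corrupt(images)) - images) ** 2).mean()  # repair objective
optimizer.zero_grad(); loss.backward(); optimizer.step()
print(float(loss))
```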
We report the pixel-level and the image-level AUROC metrics to illustrate the segmentation and recognition performance in Table 2 and Table 3, respectively. We compare our results (for SDC-AE and SDC-CCM) with a number of previous methods like AnoGAN [6], VAE [9], and LSR [3], including top-ranking algorithms such as RIAD [39], CutPaste [40], InTra [41], DRAEM [38], SimpleNet [94], PatchCore [10], MSFlow [111], and PNI [68]. Our method achieves high performance with both architectures. However, SDC-CCM significantly outperforms SDC-AE in the cable, capsule, and transistor categories due to a reduction in false positive detections, supporting our idea behind the CCM module. To further validate our choice of network architecture, we describe an additional ablation study in Appendix I. This study demonstrates the importance of dilated convolutions and of using fine-grained information from earlier layers in subsequent layers, either through skip connections or the proposed CCM module. To give a sense of the visual quality of the resulting anomaly heatmaps produced by our method, we show a few additional examples in Figure 11.
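For readers who want to reproduce the evaluation in spirit, the following sketch shows how an anomaly heatmap of the kind we score with AUROC can be computed; the smoothing structure and the hyperparameters $k$ and $n$ follow Equation (3) and the ablation study in Appendix I, while the function itself is our own minimal stand-in:

```python
import torch
import torch.nn.functional as F

def anomaly_heatmap(x, reconstruction, k=15, n=3):
    """Channel-averaged squared reconstruction error, smoothed by n
    successive k x k averaging convolutions (k, n as in Appendix I)."""
    diff = ((x - reconstruction) ** 2).mean(dim=1, keepdim=True)  # B x 1 x H x W
    kernel = torch.full((1, 1, k, k), 1.0 / (k * k))
    for _ in range(n):
        diff = F.conv2d(diff, kernel, padding=k // 2)
    return diff.squeeze(1)

# Image-level score: e.g., the maximum of the smoothed heatmap.
x = torch.rand(2, 3, 256, 256)
heat = anomaly_heatmap(x, torch.rand_like(x))
print(heat.shape, heat.amax(dim=(1, 2)))
```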

6. Conclusions

We focused on an important use case for anomaly detection, where the data distribution is supported by a lower-dimensional manifold and the covariance between input components vanishes as their distance increases. Our self-supervised approach aims to learn a reconstruction model that repairs artificially corrupted inputs based on a specific form of stochastic occlusions. The resulting training effect regularizes the model to produce locally consistent reconstructions while replacing irregularities, thus acting as a filter that removes anomalous patterns. We demonstrated the effectiveness of our approach by achieving state-of-the-art results on the MVTec AD dataset, a challenging benchmark for visual anomaly detection in the manufacturing domain.
Additionally, we performed a theoretical analysis of the proposed method, providing several interesting insights. As our main result, we showed that, as the dimensionality of the inputs increases, the corresponding model approximates a mapping that stochastically converges to the orthogonal projection of partially corrupted inputs onto the submanifold of uncorrupted examples. (According to our training objective in (1), an optimal solution is given by a model $f^*$ such that $f_S^*(\hat{\boldsymbol{x}}) = \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}[\boldsymbol{x}_S \,|\, \hat{\boldsymbol{x}}_{\bar{S}}]$. Therefore, $f^*(\hat{\boldsymbol{x}})$ does not necessarily lie in $\mathcal{D}$, and the corresponding approximation error depends on the variance of $\boldsymbol{x}_S$ given $\boldsymbol{x}_{\bar{S}}$.) If the covariance between the input variables rapidly approaches zero with increasing distance, the orthogonal projection maps its input in a way that largely preserves the original content, supporting our intuition about the filtering behavior of the model. Therefore, we can jointly perform the detection and localization of corrupted regions using the pixel-wise reconstruction error.
Furthermore, we deepened the understanding of the relationship between regularized autoencoders. Specifically, we showed that the orthogonal projection provides an optimal solution for the RCAE, which was previously demonstrated to be equivalent to the DAE for small variance of the corruption noise. Our results extend this equivalence to more complex input modifications beyond i.i.d. pixel corruptions and provide a unifying perspective on regularized autoencoders that is not limited by the assumption of small variance.

Author Contributions

Conceptualization, A.B.; Methodology, A.B.; Software, A.B.; Validation, A.B.; Formal Analysis, A.B.; Investigation, A.B., S.N. and K.-R.M.; Writing—Original Draft, A.B.; Writing—Review and Editing, A.B., S.N. and K.-R.M.; Funding Acquisition, K.-R.M. All authors have read and agreed to the published version of this manuscript.

Funding

A.B., S.N. and K.-R.M. acknowledge support from the German Federal Ministry of Education and Research (BMBF) for BIFOLD under grant Nr. BIFOLD24B. K.-R.M. was also supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. RS-2019-II190079, Artificial Intelligence Graduate School Program, Korea University and No. RS-2024-00457882, AI Research Hub Project) and by the German Federal Ministry for Education and Research (BMBF) under grants 01IS14013B-E and 01GQ1115.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of this study; in the collection, analyses, or interpretation of data; in the writing of this manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
AD: Anomaly Detection
AE: Autoencoder
CAE: Contractive Autoencoder
DAE: Denoising Autoencoder
CCM: Convex Combination Module
DNN: Deep Neural Network
PAE: Projecting Autoencoder
RCAE: Reconstruction Contractive Autoencoder
SDC: Stacked Dilated Convolutions

Appendix A. Auxiliary Statements

In order to prove the formal statements in the main body of this paper, we first introduce a set of auxiliary statements including Theorem A1 and Corollary A1, which themselves provide an additional contribution. For convenience, we extend our notation as follows. Given a vector $\boldsymbol{x} \in \mathbb{R}^n$, a set of indices $S \subseteq \{1, \ldots, n\}$, and its complement $\bar{S} := \{1, \ldots, n\} \setminus S$, we interpret the restriction $\boldsymbol{x}_S$ in two different ways depending on the context: either as an element $\boldsymbol{x}_S \in \mathbb{R}^{|S|}$ or as an element $\boldsymbol{x}_S \in \mathbb{R}^n$, where the indices in $\bar{S}$ are set to zero, that is, $\boldsymbol{x}_{\bar{S}} = \boldsymbol{0} \in \mathbb{R}^{|\bar{S}|}$, and vice versa. Given such notation, we can write $\boldsymbol{x} = \boldsymbol{x}_S + \boldsymbol{x}_{\bar{S}}$.
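The two readings of the restriction $\boldsymbol{x}_S$ can be made concrete in a few lines of NumPy (a small illustration of the notation, not code from the paper):

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
S = np.array([0, 2])                               # indices in S (0-based)
S_bar = np.setdiff1d(np.arange(x.size), S)         # complement of S

x_S_short = x[S]                                   # x_S as an element of R^{|S|}
x_S_embed = np.zeros_like(x); x_S_embed[S] = x[S]  # x_S embedded in R^n
x_Sbar_embed = np.zeros_like(x); x_Sbar_embed[S_bar] = x[S_bar]

print(x_S_short, x_S_embed)
assert np.allclose(x, x_S_embed + x_Sbar_embed)    # x = x_S + x_S_bar
```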
Lemma A1.
Consider an $\boldsymbol{x} \in \mathbb{R}^n$, $n \in \mathbb{N}$, and $S \subseteq \{1, \ldots, n\}$, $\bar{S} := \{1, \ldots, n\} \setminus S$. The following holds true for all $p \in \mathbb{N}_+$:
$$\lVert \boldsymbol{x}_S \pm \boldsymbol{x}_{\bar{S}} \rVert_p^p = \lVert \boldsymbol{x}_S \rVert_p^p + \lVert \boldsymbol{x}_{\bar{S}} \rVert_p^p.$$
The proof follows directly by writing
$$\lVert \boldsymbol{x}_S \pm \boldsymbol{x}_{\bar{S}} \rVert_p^p = \left( \Big( \sum_{i \in S} |x_i \pm 0|^p + \sum_{j \in \bar{S}} |0 \pm x_j|^p \Big)^{\frac{1}{p}} \right)^p = \lVert \boldsymbol{x}_S \rVert_p^p + \lVert \boldsymbol{x}_{\bar{S}} \rVert_p^p.$$
Lemma A2.
For all $\hat{\boldsymbol{x}}, \boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^n$, $n \in \mathbb{N}$, with $\hat{\boldsymbol{x}}_{\bar{S}} = \boldsymbol{x}_{\bar{S}}$ for some non-empty sets $S \subseteq \{1, \ldots, n\}$, $\bar{S} := \{1, \ldots, n\} \setminus S$, the following holds true for all $p \in \mathbb{N}_+$:
$$\lVert \hat{\boldsymbol{x}} - \boldsymbol{x} \rVert_p \leq \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert_p \;\Longleftrightarrow\; \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{x}_S \rVert_p^p - \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{y}_S \rVert_p^p \leq \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_p^p.$$
Proof. 
$$\begin{aligned}
\lVert \hat{\boldsymbol{x}} - \boldsymbol{x} \rVert_p \leq \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert_p \;&\Longleftrightarrow\; \lVert \hat{\boldsymbol{x}} - \boldsymbol{x} \rVert_p^p \leq \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert_p^p \\
&\Longleftrightarrow\; \lVert (\hat{\boldsymbol{x}}_S - \boldsymbol{x}_S) + (\hat{\boldsymbol{x}}_{\bar{S}} - \boldsymbol{x}_{\bar{S}}) \rVert_p^p \leq \lVert (\hat{\boldsymbol{x}}_S - \boldsymbol{y}_S) + (\hat{\boldsymbol{x}}_{\bar{S}} - \boldsymbol{y}_{\bar{S}}) \rVert_p^p \\
&\overset{(*)}{\Longleftrightarrow}\; \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{x}_S \rVert_p^p \leq \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{y}_S \rVert_p^p + \lVert \hat{\boldsymbol{x}}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_p^p \\
&\Longleftrightarrow\; \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{x}_S \rVert_p^p - \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{y}_S \rVert_p^p \leq \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_p^p,
\end{aligned}$$
where, in step $(*)$, we used Lemma A1 and our assumption $\hat{\boldsymbol{x}}_{\bar{S}} = \boldsymbol{x}_{\bar{S}}$. □

Appendix B. Weak Law of Large Numbers for MRFs

Here, we derive a version of the weak law of large numbers adapted to our case of MRFs with identically distributed but non-independent variables.
Theorem A1
(Weak Law of Large Numbers for MRFs). Let $(\{x_1, \ldots, x_n\})_{n \in \mathbb{N}}$ be a sequence of MRFs over identically distributed variables $x_i$ with finite variance $\operatorname{Var}[x_i] < \infty$ and vanishing covariance
$$\operatorname{Cov}[x_i, x_j] \to 0 \quad \text{for} \quad d_G(x_i, x_j) \to \infty.$$
Then the following is true:
$$\frac{1}{n} \sum_{i=1}^{n} x_i \;\overset{P}{\longrightarrow}\; \mathbb{E}[x_1] \quad \text{for} \quad n \to \infty.$$
Proof. 
First, we reorganize the individual terms in the definition of the variance as follows:
$$\operatorname{Var}\Big[\sum_{i=1}^{n} x_i\Big] = \mathbb{E}\Big[\Big(\sum_{i=1}^{n} x_i - \mathbb{E}\Big[\sum_{i=1}^{n} x_i\Big]\Big)^2\Big] = \mathbb{E}\Big[\Big(\sum_{i=1}^{n} \big(x_i - \mathbb{E}[x_i]\big)\Big)^2\Big] = \sum_{i=1}^{n}\sum_{j=1}^{n} \operatorname{Cov}[x_i, x_j] = \sum_{i=1}^{n} \operatorname{Var}[x_i] + 2 \sum_{i=1}^{n}\sum_{j=i+1}^{n} \operatorname{Cov}[x_i, x_j].$$
Let $\epsilon > 0$. Since $\operatorname{Cov}[x_i, x_j] \to 0$ for $d_G(x_i, x_j) \to \infty$, there is an $L \in \mathbb{N}$ such that $\operatorname{Cov}[x_i, x_j] < \epsilon$ for all $i, j$ satisfying $d_G(x_i, x_j) > L$. Since we later consider $n \to \infty$, we can assume $n > L$ and split the inner sum of the covariance term in the last equation into two sums. Here, we abbreviate the index sets $\{j \in \mathbb{N} : i+1 \leq j \leq n,\ d_G(x_i, x_j) \leq L\}$ and $\{j \in \mathbb{N} : i+1 \leq j \leq n,\ d_G(x_i, x_j) > L\}$ by $d_G(x_i, x_j) \leq L$ and $d_G(x_i, x_j) > L$, respectively. That is,
$$\sum_{i=1}^{n}\sum_{j=i+1}^{n} \operatorname{Cov}[x_i, x_j] = \sum_{i=1}^{n} \Bigg( \sum_{j :\, d_G(x_i, x_j) \leq L} \underbrace{\operatorname{Cov}[x_i, x_j]}_{\leq\, \sigma^2} + \sum_{j :\, d_G(x_i, x_j) > L} \underbrace{\operatorname{Cov}[x_i, x_j]}_{<\, \epsilon} \Bigg) < n \cdot L \cdot \sigma^2 + n^2 \cdot \epsilon,$$
where, in the last step, we used, on the one hand, the Cauchy–Schwarz inequality $|\operatorname{Cov}[x_i, x_j]| \leq \sqrt{\operatorname{Var}[x_i] \cdot \operatorname{Var}[x_j]} = \operatorname{Var}[x_i] =: \sigma^2$ and, on the other hand, the upper bound $L$ on the cardinality of the running index set in the first sum. Altogether, we obtain the following estimate:
$$\frac{1}{n^2} \operatorname{Var}\Big[\sum_{i=1}^{n} x_i\Big] < \frac{\sigma^2}{n} + \frac{2 L \sigma^2}{n} + 2\epsilon.$$
Since the first two terms on the right-hand side converge to zero (for $n \to \infty$), and $\epsilon$ can be chosen arbitrarily small, this implies
$$\lim_{n \to \infty} \frac{1}{n^2} \operatorname{Var}\Big[\sum_{i=1}^{n} x_i\Big] = 0.$$
Finally, for all $\epsilon > 0$,
$$P\Big( \Big| \frac{1}{n}\sum_{i=1}^{n} x_i - \mathbb{E}[x_1] \Big| > \epsilon \Big) \overset{(*)}{\leq} \frac{1}{\epsilon^2} \operatorname{Var}\Big[\frac{1}{n}\sum_{i=1}^{n} x_i\Big] = \frac{1}{\epsilon^2 n^2} \operatorname{Var}\Big[\sum_{i=1}^{n} x_i\Big],$$
where, in step $(*)$, we used Chebyshev's inequality. The convergence
$$\lim_{n \to \infty} P\Big( \Big| \frac{1}{n}\sum_{i=1}^{n} x_i - \mathbb{E}[x_1] \Big| > \epsilon \Big) = 0$$
follows from Equation (A5). □
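As a quick sanity check of Theorem A1, a stationary AR(1) sequence provides a simple example of identically distributed but dependent variables whose covariance decays with the distance; a minimal simulation (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_chain(n, rho=0.8):
    """Stationary AR(1): identically distributed marginals N(0, 1),
    Cov[x_i, x_j] = rho^{|i-j|} -> 0 as the graph distance grows."""
    x = np.empty(n)
    x[0] = rng.normal()
    for t in range(1, n):
        x[t] = rho * x[t - 1] + np.sqrt(1 - rho ** 2) * rng.normal()
    return x

for n in (10, 100, 1000, 100000):
    print(n, ar1_chain(n).mean())   # the sample mean approaches E[x_1] = 0
```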

Appendix C. Corollary of Theorem A1

Applying Theorem A1 to our discussion in this paper, we obtain the following useful corollary.
Corollary A1.
Consider a pair $\boldsymbol{x}, \boldsymbol{y}$ of independent MRFs over identically distributed variables $x_i, y_j$ with finite fourth moment $\mathbb{E}[x_i^4] < \infty$ and vanishing covariance according to
$$\operatorname{Cov}[x_i^k, x_j^l] \to 0 \quad \text{for} \quad d_G(x_i, x_j) \to \infty$$
for all $k, l \in \{1, 2\}$. The following statements are true:
$$(a)\quad \frac{1}{|S|} \lVert \boldsymbol{x}_S \rVert_2^2 \;\overset{P}{\longrightarrow}\; \mu^2 + \sigma^2 \quad \text{for } |S| \to \infty,$$
$$(b)\quad \frac{1}{|S|} \lVert \boldsymbol{x}_S - \boldsymbol{y}_S \rVert_2^2 \;\overset{P}{\longrightarrow}\; 2\sigma^2 \quad \text{for } |S| \to \infty,$$
$$(c)\quad \frac{1}{|S|}\, \boldsymbol{x}_S^\top \boldsymbol{y}_S \;\overset{P}{\longrightarrow}\; \mu^2 \quad \text{for } |S| \to \infty,$$
$$(d)\quad \cos \phi[\boldsymbol{x}_S, \boldsymbol{y}_S] \;\overset{P}{\longrightarrow}\; \frac{\mu^2}{\mu^2 + \sigma^2} \quad \text{for } |S| \to \infty,$$
where $\mu := \mathbb{E}[x_i]$, $\sigma^2 := \operatorname{Var}[x_i]$, $\phi[\boldsymbol{x}_S, \boldsymbol{y}_S]$ is the angle between the two vectors, and $S \subseteq \mathbb{N}$ denotes a subset of the variable indices.
Proof. 
First, we prove statement (a) in (A8). By setting $z_i := x_i^2$, we have
$$\mathbb{E}[z_1] = \mathbb{E}[x_1^2] = \operatorname{Var}[x_1] + \mathbb{E}[x_1]^2 = \sigma^2 + \mu^2.$$
On the other hand, we have
$$\frac{1}{|S|} \lVert \boldsymbol{x}_S \rVert^2 = \frac{1}{|S|} \sum_{i \in S} x_i^2 = \frac{1}{|S|} \sum_{i \in S} z_i.$$
Since we assumed $\mathbb{E}[x_i^4] < \infty$, it holds that
$$\operatorname{Var}[z_i] = \operatorname{Var}[x_i^2] = \mathbb{E}[x_i^4] - \mathbb{E}[x_i^2]^2 = \mathbb{E}[x_i^4] - (\sigma^2 + \mu^2)^2 < \infty.$$
Furthermore, due to our assumption in (A7),
$$\operatorname{Cov}[z_i, z_j] = \operatorname{Cov}[x_i^2, x_j^2] \to 0 \quad \text{for} \quad d_G(x_i, x_j) \to \infty.$$
That is, the variables $z_i$ satisfy all the requirements in Theorem A1. Applying this theorem directly to the above derivations proves the claim in (A8).
Now, we prove statement (b) in (A9). For this purpose, we set $z_i := (x_i - y_i)^2$. It holds that
$$\mathbb{E}[z_1] = \mathbb{E}[(x_1 - y_1)^2] = \mathbb{E}[x_1^2] - 2\,\mathbb{E}[x_1 y_1] + \mathbb{E}[y_1^2] = 2\big(\mathbb{E}[x_1^2] - \mathbb{E}[x_1]^2\big) = 2\operatorname{Var}[x_1] = 2\sigma^2.$$
On the other hand, we obtain
$$\frac{1}{|S|} \lVert \boldsymbol{x}_S - \boldsymbol{y}_S \rVert^2 = \frac{1}{|S|} \sum_{i \in S} (x_i - y_i)^2 = \frac{1}{|S|} \sum_{i \in S} z_i.$$
It remains to be shown that the corresponding variance is finite and that the covariance vanishes. Expanding the product, taking expectations, and using the independence of $\boldsymbol{x}$ and $\boldsymbol{y}$ together with the identical distribution of all variables, we obtain
$$\begin{aligned}
\mathbb{E}[(x_i - y_i)^2 (x_j - y_j)^2] &= \mathbb{E}[x_i^2 x_j^2] - 2\,\mathbb{E}[x_i^2 x_j]\mathbb{E}[y_j] + \mathbb{E}[x_i^2]\mathbb{E}[y_j^2] - 2\,\mathbb{E}[x_i x_j^2]\mathbb{E}[y_i] + 4\,\mathbb{E}[x_i x_j]\mathbb{E}[y_i y_j] \\
&\quad - 2\,\mathbb{E}[x_i]\mathbb{E}[y_i y_j^2] + \mathbb{E}[y_i^2]\mathbb{E}[x_j^2] - 2\,\mathbb{E}[x_j]\mathbb{E}[y_i^2 y_j] + \mathbb{E}[y_i^2 y_j^2] \\
&= 2\,\mathbb{E}[x_i^2 x_j^2] - 8\,\mathbb{E}[x_i^2 x_j]\mathbb{E}[x_j] + 2\,\mathbb{E}[x_i^2]\mathbb{E}[x_j^2] + 4\,\mathbb{E}[x_i x_j]^2.
\end{aligned}$$
On the other hand, an analogous expansion yields
$$\mathbb{E}[(x_i - y_i)^2]\,\mathbb{E}[(x_j - y_j)^2] = \big(2\,\mathbb{E}[x_i^2] - 2\,\mathbb{E}[x_i]^2\big)\big(2\,\mathbb{E}[x_j^2] - 2\,\mathbb{E}[x_j]^2\big) = 4\,\mathbb{E}[x_i^2]\mathbb{E}[x_j^2] - 8\,\mathbb{E}[x_i^2]\mathbb{E}[x_j]^2 + 4\,\mathbb{E}[x_i]^2\mathbb{E}[x_j]^2.$$
Based on the above derivations, we now analyze the covariance term
$$\operatorname{Cov}[z_i, z_j] = \mathbb{E}[(x_i - y_i)^2 (x_j - y_j)^2] - \mathbb{E}[(x_i - y_i)^2]\,\mathbb{E}[(x_j - y_j)^2]$$
in the limit $d_G(z_i, z_j) \to \infty$. Because of our assumption in (A7), the terms $\mathbb{E}[x_i^2 x_j^2]$, $\mathbb{E}[x_i^2 x_j]$, and $\mathbb{E}[x_i x_j]$ converge to $\mathbb{E}[x_i^2]\mathbb{E}[x_j^2]$, $\mathbb{E}[x_i^2]\mathbb{E}[x_j]$, and $\mathbb{E}[x_i]\mathbb{E}[x_j]$, respectively. It follows that
$$\lim_{d_G(z_i, z_j) \to \infty} \mathbb{E}[(x_i - y_i)^2 (x_j - y_j)^2] = 4\,\mathbb{E}[x_i^2]\mathbb{E}[x_j^2] - 8\,\mathbb{E}[x_i^2]\mathbb{E}[x_j]^2 + 4\,\mathbb{E}[x_i]^2\mathbb{E}[x_j]^2 = \mathbb{E}[(x_i - y_i)^2]\,\mathbb{E}[(x_j - y_j)^2].$$
Therefore, $\operatorname{Cov}[z_i, z_j] \to 0$ for $d_G(z_i, z_j) \to \infty$. Finally, by writing out
$$\operatorname{Var}[z_i] = \mathbb{E}[z_i^2] - \mathbb{E}[z_i]^2 = \mathbb{E}[(x_i - y_i)^4] - \mathbb{E}[(x_i - y_i)^2]^2$$
and using the independence of $x_i$ and $y_i$, we can see that the degree of each monomial is upper-bounded by 4. Due to our assumption of a finite fourth moment, this implies $\operatorname{Var}[z_i] < \infty$. Altogether, the variables $z_i$ satisfy the requirements in Theorem A1. Applying this theorem directly to the above derivations proves the claim in (A9).
Statement (c) in (A10) can be proven in a similar way. Alternatively, we can use statements (a) and (b) and apply the limit algebra of convergence in probability,
$$\underbrace{\frac{1}{|S|} \lVert \boldsymbol{x}_S - \boldsymbol{y}_S \rVert^2}_{\overset{P}{\to}\, 2\sigma^2} = \underbrace{\frac{1}{|S|} \lVert \boldsymbol{x}_S \rVert^2}_{\overset{P}{\to}\, \mu^2 + \sigma^2} - \frac{2}{|S|}\, \boldsymbol{x}_S^\top \boldsymbol{y}_S + \underbrace{\frac{1}{|S|} \lVert \boldsymbol{y}_S \rVert^2}_{\overset{P}{\to}\, \mu^2 + \sigma^2},$$
which implies $\frac{1}{|S|}\, \boldsymbol{x}_S^\top \boldsymbol{y}_S \overset{P}{\to} \mu^2$.
Similarly, we prove statement (d) in (A11) as follows:
$$\cos \phi[\boldsymbol{x}_S, \boldsymbol{y}_S] \cdot \lVert \boldsymbol{x}_S \rVert \cdot \lVert \boldsymbol{y}_S \rVert = \boldsymbol{x}_S^\top \boldsymbol{y}_S \;\Rightarrow\; \cos \phi[\boldsymbol{x}_S, \boldsymbol{y}_S] \cdot \underbrace{\frac{\lVert \boldsymbol{x}_S \rVert}{\sqrt{|S|}}}_{\overset{P}{\to}\, \sqrt{\mu^2 + \sigma^2}} \cdot \underbrace{\frac{\lVert \boldsymbol{y}_S \rVert}{\sqrt{|S|}}}_{\overset{P}{\to}\, \sqrt{\mu^2 + \sigma^2}} = \underbrace{\frac{1}{|S|}\, \boldsymbol{x}_S^\top \boldsymbol{y}_S}_{\overset{P}{\to}\, \mu^2} \;\Rightarrow\; \cos \phi[\boldsymbol{x}_S, \boldsymbol{y}_S] \overset{P}{\to} \frac{\mu^2}{\mu^2 + \sigma^2} \quad \text{for } |S| \to \infty. \;\square$$

Appendix D. Proof of Theorem 1

Proof. 
Consider the case
$$\lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 < \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{x}_S \rVert_2^2 - \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{y}_S \rVert_2^2 \leq \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{x}_S \rVert_2^2 \leq |S|,$$
where we assume the first inequality to be true, and the last inequality holds due to $\hat{x}_i, x_j \in [0,1]$. This implies that
$$\big\{ (\boldsymbol{x}, \boldsymbol{y}) : \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 < \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{x}_S \rVert_2^2 - \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{y}_S \rVert_2^2 \big\} \subseteq \big\{ (\boldsymbol{x}, \boldsymbol{y}) : \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 < |S| \big\}.$$
Consider now the following derivation:
$$\lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 < |S| \;\Leftrightarrow\; \frac{1}{|\bar{S}|} \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 < \frac{|S|}{|\bar{S}|} \;\Rightarrow\; \frac{1}{|\bar{S}|} \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 < 2\sigma^2 \;\Leftrightarrow\; \frac{1}{|\bar{S}|} \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 - 2\sigma^2 < 0,$$
where, in the second step, we use $|S| \leq 2\sigma^2 |\bar{S}|$. Therefore, it follows from (A16) that
$$\begin{aligned}
\big\{ (\boldsymbol{x}, \boldsymbol{y}) :\, \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 &< \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{x}_S \rVert_2^2 - \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{y}_S \rVert_2^2 \big\} \subseteq \Big\{ (\boldsymbol{x}, \boldsymbol{y}) : \Big| \tfrac{1}{|\bar{S}|} \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 - 2\sigma^2 \Big| > 0 \Big\} \\
&= \Big\{ (\boldsymbol{x}, \boldsymbol{y}) : \Big| \tfrac{1}{|\bar{S}|} \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 - 2\sigma^2 \Big| > 1 \Big\} \cup \bigcup_{k \geq 2} \Big\{ (\boldsymbol{x}, \boldsymbol{y}) : \tfrac{1}{k-1} \geq \Big| \tfrac{1}{|\bar{S}|} \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 - 2\sigma^2 \Big| > \tfrac{1}{k} \Big\}.
\end{aligned}$$
Using the above derivations, we now upper-bound the probability of the corresponding events, where $P = P_{\boldsymbol{x}, \boldsymbol{y}}$ denotes the joint probability distribution over $(\boldsymbol{x}, \boldsymbol{y})$:
$$P\big( \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 < \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{x}_S \rVert_2^2 - \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{y}_S \rVert_2^2 \big) \leq P\Big( \bigcup_k \Big\{ \tfrac{1}{k-1} \geq \Big| \tfrac{1}{|\bar{S}|} \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 - 2\sigma^2 \Big| > \tfrac{1}{k} \Big\} \Big) \overset{(*)}{=} \sum_k P\Big( \tfrac{1}{k-1} \geq \Big| \tfrac{1}{|\bar{S}|} \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 - 2\sigma^2 \Big| > \tfrac{1}{k} \Big),$$
where, in step $(*)$, we use the fact that all the sets in the union are disjoint. Note that we use the short notation $\bigcup_k$ and $\sum_k$, which runs over all previously defined terms (including $k = 1$). Next, we consider the above derivations in the limit $n \to \infty$ for $S \subseteq \{1, \ldots, n\}$, $\bar{S} := \{1, \ldots, n\} \setminus S$, with a fixed ratio $|S| / |\bar{S}| \leq 2\sigma^2$. The following holds:
$$\begin{aligned}
\lim_{n \to \infty} P\big( \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 < \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{x}_S \rVert_2^2 - \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{y}_S \rVert_2^2 \big) &\leq \lim_{n \to \infty} \underbrace{\sum_k P\Big( \tfrac{1}{k-1} \geq \Big| \tfrac{1}{|\bar{S}|} \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 - 2\sigma^2 \Big| > \tfrac{1}{k} \Big)}_{\leq\, 1} \\
&\overset{(*)}{=} \sum_k \lim_{n \to \infty} P\Big( \tfrac{1}{k-1} \geq \Big| \tfrac{1}{|\bar{S}|} \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 - 2\sigma^2 \Big| > \tfrac{1}{k} \Big) \\
&\leq \sum_k \underbrace{\lim_{n \to \infty} P\Big( \Big| \tfrac{1}{|\bar{S}|} \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 - 2\sigma^2 \Big| > \tfrac{1}{k} \Big)}_{=\, 0} = 0,
\end{aligned}$$
where, in step $(*)$, we can swap the limit and the sum signs because the series converges for all $n \in \mathbb{N}$. The last step follows from Corollary A1. To summarize, we have just shown that
$$\lim_{n \to \infty} P_{\boldsymbol{x}, \boldsymbol{y}} \big( \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{x}_S \rVert_2^2 - \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{y}_S \rVert_2^2 \leq \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 \big) = 1.$$
According to Lemma A2, the following holds:
$$\lVert \hat{\boldsymbol{x}}_S - \boldsymbol{x}_S \rVert_2^2 - \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{y}_S \rVert_2^2 \leq \lVert \boldsymbol{x}_{\bar{S}} - \boldsymbol{y}_{\bar{S}} \rVert_2^2 \;\Longleftrightarrow\; \lVert \hat{\boldsymbol{x}} - \boldsymbol{x} \rVert_2 \leq \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert_2.$$
Together, it follows from (A17) and (A18) that
$$\lim_{n \to \infty} P_{\boldsymbol{x}, \boldsymbol{y}} \big( \lVert \hat{\boldsymbol{x}} - \boldsymbol{x} \rVert_2 \leq \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert_2 \big) = 1.$$
Note that $\boldsymbol{x}$ in the above derivation can be identified with the image of $\hat{\boldsymbol{x}}$ under the conservative projection. In order to prove the following equality (where $f := f_{\mathrm{con}}$),
$$\lim_{n \to \infty} P_{\boldsymbol{x}, \boldsymbol{y}} \big( \lVert \hat{\boldsymbol{x}} - f(\hat{\boldsymbol{x}}) \rVert_2 \leq \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert_2 \big) = 1,$$
it suffices to show that, for all pairs $(\boldsymbol{x}, \boldsymbol{y}) \in \mathcal{D} \times \mathcal{D}$, the following implication holds true:
$$\lVert \hat{\boldsymbol{x}} - \boldsymbol{x} \rVert_2 \leq \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert_2 \;\Longrightarrow\; \lVert \hat{\boldsymbol{x}} - f(\hat{\boldsymbol{x}}) \rVert_2 \leq \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert_2.$$
Namely, the following holds:
$$\lVert \hat{\boldsymbol{x}} - f(\hat{\boldsymbol{x}}) \rVert_2 = \lVert \hat{\boldsymbol{x}}_S - f_S(\hat{\boldsymbol{x}}) \rVert_2 + \underbrace{\lVert \hat{\boldsymbol{x}}_{\bar{S}} - f_{\bar{S}}(\hat{\boldsymbol{x}}) \rVert_2}_{=\, 0} \overset{(*)}{\leq} \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{x}_S \rVert_2 + \underbrace{\lVert \hat{\boldsymbol{x}}_{\bar{S}} - \boldsymbol{x}_{\bar{S}} \rVert_2}_{=\, 0} = \lVert \hat{\boldsymbol{x}} - \boldsymbol{x} \rVert_2 \leq \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert_2,$$
where, in step $(*)$, we use our Definition 3 of the conservative projection. □

Appendix E. Proof of Theorem 2

Lemma A3.
Let $\boldsymbol{x}, \boldsymbol{y}, \hat{\boldsymbol{x}}, \boldsymbol{\epsilon} \in \mathbb{R}^n$, where $\hat{\boldsymbol{x}} := \boldsymbol{x} + \boldsymbol{\epsilon}$. Then, the following holds true:
$$\lVert \boldsymbol{\epsilon} \rVert \leq \tfrac{1}{2} \lVert \boldsymbol{x} - \boldsymbol{y} \rVert \;\Longrightarrow\; \lVert \hat{\boldsymbol{x}} - \boldsymbol{x} \rVert \leq \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert.$$
Proof. 
The following holds:
$$\lVert \hat{\boldsymbol{x}} - \boldsymbol{x} \rVert \leq \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert \;\Leftrightarrow\; \lVert \boldsymbol{x} + \boldsymbol{\epsilon} - \boldsymbol{x} \rVert \leq \lVert \boldsymbol{x} + \boldsymbol{\epsilon} - \boldsymbol{y} \rVert \;\Leftrightarrow\; \lVert \boldsymbol{\epsilon} \rVert \leq \lVert (\boldsymbol{x} - \boldsymbol{y}) + \boldsymbol{\epsilon} \rVert \;\Leftarrow\; \lVert \boldsymbol{\epsilon} \rVert \leq \lVert \boldsymbol{x} - \boldsymbol{y} \rVert - \lVert \boldsymbol{\epsilon} \rVert \;\Leftrightarrow\; \lVert \boldsymbol{\epsilon} \rVert \leq \tfrac{1}{2} \lVert \boldsymbol{x} - \boldsymbol{y} \rVert,$$
where the third step uses the reverse triangle inequality $\lVert (\boldsymbol{x} - \boldsymbol{y}) + \boldsymbol{\epsilon} \rVert \geq \lVert \boldsymbol{x} - \boldsymbol{y} \rVert - \lVert \boldsymbol{\epsilon} \rVert$. □
We now prove Theorem 2. First, we show that
$$\lim_{n \to \infty} P\big( \lVert \boldsymbol{\epsilon} \rVert \leq \tfrac{1}{2} \lVert \boldsymbol{x} - \boldsymbol{y} \rVert \big) = 1.$$
It holds that
$$\lVert \boldsymbol{\epsilon} \rVert^2 \leq \tfrac{1}{4} \lVert \boldsymbol{x} - \boldsymbol{y} \rVert^2 \;\Longleftrightarrow\; \underbrace{\tfrac{1}{n} \lVert \boldsymbol{\epsilon} \rVert^2 - (\mu_\epsilon^2 + \sigma_\epsilon^2)}_{=:\, X} \leq \underbrace{\tfrac{1}{4}\Big( \tfrac{1}{n} \lVert \boldsymbol{x} - \boldsymbol{y} \rVert^2 - 2\sigma^2 \Big)}_{=:\, Y} + \underbrace{\tfrac{1}{2}\sigma^2 - (\mu_\epsilon^2 + \sigma_\epsilon^2)}_{=:\, C},$$
where $C \geq 0$ by the assumption in (22). It follows that
$$\begin{aligned}
\{X > Y + C\} \subseteq \{X > Y\} &= \{X > Y,\, X \geq 0,\, Y \geq 0\} \cup \{X > Y,\, X \geq 0,\, Y < 0\} \cup \{X > Y,\, X < 0,\, Y < 0\} \\
&= \{|X| > |Y|,\, X \geq 0,\, Y \geq 0\} \cup \{X \geq 0,\, Y < 0\} \cup \{|X| < |Y|,\, X < 0,\, Y < 0\} \\
&\subseteq \{|X| > |Y|\} \cup \{|X| > 0\} \cup \{|Y| > 0\} \cup \{|X| < |Y|\},
\end{aligned}$$
where the case $X < 0 \leq Y$ is excluded by $X > Y$, and $\{X \geq 0,\, Y < 0\} \subseteq \{|X| > 0\} \cup \{|Y| > 0\}$.
That is,
$$P(X > Y + C) \leq P(|X| > |Y|) + P(|X| > 0) + P(|Y| > 0) + P(|X| < |Y|).$$
We can represent $\{|X| > |Y|\}$ and $\{|X| > 0\}$ as unions of disjoint sets according to
$$\{|X| > |Y|\} = \bigcup_{k \geq 1} \big\{ k+1 \geq |Y| > k,\ |X| > k+1 \big\} \cup \bigcup_{k \geq 2} \big\{ \tfrac{1}{k-1} \geq |Y| > \tfrac{1}{k},\ |X| > \tfrac{1}{k-1} \big\}$$
and
$$\{|X| > 0\} = \bigcup_{k \geq 1} \big\{ k+1 \geq |X| > k \big\} \cup \bigcup_{k \geq 2} \big\{ \tfrac{1}{k-1} \geq |X| > \tfrac{1}{k} \big\}.$$
We obtain similar representations for $\{|X| < |Y|\}$ and $\{|Y| > 0\}$. Now, we consider the above derivations in the limit $n \to \infty$.
We look at the individual terms on the right-hand side of the inequality in (A22) separately:
$$\begin{aligned}
\lim_{n \to \infty} P(|X| > |Y|) &= \lim_{n \to \infty} P\Big( \bigcup_{k \geq 1} \big\{ k+1 \geq |Y| > k,\ |X| > k+1 \big\} \cup \bigcup_{k \geq 2} \big\{ \tfrac{1}{k-1} \geq |Y| > \tfrac{1}{k},\ |X| > \tfrac{1}{k-1} \big\} \Big) \\
&\overset{(*)}{=} \lim_{n \to \infty} \Big( \underbrace{\sum_{k \geq 1} P\big( k+1 \geq |Y| > k,\ |X| > k+1 \big)}_{\leq\, 1} + \underbrace{\sum_{k \geq 2} P\big( \tfrac{1}{k-1} \geq |Y| > \tfrac{1}{k},\ |X| > \tfrac{1}{k-1} \big)}_{\leq\, 1} \Big) \\
&\overset{(+)}{=} \sum_{k \geq 1} \lim_{n \to \infty} P\big( k+1 \geq |Y| > k,\ |X| > k+1 \big) + \sum_{k \geq 2} \lim_{n \to \infty} P\big( \tfrac{1}{k-1} \geq |Y| > \tfrac{1}{k},\ |X| > \tfrac{1}{k-1} \big) \\
&\leq \sum_{k \geq 1} \underbrace{\lim_{n \to \infty} P\Big( \Big| \tfrac{1}{n} \lVert \boldsymbol{\epsilon} \rVert^2 - (\mu_\epsilon^2 + \sigma_\epsilon^2) \Big| > k+1 \Big)}_{=\, 0} + \sum_{k \geq 2} \underbrace{\lim_{n \to \infty} P\Big( \Big| \tfrac{1}{n} \lVert \boldsymbol{x} - \boldsymbol{y} \rVert^2 - 2\sigma^2 \Big| > \tfrac{1}{k} \Big)}_{=\, 0} = 0,
\end{aligned}$$
where, in step $(*)$, we use the fact that all the sets are disjoint. In step $(+)$, we swap the limit and the sum signs because the two sums are bounded for all $n \in \mathbb{N}$. In the last step, we use the weak law of large numbers and its variant for dependent variables with vanishing covariance (Theorem A1). Similarly, we obtain
$$\lim_{n \to \infty} P(|X| < |Y|) = \lim_{n \to \infty} P(|X| > 0) = \lim_{n \to \infty} P(|Y| > 0) = 0.$$
Therefore,
$$\lim_{n \to \infty} P\big( \lVert \boldsymbol{\epsilon} \rVert \leq \tfrac{1}{2} \lVert \boldsymbol{x} - \boldsymbol{y} \rVert \big) = \lim_{n \to \infty} P(X \leq Y + C) = 1 - \lim_{n \to \infty} P(X > Y + C) = 1.$$
Applying Lemma A3 to the above equation completes the proof:
$$\lim_{n \to \infty} P\big( \lVert \hat{\boldsymbol{x}} - \boldsymbol{x} \rVert \leq \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert \big) = 1. \;\square$$

Appendix F. Proof of Proposition 1

In order to prove Proposition 1 in the main body of this paper, we first introduce a set of auxiliary statements.
Lemma A4.
Let $f \colon U \to \mathbb{R}^n$, $U \subseteq \mathbb{R}^n$, $n \in \mathbb{N}$, be idempotent. Then, the equality
$$f(\boldsymbol{x}) = \boldsymbol{x}$$
holds for all $\boldsymbol{x} \in f(U)$.
Proof. 
For any $\boldsymbol{x} \in f(U)$, there is a $\boldsymbol{z} \in U$ with $f(\boldsymbol{z}) = \boldsymbol{x}$. It follows that
$$f(\boldsymbol{z}) = \boldsymbol{x} \;\Rightarrow\; f(f(\boldsymbol{z})) = f(\boldsymbol{x}) \;\Rightarrow\; f(\boldsymbol{z}) = f(\boldsymbol{x}) \;\Rightarrow\; \boldsymbol{x} = f(\boldsymbol{x}),$$
where, in the penultimate step, we use the idempotency assumption, and, in the last step, $\boldsymbol{x} = f(\boldsymbol{z})$. □
Lemma A5.
Let $\mathcal{D} \subseteq \mathbb{R}^n$, $n \in \mathbb{N}$, be a differentiable manifold and $f, g \colon \mathbb{R}^n \to \mathbb{R}^n$ be differentiable mappings such that $f(\boldsymbol{x}) = g(\boldsymbol{x})$ for all $\boldsymbol{x} \in \mathcal{D}$. Then, the equality
$$D_{\boldsymbol{x}} f(\boldsymbol{v}) = D_{\boldsymbol{x}} g(\boldsymbol{v})$$
holds true for all $\boldsymbol{x} \in \mathcal{D}$, $\boldsymbol{v} \in T_{\boldsymbol{x}} \mathcal{D}$.
Proof. 
Let $\boldsymbol{x} \in \mathcal{D}$ and $\boldsymbol{v} \in T_{\boldsymbol{x}} \mathcal{D} \setminus \{\boldsymbol{0}\}$. Since $\mathcal{D}$ is a differentiable manifold, there exists a parametrization $\phi \colon U \to \mathbb{R}^m$, $\boldsymbol{x} \in U$, such that $f \circ \phi^{-1}$ and $g \circ \phi^{-1}$ are differentiable. Because $D_{\boldsymbol{q}} \phi^{-1}(\mathbb{R}^m) = T_{\boldsymbol{x}} \mathcal{D}$ holds for $\boldsymbol{q} = \phi(\boldsymbol{x})$, there exists a $\boldsymbol{w} \in \mathbb{R}^m$ such that $D_{\boldsymbol{q}} \phi^{-1}(\boldsymbol{w}) = \boldsymbol{v}$. It follows that:
$$\begin{aligned}
f(\boldsymbol{x}) = g(\boldsymbol{x}) \ \text{for all } \boldsymbol{x} \in \mathcal{D} \;&\Rightarrow\; f \circ \phi^{-1}(\boldsymbol{q}) = g \circ \phi^{-1}(\boldsymbol{q}) \ \text{for all } \boldsymbol{q} \in \phi(U) \\
&\Rightarrow\; D_{\boldsymbol{q}} \big( f \circ \phi^{-1} \big) = D_{\boldsymbol{q}} \big( g \circ \phi^{-1} \big) \;\Rightarrow\; D_{\phi^{-1}(\boldsymbol{q})} f \circ D_{\boldsymbol{q}} \phi^{-1} = D_{\phi^{-1}(\boldsymbol{q})} g \circ D_{\boldsymbol{q}} \phi^{-1} \\
&\Rightarrow\; D_{\boldsymbol{x}} f \big( D_{\boldsymbol{q}} \phi^{-1}(\boldsymbol{w}) \big) = D_{\boldsymbol{x}} g \big( D_{\boldsymbol{q}} \phi^{-1}(\boldsymbol{w}) \big) \ \text{for all } \boldsymbol{w} \in \mathbb{R}^m \;\Rightarrow\; D_{\boldsymbol{x}} f(\boldsymbol{v}) = D_{\boldsymbol{x}} g(\boldsymbol{v}). \;\square
\end{aligned}$$
Lemma A6.
Let $\mathcal{D} \subseteq \mathbb{R}^n$ be a differentiable manifold. Consider an orthogonal projection $f$ and another projection $g$ onto $\mathcal{D}$. The following holds true for all $\boldsymbol{x} \in \mathcal{D}$:
$$\lVert D_{\boldsymbol{x}} f \rVert_2 \leq \lVert D_{\boldsymbol{x}} g \rVert_2.$$
Proof. 
Let $\boldsymbol{x} \in \mathcal{D}$ be fixed. Because $\mathbb{R}^n = T_{\boldsymbol{x}} \mathcal{D} \oplus (T_{\boldsymbol{x}} \mathcal{D})^\perp$, each $\boldsymbol{v} \in \mathbb{R}^n$ can be represented as a sum of two vectors $\boldsymbol{v} = \boldsymbol{u} + \boldsymbol{w}$, where $\boldsymbol{u} \in T_{\boldsymbol{x}} \mathcal{D}$, $\boldsymbol{w} \in (T_{\boldsymbol{x}} \mathcal{D})^\perp$. In particular, $\boldsymbol{u}^\top \boldsymbol{w} = 0$ holds. It follows that
$$\lVert D_{\boldsymbol{x}} f \rVert_2 = \sup_{\boldsymbol{v} \neq \boldsymbol{0}} \frac{\lVert D_{\boldsymbol{x}} f(\boldsymbol{v}) \rVert_2}{\lVert \boldsymbol{v} \rVert_2} = \sup_{\boldsymbol{u} + \boldsymbol{w} \neq \boldsymbol{0}} \frac{\lVert D_{\boldsymbol{x}} f(\boldsymbol{u} + \boldsymbol{w}) \rVert_2}{\lVert \boldsymbol{u} + \boldsymbol{w} \rVert_2} \overset{(*)}{=} \sup_{\boldsymbol{u} + \boldsymbol{w} \neq \boldsymbol{0}} \frac{\lVert D_{\boldsymbol{x}} g(\boldsymbol{u}) \rVert_2}{\lVert \boldsymbol{u} + \boldsymbol{w} \rVert_2} \overset{(**)}{=} \sup_{\boldsymbol{u} + \boldsymbol{w} \neq \boldsymbol{0},\, \boldsymbol{u} \neq \boldsymbol{0}} \frac{\lVert D_{\boldsymbol{x}} g(\boldsymbol{u}) \rVert_2}{\lVert \boldsymbol{u} + \boldsymbol{w} \rVert_2} \overset{(***)}{\leq} \sup_{\boldsymbol{u} \neq \boldsymbol{0}} \frac{\lVert D_{\boldsymbol{x}} g(\boldsymbol{u}) \rVert_2}{\lVert \boldsymbol{u} \rVert_2} = \lVert D_{\boldsymbol{x}} g \rVert_2,$$
where, in step $(*)$, we use the orthogonality of $f$, according to which $D_{\boldsymbol{x}} f(\boldsymbol{w}) = \boldsymbol{0}$ for all $\boldsymbol{w} \in (T_{\boldsymbol{x}} \mathcal{D})^\perp$, and $D_{\boldsymbol{x}} f(\boldsymbol{u}) = D_{\boldsymbol{x}} g(\boldsymbol{u})$ for all $\boldsymbol{u} \in T_{\boldsymbol{x}} \mathcal{D}$ (according to Lemma A5). In step $(**)$, we use the fact that the supremum of the corresponding term is achieved for $\boldsymbol{u} \neq \boldsymbol{0}$, since $\lVert D_{\boldsymbol{x}} g(\boldsymbol{0}) \rVert_2 = 0$, and we can always find a $\boldsymbol{u} \neq \boldsymbol{0}$ with $\lVert D_{\boldsymbol{x}} g(\boldsymbol{u}) \rVert_2, \lVert \boldsymbol{u} + \boldsymbol{w} \rVert_2 > 0$. In step $(***)$, we use the following argument: $\boldsymbol{u}^\top \boldsymbol{w} = 0 \Rightarrow \lVert \boldsymbol{u} + \boldsymbol{w} \rVert_2 \geq \lVert \boldsymbol{u} \rVert_2$. □
Corollary A2.
Let $\mathcal{D}$ be a differentiable manifold. Consider an orthogonal projection $f$ and another projection $g$ onto $\mathcal{D}$. The following holds true for all $\boldsymbol{x} \in \mathcal{D}$:
$$\lVert D_{\boldsymbol{x}} f \rVert_F \leq \lVert D_{\boldsymbol{x}} g \rVert_F.$$
Proof. 
The proof follows directly from Lemma A6. Namely, based on the geometric interpretation, the singular values of a matrix $A$ correspond to the lengths of the major axes of the ellipsoid $E_A$, which is given by the image of the Euclidean unit ball under the linear transformation $A$. If the inequality $\lVert A\boldsymbol{x} \rVert \leq \lVert B\boldsymbol{x} \rVert$ holds for all $\boldsymbol{x}$, then $E_A$ is completely contained within $E_B$. Let $\lambda_1, \ldots, \lambda_r$ and $\mu_1, \ldots, \mu_k$, $r \leq k$, denote the positive singular values (in descending order) of $A$ and $B$, respectively. Then, this implies $\lambda_i \leq \mu_i$ for all $i \in \{1, \ldots, r\}$. It follows that
$$\lVert A \rVert_F = \sqrt{\lambda_1^2 + \cdots + \lambda_r^2} \leq \sqrt{\mu_1^2 + \cdots + \mu_k^2} = \lVert B \rVert_F.$$
In particular, since $D_{\boldsymbol{x}} f(\boldsymbol{v}) = D_{\boldsymbol{x}} g(\boldsymbol{v})$ for all $\boldsymbol{v} \in T_{\boldsymbol{x}} \mathcal{D}$, the singular vectors lying in $T_{\boldsymbol{x}} \mathcal{D}$ (and the corresponding singular values) are the same for both matrices. □
Now, we can prove the statement in Proposition 1. Note that any idempotent mapping $f \in C^1(U)$ satisfies $f(\boldsymbol{x}) - \boldsymbol{x} = \boldsymbol{0}$ for all $\boldsymbol{x} \in \mathcal{D}$ (Lemma A4), and, according to Lemma A6 and Corollary A2, $\lVert D_{\boldsymbol{x}} f^* \rVert_2 \leq \lVert D_{\boldsymbol{x}} f \rVert_2$ and $\lVert D_{\boldsymbol{x}} f^* \rVert_F \leq \lVert D_{\boldsymbol{x}} f \rVert_F$ for all $\boldsymbol{x} \in \mathcal{D}$ and $f \in C^1(U)$. Since $f^*$ minimizes both terms of the sum in (11), it is an optimal solution. □

Appendix G. Proof of Proposition 2

Consider the plot in Figure 9, illustrating the unit circles with respect to the $l_p$-norm for $p \in \{1, 2, \infty\}$ centered around the point $\hat{\boldsymbol{x}} \in \mathbb{R}^2$. Now, choose some $p$ and increase the radius of the corresponding circle around $\hat{\boldsymbol{x}}$ until it touches the set $\mathcal{D}$, which represents the submanifold of normal examples. Note that the touching points $\boldsymbol{y}_p^*$ correspond to the orthogonal projections of $\hat{\boldsymbol{x}}$ onto $\mathcal{D}$ with respect to the $l_p$-norm. For $p = 1$, there are two possible cases for the position of the touching point, independent of the shape of $\mathcal{D}$. In the first case, the $l_1$ unit circle touches $\mathcal{D}$ at one of its corners. Without loss of generality, assume that this corner is the point $(r, 0)$ for some $r > 0$. This implies that $\hat{x}_1 \neq (\boldsymbol{y}_1^*)_1$ and $\hat{x}_2 = (\boldsymbol{y}_1^*)_2$; that is, $S_1 = \{1\}$. However, since all norms coincide at the corners of the $l_1$ circle, it also holds that $1 \in S_p$ for $p \geq 2$. On the other hand, if the touching point lies on a side (or on an edge in the higher-dimensional case) of the $l_1$ circle, it implies $S_1 = \{1, 2\}$. Obviously, we also obtain $S_\infty = \{1, 2\}$. Consider now an orthogonal projection of $\mathcal{D}$ onto an $\mathbb{R}^2$ spanned by a pair of axes. Note that each $l_p$ circle for $2 \leq p < \infty$ has a nonzero curvature toward the interior of the circle at such a touching point; see Figure 9 for an illustration. This implies that the touching point with respect to the $l_p$-norm must lie within the right triangle formed by the touching points $\boldsymbol{y}_1^*$, $\boldsymbol{y}_\infty^*$ and the lines through these points parallel to the two axes. In particular, for all $p < q$, the touching point $\boldsymbol{y}_q^*$ lies in a smaller triangle formed by the points $\boldsymbol{y}_p^*$, $\boldsymbol{y}_\infty^*$ and the lines through these points parallel to the coordinate axes. For the plot in Figure 9, this means that $\boldsymbol{y}_q^*$ lies on the red curve between the points $\boldsymbol{y}_p^*$ and $\boldsymbol{y}_\infty^*$, implying $S_q = \{1, 2\}$. Since this argument is independent of our choice of the axes, it proves statement (a).
In order to prove statement (b), we provide a counterexample to the equality. Consider a Markov chain describing a simple random walk on the one-dimensional grid of integers, illustrated in Figure A1. Precisely, we consider a sequence of i.i.d. variables $(\xi_i)_{i \in \mathbb{N}}$ with values in $\{-1, 1\}$ and define a Markov chain $(x_n)_{n \in \mathbb{N}}$ according to $x_n := \sum_{i=1}^{n} \xi_i$ with $x_0 := 0$. Finally, we define the set of normal examples $\mathcal{D} \subseteq \{ (x_{n+1}, \ldots, x_{n+5}) \in \mathbb{Z}^5 \,|\, n \in \mathbb{N} \}$ as the set of all feasible configurations of subsequences of length 5. We can see that there is a mismatch between $S(\hat{\boldsymbol{x}}, \boldsymbol{x})$ and $S(\hat{\boldsymbol{x}}, f(\hat{\boldsymbol{x}}))$. This provides a counterexample to the claim that the orthogonal projection maximally preserves uncorrupted regions.
Figure A1. Illustration of a counterexample to the claim that orthogonal projections maximally preserve normal regions in the inputs. Here, $\hat{\boldsymbol{x}} \in \mathbb{Z}^5$ is the modified version of the original input $\boldsymbol{x} \in \mathcal{D}$ according to the partition $S, \bar{S}$, and $f(\hat{\boldsymbol{x}})$ denotes the orthogonal projection of $\hat{\boldsymbol{x}}$ onto $\mathcal{D}$ with respect to the $l_2$-norm. This example also shows that the orthogonality property depends on the choice of the distance metric.
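The counterexample can also be reproduced numerically; the sketch below (our own illustration with concrete numbers, not the exact values from Figure A1) enumerates the feasible windows and computes the $l_2$ projection by brute force:

```python
import numpy as np
from itertools import product

# Feasible set D: all length-5 windows of a +/-1-step random walk, i.e.,
# integer vectors whose consecutive differences are +/-1. We enumerate a
# sufficiently large range of starting values.
D = np.array([np.concatenate(([s], s + np.cumsum(steps)))
              for s in range(-6, 7)
              for steps in product((-1, 1), repeat=4)])

x = np.array([0, 1, 2, 3, 4])       # an uncorrupted window, x in D
x_hat = np.array([5, 1, 2, 3, 4])   # corrupt only position 1, so S = {1}

# l2 orthogonal projection: the nearest feasible point.
f_x_hat = D[np.argmin(((D - x_hat) ** 2).sum(axis=1))]
print("f(x_hat) =", f_x_hat)  # (4, 3, 2, 3, 4): position 2 changes as well
print("S(x_hat, x)        =", set(np.flatnonzero(x_hat != x) + 1))        # {1}
print("S(x_hat, f(x_hat)) =", set(np.flatnonzero(x_hat != f_x_hat) + 1))  # {1, 2}
```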

Appendix H. Proof of Proposition 3

In the following, we naturally generalize some properties of Markov chains to the two-dimensional case of MRFs, where the nodes are organized in a grid-like structure similar to that of the Ising model. Here, we omit a formal definition and provide only an intuition, which is sufficient for our purposes. Informally, we introduce the notion of the order of an MRF by relating the definition for Markov chains to the width of the corresponding Markov blanket. Furthermore, we generalize the notion of irreducible Markov chains to irreducible MRFs by assuming that each state is reachable from any other state along the chain subgraphs of the MRF, where each node $x_j$ on the chain is farther (with respect to the topology of the MRF graph $G$) from the start node $x_1$ than all of the previous nodes $x_i$ according to the metric $d_G(\cdot, \cdot)$; that is, $i < j \Rightarrow d_G(x_i, x_1) < d_G(x_j, x_1)$. Note that the notion of a state at position $i$ on the path now also involves the values of the variables in the neighborhood (corresponding to the Markov blanket) of $x_i$ in the MRF. Essentially, this property is covered by our assumption of identically distributed variables with vanishing covariance. Based on this informal extension, we introduce an auxiliary concept that we refer to as the transition set. Consider first a Markov chain of order $K \in \mathbb{N}$ with a finite set of possible states $x_i \in I$, $|I| < \infty$. Then, the maximal number of steps required to reach a state $j \in I$ from a state $i \in I$ is upper-bounded by $|I|^K$, that is, by a constant independent of the graph size of the MRF. Namely, when traveling from one node to another, we see at most $|I|^K$ combinations of states $(x_{n-K}, \ldots, x_{n-1}, x_n)$ with $P(x_n \,|\, x_{n-K}, \ldots, x_{n-1}) > 0$, which locally affect our path, and the shortest path contains no redundant configurations $(x_{n-K}, \ldots, x_{n-1})$. Therefore, given two patterns $(x_{m+1}, \ldots, x_{m+r-1})$ and $(x_{m+r}, \ldots, x_{m+n})$, we can always find (for sufficiently large numbers $r, n \in \mathbb{N}$) a sequence $B = (x_{m+r-b}, \ldots, x_{m+r-1})$, $b \leq |I|^K$, such that $P(x_{m+1}, \ldots, x_{m+r-b-1}, x_{m+r-b}, \ldots, x_{m+r-1}, x_{m+r}, \ldots, x_{m+n}) > 0$. This example can be generalized to the MRF case by considering transitions between the individual nodes that respect the values of the surrounding variables in their neighborhoods (or Markov blankets). Basically, this corresponds to considering Markov chains with an increased set of states, upper-bounded by $|I|^{K^2}$. We refer to the set of pixels $B$ as a transition set. See Figure A2 for an illustration.
Figure A2. Illustration of the concept of a transition set in two examples with different shapes. Each of the two images represents an MRF $\boldsymbol{x} = (x_1, \ldots, x_n)$, $n \in \mathbb{N}$, of Markov order $K \in \mathbb{N}$, with nodes corresponding to the individual pixels with values from a finite set of states $x_i \in I$, $|I| < \infty$. The grey area marks the corrupted region $S \subseteq \{1, \ldots, n\}$, where the union of the dark blue and light blue areas is the complement $\bar{S} := \{1, \ldots, n\} \setminus S$ marking the normal region. The dark blue part of $\bar{S}$ corresponds to the transition set $B \subseteq \bar{S}$. $W \leq |I|^K$ denotes (loosely) the width at the thickest part of the tube $B$ around $S$.
Now, we can prove the statement in Proposition 3. In the following, we use the notation $\boldsymbol{y}^* := f(\hat{\boldsymbol{x}})$. Consider now $\boldsymbol{y} := (\boldsymbol{y}_S^*, \boldsymbol{y}_B, \boldsymbol{x}_{\bar{S} \setminus B})$ for some feasible $\boldsymbol{y}_B$. Provided $\min\{|S|, |\bar{S}|\}$ is sufficiently greater than $K$, we can always find a feasible transition set $B$ such that $\boldsymbol{y} \in \mathcal{D}$. It follows that
$$\begin{aligned}
\lVert \hat{\boldsymbol{x}} - \boldsymbol{y}^* \rVert_p \leq \lVert \hat{\boldsymbol{x}} - \boldsymbol{y} \rVert_p \;&\overset{\text{(A1)}}{\Longleftrightarrow}\; \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{y}_S^* \rVert_p^p + \lVert \hat{\boldsymbol{x}}_{\bar{S}} - \boldsymbol{y}_{\bar{S}}^* \rVert_p^p \leq \lVert \hat{\boldsymbol{x}}_S - \boldsymbol{y}_S^* \rVert_p^p + \lVert \hat{\boldsymbol{x}}_B - \boldsymbol{y}_B \rVert_p^p + \underbrace{\lVert \hat{\boldsymbol{x}}_{\bar{S} \setminus B} - \boldsymbol{x}_{\bar{S} \setminus B} \rVert_p^p}_{=\, 0} \\
&\;\Longleftrightarrow\; \lVert \hat{\boldsymbol{x}}_{\bar{S}} - \boldsymbol{y}_{\bar{S}}^* \rVert_p^p \leq \lVert \hat{\boldsymbol{x}}_B - \boldsymbol{y}_B \rVert_p^p \leq |B| \;\Longrightarrow\; \lVert \boldsymbol{x}_{\bar{S}} - f_{\bar{S}}(\hat{\boldsymbol{x}}) \rVert_p \leq |B|^{\frac{1}{p}},
\end{aligned}$$
where the initial inequality holds because $\boldsymbol{y}^*$ is the orthogonal projection of $\hat{\boldsymbol{x}}$ onto $\mathcal{D}$ and, in the penultimate step, we use $\hat{x}_i, y_i \in [0, 1]$.
Now, we show that $|B| \in O(\sqrt{|S|})$. Without loss of generality, we assume that $S$ and $B$ have a square shape with equal-length sides, as illustrated in the right part of Figure A2. That is, each side of the grey square has length $\sqrt{|S|}$. The area of the region corresponding to $B$ is then given by $|B| = 4\sqrt{|S|}\,W + 4W^2$, where $W$ is a graph-independent constant. Furthermore, by plugging this expression into the inequality $|B| < |S|$ and solving the resulting quadratic inequality, we obtain the criterion $\min\{|S|, |\bar{S}|\} > (2 + \sqrt{8})^2\, W^2$, which guarantees the existence of a feasible transition set $B$.
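For completeness, the quadratic step can be spelled out as follows (our own intermediate computation, substituting $t := \sqrt{|S|}$ and taking the positive root):
$$|B| < |S| \;\Longleftrightarrow\; t^2 - 4Wt - 4W^2 > 0 \;\Longleftrightarrow\; t > \frac{4W + \sqrt{(4W)^2 + 16W^2}}{2} = \big(2 + \sqrt{8}\big)\, W \;\Longleftrightarrow\; |S| > \big(2 + \sqrt{8}\big)^2 W^2.$$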

Appendix I. Ablation Study on the Impact of Architectural Components

To highlight the benefits of our model architecture, we performed an additional ablation study on several categories of the MVTec AD dataset. In this study, we compared different architectures, including U-Net, SDC-AE, SDC-AE with skip connections, and SDC-CCM. For consistency, we used the same hyperparameters $k = 15$, $n = 3$ in Equation (3) when evaluating performance, except for the transistor category, where $n = 15$. The corresponding results are presented in the tables below. Note that U-Net-small refers to a simplified version of U-Net [98], where the middle layers with the smallest resolution were removed to roughly match the number of layers in SDC-AE and SDC-CCM. This modification resulted in a significant performance boost, suggesting that the original U-Net may be too deep for this task. The evaluation results underscore the importance of utilizing fine-grained information from the earlier layers of the network, either through skip connections or the proposed CCM module.
Table A1. Experimental results for anomaly segmentation measured with pixel-level AUROC on the MVTec AD dataset.
Category | U-Net | U-Net-Small | SDC-AE | SDC-AE (+Skip) | SDC-CCM
carpet | 65.7 ± 0.81 | 99.4 ± 0.02 | 99.6 ± 0.02 | 99.4 ± 0.03 | 99.4 ± 0.09
grid | 70.3 ± 1.94 | 99.6 ± 0.00 | 99.6 ± 0.00 | 99.6 ± 0.00 | 99.6 ± 0.00
leather | 80.3 ± 0.80 | 99.6 ± 0.03 | 99.4 ± 0.08 | 99.1 ± 0.36 | 99.4 ± 0.07
tile | 58.6 ± 1.16 | 98.4 ± 0.10 | 97.9 ± 0.00 | 98.5 ± 0.08 | 98.4 ± 0.25
cable | 89.4 ± 0.10 | 96.8 ± 0.09 | 94.5 ± 0.05 | 97.7 ± 0.26 | 98.1 ± 0.16
transistor | 87.1 ± 0.49 | 88.5 ± 0.20 | 89.2 ± 0.17 | 91.0 ± 0.27 | 91.3 ± 0.84
avg. all | 75.2 ± 11.25 | 97.05 ± 0.07 | 96.7 ± 3.79 | 97.6 ± 2.99 | 97.7 ± 2.98
Table A2. Experimental results for anomaly recognition measured with image-level AUROC on the MVTec AD dataset.
Category | U-Net | U-Net-Small | SDC-AE | SDC-AE (+Skip) | SDC-CCM
carpet | 39.5 ± 0.58 | 99.2 ± 0.24 | 99.3 ± 0.04 | 99.4 ± 0.00 | 99.6 ± 0.25
grid | 84.2 ± 0.13 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00
leather | 77.4 ± 3.45 | 98.0 ± 1.24 | 99.0 ± 0.19 | 97.9 ± 0.82 | 99.4 ± 0.15
tile | 81.6 ± 0.78 | 98.9 ± 0.18 | 98.5 ± 0.05 | 99.6 ± 0.41 | 99.6 ± 0.25
cable | 57.0 ± 0.66 | 93.6 ± 0.23 | 67.4 ± 1.64 | 94.2 ± 0.03 | 96.2 ± 0.16
transistor | 64.5 ± 1.94 | 97.6 ± 0.02 | 81.2 ± 1.54 | 97.2 ± 0.19 | 96.6 ± 0.38
avg. all | 67.4 ± 15.70 | 97.9 ± 0.32 | 90.9 ± 12.4 | 98.1 ± 1.98 | 98.6 ± 1.55
Finally, we note that dilated convolutions provide access to more context within the image, which in some cases improves the reconstruction quality. In particular, models without dilated convolutions (without the SDC modules) suffer from the blind-spot effect illustrated in Figure A3. This effect can occur in larger anomalous regions due to the insufficient context provided by the normal areas, ultimately resulting in predictions that average over all possibilities.
Figure A3. Illustration of the importance of modeling long-range dependencies facilitated by dilated convolutions for achieving accurate reconstruction. We can observe how the reconstruction of the model without the SDC modules (middle image) suffers from a blind spot effect toward the center of the corrupted region. This happens due to the insufficient context provided by the normal areas, forcing the model to predict an average of all possibilities.

Appendix J. Illustration of Qualitative Improvement in Reconstruction When Using SDC-CCM over SDC-AE

As outlined in the main body of this paper, in order to achieve high detection and localization performance based on the reconstruction error, the model must replace corrupted regions in the input images with different content. At the same time, it is important to reproduce uncorrupted regions as accurately as possible to reduce the chance of false positive detections. In this sense, the overall reconstruction quality of the original content during training correlates strongly with the ability of the corresponding model to detect anomalous samples. We compared the reconstruction performance of the two model architectures, which we refer to as SDC-AE and SDC-CCM in this paper. During our empirical evaluation, we observed that while SDC-AE generally achieved good reconstructions, it could produce a higher number of false positive detections compared to SDC-CCM. For example, it struggled to reproduce normal regions characterized by frequent changes of the gradient along neighboring pixel values in two object categories, “cable” and “transistor”, from the MVTec AD dataset. In contrast, SDC-CCM accurately reconstructs these regions, resulting in a significant reduction in false positive detections. We show several examples in Figure A4. Similarly, we observed an improvement in reconstruction quality for the texture categories; however, the corresponding heatmaps show similar quality. We provide a few examples in Figure A5.
Figure A4. Illustration of the qualitative improvement when using SDC-CCM over SDC-AE. We show six examples: three from the "cable" category and three from the "transistor" category of the MVTec AD dataset. Each row displays the original image, the reconstruction produced by SDC-CCM (reconstruction II), the reconstruction produced by SDC-AE (reconstruction I), the anomaly heatmap from SDC-CCM (anomaly heatmap II), and the anomaly heatmap from SDC-AE (anomaly heatmap I). Note the significant improvement in the quality of the heatmaps.
Figure A5. Illustration of the qualitative improvement when using SDC-CCM over SDC-AE on texture categories from the MVTec AD dataset. We show five examples, one from each of the following categories: “carpet”, “grid”, “leather”, “tile”, and “wood”. Each row displays the original image, the reconstruction produced by SDC-CCM (reconstruction II), the reconstruction produced by SDC-AE (reconstruction I), the anomaly heatmap from SDC-CCM (anomaly heatmap II), and the anomaly heatmap from SDC-AE (anomaly heatmap I).

References

  1. Haselmann, M.; Gruber, D.P.; Tabatabai, P. Anomaly Detection Using Deep Learning Based Image Completion. In Proceedings of the 17th IEEE International Conference on Machine Learning and Applications, ICMLA 2018, Orlando, FL, USA, 17–20 December 2018; Wani, M.A., Kantardzic, M.M., Mouchaweh, M.S., Gama, J., Lughofer, E., Eds.; IEEE: Piscataway, NJ, USA, 2018; pp. 1237–1242. [Google Scholar]
  2. Bergmann, P.; Löwe, S.; Fauser, M.; Sattlegger, D.; Steger, C. Improving Unsupervised Defect Segmentation by Applying Structural Similarity to Autoencoders. In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2019, Volume 5: VISAPP, Prague, Czech Republic, 25–27 February 2019; Trémeau, A., Farinella, G.M., Braz, J., Eds.; SciTePress: Setúbal Municipality, Portugal, 2019; pp. 372–380. [Google Scholar]
  3. Wang, L.; Zhang, D.; Guo, J.; Han, Y. Image Anomaly Detection Using Normal Data only by Latent Space Resampling. Appl. Sci. 2020, 10, 8660. [Google Scholar] [CrossRef]
  4. Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. Uninformed Students: Student-Teacher Anomaly Detection with Discriminative Latent Embeddings. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; pp. 4182–4191. [Google Scholar]
  5. Venkataramanan, S.; Peng, K.; Singh, R.V.; Mahalanobis, A. Attention Guided Anomaly Localization in Images. In Proceedings of the Computer Vision—ECCV 2020—16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVII; Lecture Notes in Computer Science. Vedaldi, A., Bischof, H., Brox, T., Frahm, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12362, pp. 485–503. [Google Scholar]
  6. Schlegl, T.; Seeböck, P.; Waldstein, S.M.; Schmidt-Erfurth, U.; Langs, G. Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery. In Proceedings of the Information Processing in Medical Imaging—25th International Conference, IPMI 2017, Boone, NC, USA, 25–30 June 2017; Proceedings; Lecture Notes in Computer Science. Niethammer, M., Styner, M., Aylward, S.R., Zhu, H., Oguz, I., Yap, P., Shen, D., Eds.; Springer: Berlin/Heidelberg, Germany, 2017; Volume 10265, pp. 146–157. [Google Scholar]
  7. Napoletano, P.; Piccoli, F.; Schettini, R. Anomaly Detection in Nanofibrous Materials by CNN-Based Self-Similarity. Sensors 2018, 18, 209. [Google Scholar] [CrossRef]
  8. Böttger, T.; Ulrich, M. Real-time texture error detection on textured surfaces with compressed sensing. Pattern Recognit. Image Anal. 2016, 26, 88–94. [Google Scholar] [CrossRef]
  9. Liu, W.; Li, R.; Zheng, M.; Karanam, S.; Wu, Z.; Bhanu, B.; Radke, R.J.; Camps, O.I. Towards Visually Explaining Variational Autoencoders. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; pp. 8639–8648. [Google Scholar]
  10. Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; Gehler, P.V. Towards Total Recall in Industrial Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 14298–14308. [Google Scholar]
  11. Wan, Q.; Gao, L.; Li, X.; Wen, L. Industrial Image Anomaly Localization Based on Gaussian Clustering of Pretrained Feature. IEEE Trans. Ind. Electron. 2022, 69, 6182–6192. [Google Scholar] [CrossRef]
  12. Chen, X.; Konukoglu, E. Unsupervised Detection of Lesions in Brain MRI using constrained adversarial auto-encoders. arXiv 2018, arXiv:1806.04972. [Google Scholar]
  13. Schlegl, T.; Seeböck, P.; Waldstein, S.M.; Langs, G.; Schmidt-Erfurth, U. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Med. Image Anal. 2019, 54, 30–44. [Google Scholar] [CrossRef]
  14. Tan, J.; Hou, B.; Day, T.; Simpson, J.M.; Rueckert, D.; Kainz, B. Detecting Outliers with Poisson Image Interpolation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2021—24th International Conference, Strasbourg, France, 27 September–1 October 2021; Proceedings, Part V; Lecture Notes in Computer Science. de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C., Eds.; Springer: Berlin/Heidelberg, Germany, 2021; Volume 12905, pp. 581–591. [Google Scholar]
  15. Zimmerer, D.; Isensee, F.; Petersen, J.; Kohl, S.; Maier-Hein, K.H. Unsupervised Anomaly Localization Using Variational Auto-Encoders. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2019—22nd International Conference, Shenzhen, China, 13–17 October 2019; Proceedings, Part IV; Lecture Notes in Computer Science. Shen, D., Liu, T., Peters, T.M., Staib, L.H., Essert, C., Zhou, S., Yap, P., Khan, A.R., Eds.; Springer: Berlin/Heidelberg, Germany, 2019; Volume 11767, pp. 289–297. [Google Scholar]
  16. Abati, D.; Porrello, A.; Calderara, S.; Cucchiara, R. Latent Space Autoregression for Novelty Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 481–490. [Google Scholar]
  17. Defard, T.; Setkov, A.; Loesch, A.; Audigier, R. PaDiM: A Patch Distribution Modeling Framework for Anomaly Detection and Localization. In Proceedings of the Pattern Recognition. ICPR International Workshops and Challenges, Virtual Event, 10–15 January 2021; Proceedings, Part IV; Lecture Notes in Computer Science. Bimbo, A.D., Cucchiara, R., Sclaroff, S., Farinella, G.M., Mei, T., Bertini, M., Escalante, H.J., Vezzani, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12664, pp. 475–489. [Google Scholar]
  18. Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933, 24, 417–441. [Google Scholar] [CrossRef]
  19. Schölkopf, B.; Smola, A.J.; Müller, K. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Comput. 1998, 10, 1299–1319. [Google Scholar] [CrossRef]
  20. Hoffmann, H. Kernel PCA for novelty detection. Pattern Recognit. 2007, 40, 863–874. [Google Scholar] [CrossRef]
  21. Schölkopf, B.; Platt, J.C.; Shawe-Taylor, J.; Smola, A.J.; Williamson, R.C. Estimating the Support of a High-Dimensional Distribution. Neural Comput. 2001, 13, 1443–1471. [Google Scholar] [CrossRef] [PubMed]
  22. Tax, D.M.J.; Duin, R.P.W. Support Vector Data Description. Mach. Learn. 2004, 54, 45–66. [Google Scholar] [CrossRef]
  23. Knorr, E.M.; Ng, R.T.; Tucakov, V. Distance-Based Outliers: Algorithms and Applications. VLDB J. 2000, 8, 237–253. [Google Scholar] [CrossRef]
  24. Ramaswamy, S.; Rastogi, R.; Shim, K. Efficient Algorithms for Mining Outliers from Large Data Sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 16–18 May 2000; Chen, W., Naughton, J.F., Bernstein, P.A., Eds.; ACM: New York, NY, USA, 2000; pp. 427–438. [Google Scholar]
25. Parzen, E. On estimation of a probability density function and mode. Ann. Math. Stat. 1962, 33, 1065–1076. [Google Scholar] [CrossRef]
  26. Principi, E.; Vesperini, F.; Squartini, S.; Piazza, F. Acoustic novelty detection with adversarial autoencoders. In Proceedings of the 2017 International Joint Conference on Neural Networks, IJCNN 2017, Anchorage, AK, USA, 14–19 May 2017; pp. 3324–3330. [Google Scholar]
  27. Chalapathy, R.; Menon, A.K.; Chawla, S. Robust, Deep and Inductive Anomaly Detection. In Proceedings of the Machine Learning and Knowledge Discovery in Databases—European Conference, ECML PKDD 2017, Skopje, Macedonia, 18–22 September 2017; Proceedings, Part I; Lecture Notes in Computer Science. Ceci, M., Hollmén, J., Todorovski, L., Vens, C., Dzeroski, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2017; Volume 10534, pp. 36–51. [Google Scholar]
  28. Kieu, T.; Yang, B.; Guo, C.; Jensen, C.S. Outlier Detection for Time Series with Recurrent Autoencoder Ensembles. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, 10–16 August 2019; Kraus, S., Ed.; 2019; pp. 2725–2732. [Google Scholar]
  29. Zhou, C.; Paffenroth, R.C. Anomaly Detection with Robust Deep Autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; ACM: New York, NY, USA, 2017; pp. 665–674. [Google Scholar]
  30. Zong, B.; Song, Q.; Min, M.R.; Cheng, W.; Lumezanu, C.; Cho, D.; Chen, H. Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  31. Kim, K.H.; Shim, S.; Lim, Y.; Jeon, J.; Choi, J.; Kim, B.; Yoon, A.S. RaPP: Novelty Detection with Reconstruction along Projection Pathway. In Proceedings of the 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  32. Deng, J.; Zhang, Z.; Marchi, E.; Schuller, B.W. Sparse Autoencoder-Based Feature Transfer Learning for Speech Emotion Recognition. In Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, ACII 2013, Geneva, Switzerland, 2–5 September 2013; pp. 511–516. [Google Scholar]
  33. Erfani, S.M.; Rajasegarar, S.; Karunasekera, S.; Leckie, C. High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recognit. 2016, 58, 121–134. [Google Scholar] [CrossRef]
  34. Kim, M.; Kim, J.; Yu, J.; Choi, J.K. Active anomaly detection based on deep one-class classification. Pattern Recognit. Lett. 2023, 167, 18–24. [Google Scholar] [CrossRef]
  35. Ruff, L.; Görnitz, N.; Deecke, L.; Siddiqui, S.A.; Vandermeulen, R.A.; Binder, A.; Müller, E.; Kloft, M. Deep One-Class Classification. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, 10–15 July 2018; Proceedings of Machine Learning Research. Dy, J.G., Krause, A., Eds.; 2018; Volume 80, pp. 4390–4399. [Google Scholar]
  36. Golan, I.; El-Yaniv, R. Deep Anomaly Detection Using Geometric Transformations. In Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, Montréal, QC, Canada, 3–8 December 2018; Bengio, S., Wallach, H.M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; pp. 9781–9791. [Google Scholar]
37. Tack, J.; Mo, S.; Jeong, J.; Shin, J. CSI: Novelty Detection via Contrastive Learning on Distributionally Shifted Instances. In Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Virtual, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds. [Google Scholar]
  38. Zavrtanik, V.; Kristan, M.; Skocaj, D. DRÆM—A discriminatively trained reconstruction embedding for surface anomaly detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 8310–8319. [Google Scholar]
  39. Zavrtanik, V.; Kristan, M.; Skocaj, D. Reconstruction by inpainting for visual anomaly detection. Pattern Recognit. 2021, 112, 107706. [Google Scholar] [CrossRef]
  40. Li, C.; Sohn, K.; Yoon, J.; Pfister, T. CutPaste: Self-Supervised Learning for Anomaly Detection and Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, Virtual, 19–25 June 2021; pp. 9664–9674. [Google Scholar]
  41. Pirnay, J.; Chai, K. Inpainting Transformer for Anomaly Detection. In Proceedings of the Image Analysis and Processing—ICIAP 2022—21st International Conference, Lecce, Italy, 23–27 May 2022; Proceedings, Part II; Lecture Notes in Computer Science. Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2022; Volume 13232, pp. 394–406. [Google Scholar]
  42. Lee, S.; Lee, S.; Song, B.C. CFA: Coupled-Hypersphere-Based Feature Adaptation for Target-Oriented Anomaly Localization. IEEE Access 2022, 10, 78446–78454. [Google Scholar] [CrossRef]
  43. Kim, D.; Park, C.; Cho, S.; Lee, S. FAPM: Fast Adaptive Patch Memory for Real-time Industrial Anomaly Detection. arXiv 2022, arXiv:2211.07381. [Google Scholar]
  44. Bae, J.; Lee, J.; Kim, S. Image Anomaly Detection and Localization with Position and Neighborhood Information. arXiv 2022, arXiv:2211.12634. [Google Scholar]
  45. Tsai, C.; Wu, T.; Lai, S. Multi-Scale Patch-Based Representation Learning for Image Anomaly Detection and Segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, 3–8 January 2022; pp. 3065–3073. [Google Scholar]
  46. Zou, Y.; Jeong, J.; Pemula, L.; Zhang, D.; Dabeer, O. SPot-the-Difference Self-supervised Pre-training for Anomaly Detection and Segmentation. In Proceedings of the Computer Vision—ECCV 2022—17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXX; Lecture Notes in Computer Science. Avidan, S., Brostow, G.J., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Berlin/Heidelberg, Germany, 2022; Volume 13690, pp. 392–408. [Google Scholar]
  47. Li, N.; Jiang, K.; Ma, Z.; Wei, X.; Hong, X.; Gong, Y. Anomaly Detection Via Self-Organizing Map. In Proceedings of the 2021 IEEE International Conference on Image Processing, ICIP 2021, Anchorage, AK, USA, 19–22 September 2021; pp. 974–978. [Google Scholar]
  48. Salehi, M.; Sadjadi, N.; Baselizadeh, S.; Rohban, M.H.; Rabiee, H.R. Multiresolution Knowledge Distillation for Anomaly Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, Virtual, 19–25 June 2021; pp. 14902–14912. [Google Scholar]
  49. Deng, H.; Li, X. Anomaly Detection via Reverse Distillation from One-Class Embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 9727–9736. [Google Scholar]
  50. Rudolph, M.; Wehrbein, T.; Rosenhahn, B.; Wandt, B. Asymmetric Student-Teacher Networks for Industrial Anomaly Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2023, Waikoloa, HI, USA, 2–7 January 2023; pp. 2591–2601. [Google Scholar]
  51. Cao, Y.; Wan, Q.; Shen, W.; Gao, L. Informative knowledge distillation for image anomaly segmentation. Knowl. Based Syst. 2022, 248, 108846. [Google Scholar] [CrossRef]
  52. Zhang, K.; Wang, B.; Kuo, C.J. PEDENet: Image anomaly localization via patch embedding and density estimation. Pattern Recognit. Lett. 2022, 153, 144–150. [Google Scholar] [CrossRef]
  53. Wan, Q.; Gao, L.; Li, X.; Wen, L. Unsupervised Image Anomaly Detection and Segmentation Based on Pretrained Feature Mapping. IEEE Trans. Ind. Inform. 2023, 19, 2330–2339. [Google Scholar] [CrossRef]
  54. Wan, Q.; Cao, Y.; Gao, L.; Shen, W.; Li, X. Position Encoding Enhanced Feature Mapping for Image Anomaly Detection. In Proceedings of the 18th IEEE International Conference on Automation Science and Engineering, CASE 2022, Mexico City, Mexico, 20–24 August 2022; pp. 876–881. [Google Scholar]
  55. Zheng, Y.; Wang, X.; Deng, R.; Bao, T.; Zhao, R.; Wu, L. Focus Your Distribution: Coarse-to-Fine Non-Contrastive Learning for Anomaly Detection and Localization. In Proceedings of the IEEE International Conference on Multimedia and Expo, ICME 2022, Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
  56. Gudovskiy, D.A.; Ishizaka, S.; Kozuka, K. CFLOW-AD: Real-Time Unsupervised Anomaly Detection with Localization via Conditional Normalizing Flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, 3–8 January 2022; pp. 1819–1828. [Google Scholar]
  57. Kim, Y.; Jang, H.; Lee, D.; Choi, H. AltUB: Alternating Training Method to Update Base Distribution of Normalizing Flow for Anomaly Detection. arXiv 2022, arXiv:2210.14913. [Google Scholar]
  58. Yi, J.; Yoon, S. Patch SVDD: Patch-Level SVDD for Anomaly Detection and Segmentation. In Proceedings of the Computer Vision—ACCV 2020—15th Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020; Revised Selected Papers, Part VI; Lecture Notes in Computer Science. Ishikawa, H., Liu, C., Pajdla, T., Shi, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12627, pp. 375–390. [Google Scholar]
  59. Hu, C.; Chen, K.; Shao, H. A Semantic-Enhanced Method Based On Deep SVDD for Pixel-Wise Anomaly Detection. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo, ICME 2021, Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
  60. Yang, M.; Wu, P.; Feng, H. MemSeg: A semi-supervised method for image surface defect detection using differences and commonalities. Eng. Appl. Artif. Intell. 2023, 119, 105835. [Google Scholar] [CrossRef]
  61. Yan, Y.; Wang, D.; Zhou, G.; Chen, Q. Unsupervised Anomaly Segmentation Via Multilevel Image Reconstruction and Adaptive Attention-Level Transition. IEEE Trans. Instrum. Meas. 2021, 70, 5015712. [Google Scholar] [CrossRef]
  62. Collin, A.; Vleeschouwer, C.D. Improved anomaly detection by training an autoencoder with skip connections on images corrupted with Stain-shaped noise. In Proceedings of the 25th International Conference on Pattern Recognition, ICPR 2020, Virtual Event/Milan, Italy, 10–15 January 2021; pp. 7915–7922. [Google Scholar]
  63. Tao, X.; Zhang, D.; Ma, W.; Hou, Z.; Lu, Z.; Adak, C. Unsupervised Anomaly Detection for Surface Defects with Dual-Siamese Network. IEEE Trans. Ind. Inform. 2022, 18, 7707–7717. [Google Scholar] [CrossRef]
  64. Liu, T.; Li, B.; Zhao, Z.; Du, X.; Jiang, B.; Geng, L. Reconstruction from edge image combined with color and gradient difference for industrial surface anomaly detection. arXiv 2022, arXiv:2210.14485. [Google Scholar]
  65. Kim, D.; Jeong, D.; Kim, H.; Chong, K.; Kim, S.; Cho, H. Spatial Contrastive Learning for Anomaly Detection and Localization. IEEE Access 2022, 10, 17366–17376. [Google Scholar] [CrossRef]
  66. Huang, C.; Xu, Q.; Wang, Y.; Wang, Y.; Zhang, Y. Self-Supervised Masking for Unsupervised Anomaly Detection and Localization. arXiv 2022, arXiv:2205.06568. [Google Scholar] [CrossRef]
  67. Liznerski, P.; Ruff, L.; Vandermeulen, R.A.; Franks, B.J.; Kloft, M.; Müller, K.R. Explainable Deep One-Class Classification. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, 3–7 May 2021. [Google Scholar]
  68. Bae, J.; Lee, J.; Kim, S. PNI: Industrial Anomaly Detection using Position and Neighborhood Information. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, 1–6 October 2023; pp. 6350–6360. [Google Scholar]
  69. Schlüter, H.M.; Tan, J.; Hou, B.; Kainz, B. Natural Synthetic Anomalies for Self-supervised Anomaly Detection and Localization. In Proceedings of the Computer Vision—ECCV 2022—17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXXI; Lecture Notes in Computer Science. Avidan, S., Brostow, G.J., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Berlin/Heidelberg, Germany, 2022; Volume 13691, pp. 474–489. [Google Scholar]
  70. Dehaene, D.; Frigo, O.; Combrexelle, S.; Eline, P. Iterative energy-based projection on a normal data manifold for anomaly localization. In Proceedings of the 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  71. Höner, J.; Nakajima, S.; Bauer, A.; Müller, K.R.; Görnitz, N. Minimizing Trust Leaks for Robust Sybil Detection. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017; Proceedings of Machine Learning Research. Precup, D., Teh, Y.W., Eds.; 2017; Volume 70, pp. 1520–1528. [Google Scholar]
  72. Song, J.W.; Kong, K.; Park, Y.I.; Kim, S.G.; Kang, S. AnoSeg: Anomaly Segmentation Network Using Self-Supervised Learning. arXiv 2021, arXiv:2110.03396. [Google Scholar]
  73. Kohlbrenner, M.; Bauer, A.; Nakajima, S.; Binder, A.; Samek, W.; Lapuschkin, S. Towards Best Practice in Explaining Neural Network Decisions with LRP. In Proceedings of the 2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, UK, 19–24 July 2020; pp. 1–7. [Google Scholar]
  74. Lee, Y.; Kang, P. AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-Based Encoder-Decoder. IEEE Access 2022, 10, 46717–46724. [Google Scholar] [CrossRef]
  75. Jiang, J.; Zhu, J.; Bilal, M.; Cui, Y.; Kumar, N.; Dou, R.; Su, F.; Xu, X. Masked Swin Transformer Unet for Industrial Anomaly Detection. IEEE Trans. Ind. Inform. 2023, 19, 2200–2209. [Google Scholar] [CrossRef]
  76. Wu, J.; Chen, D.; Fuh, C.; Liu, T. Learning Unsupervised Metaformer for Anomaly Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 4349–4358. [Google Scholar]
  77. Jiang, X.; Liu, J.; Wang, J.; Nie, Q.; Wu, K.; Liu, Y.; Wang, C.; Zheng, F. SoftPatch: Unsupervised Anomaly Detection with Noisy Data. In Proceedings of the NeurIPS, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  78. Ruff, L.; Kauffmann, J.R.; Vandermeulen, R.A.; Samek, W.; Kloft, M.; Dietterich, T.G.; Müller, K.R. A Unifying Review of Deep and Shallow Anomaly Detection. Proc. IEEE 2021, 109, 756–795. [Google Scholar] [CrossRef]
  79. Kauffmann, J.; Müller, K.R.; Montavon, G. Towards explaining anomalies: A deep Taylor decomposition of one-class models. Pattern Recognit. 2020, 101, 107198. [Google Scholar] [CrossRef]
  80. Chong, P.; Ruff, L.; Kloft, M.; Binder, A. Simple and Effective Prevention of Mode Collapse in Deep One-Class Classification. In Proceedings of the 2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, UK, 19–24 July 2020; pp. 1–9. [Google Scholar]
  81. Hu, C.; Feng, Y.; Kamigaito, H.; Takamura, H.; Okumura, M. One-class Text Classification with Multi-modal Deep Support Vector Data Description. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, 19–23 April 2021; Merlo, P., Tiedemann, J., Tsarfaty, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 3378–3390. [Google Scholar]
82. Ranzato, M.; Boureau, Y.-L.; LeCun, Y. Sparse Feature Learning for Deep Belief Networks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 3–6 December 2007; Platt, J., Koller, D., Singer, Y., Roweis, S., Eds.; Curran Associates, Inc.: New York, NY, USA, 2007; Volume 20. [Google Scholar]
  83. Le, Q.V.; Ngiam, J.; Coates, A.; Lahiri, A.; Prochnow, B.; Ng, A.Y. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning ICML 2011, Bellevue, WA, USA, 28 June–2 July 2011; Getoor, L., Scheffer, T., Eds.; pp. 265–272. [Google Scholar]
84. Rifai, S.; Vincent, P.; Muller, X.; Glorot, X.; Bengio, Y. Contractive Auto-Encoders: Explicit Invariance During Feature Extraction. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, WA, USA, 28 June–2 July 2011; Omnipress: Madison, WI, USA, 2011; pp. 833–840. [Google Scholar]
  85. Alain, G.; Bengio, Y. What regularized auto-encoders learn from the data-generating distribution. J. Mach. Learn. Res. 2014, 15, 3563–3593. [Google Scholar]
  86. Pathak, D.; Krähenbühl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context Encoders: Feature Learning by Inpainting. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544. [Google Scholar]
  87. Iizuka, S.; Simo-Serra, E.; Ishikawa, H. Globally and locally consistent image completion. ACM Trans. Graph. 2017, 36, 107:1–107:14. [Google Scholar] [CrossRef]
  88. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Generative Image Inpainting with Contextual Attention. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5505–5514. [Google Scholar]
  89. Liu, G.; Reda, F.A.; Shih, K.J.; Wang, T.; Tao, A.; Catanzaro, B. Image Inpainting for Irregular Holes Using Partial Convolutions. In Proceedings of the Computer Vision—ECCV 2018—15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part XI; Lecture Notes in Computer Science. Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11215, pp. 89–105. [Google Scholar]
  90. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Free-Form Image Inpainting with Gated Convolution. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4470–4479. [Google Scholar]
  91. Bhattad, A.; Rock, J.; Forsyth, D.A. Detecting Anomalous Faces with ‘No Peeking’ Autoencoders. arXiv 2018, arXiv:1802.05798. [Google Scholar]
92. Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML 2008, Helsinki, Finland, 5–9 June 2008; ACM International Conference Proceeding Series; Cohen, W.W., McCallum, A., Roweis, S.T., Eds.; ACM: New York, NY, USA, 2008; Volume 307, pp. 1096–1103. [Google Scholar]
  93. Kascenas, A.; Pugeault, N.; O’Neil, A.Q. Denoising Autoencoders for Unsupervised Anomaly Detection in Brain MRI. In Proceedings of the 5th International Conference on Medical Imaging with Deep Learning, PMLR, Zurich, Switzerland, 6–8 July 2022; Volume 172, pp. 653–664. [Google Scholar]
  94. Liu, Z.; Zhou, Y.; Xu, Y.; Wang, Z. SimpleNet: A Simple Network for Image Anomaly Detection and Localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 20402–20411. [Google Scholar]
  95. Bergmann, P.; Batzner, K.; Fauser, M.; Sattlegger, D.; Steger, C. The MVTec Anomaly Detection Dataset: A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection. Int. J. Comput. Vis. 2021, 129, 1038–1059. [Google Scholar] [CrossRef]
  96. Cimpoi, M.; Maji, S.; Kokkinos, I.; Mohamed, S.; Vedaldi, A. Describing Textures in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  97. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  98. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015—18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III; Lecture Notes in Computer Science. Navab, N., Hornegger, J., III, Wells, W.M., Frangi, A.F., Eds.; Springer: Berlin/Heidelberg, Germany, 2015; Volume 9351, pp. 234–241. [Google Scholar]
  99. Schuster, R.; Wasenmüller, O.; Unger, C.; Stricker, D. SDC—Stacked Dilated Convolution: A Unified Descriptor Network for Dense Matching Tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 2556–2565. [Google Scholar]
  100. Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques—Adaptive Computation and Machine Learning; The MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
  101. Wainwright, M.J.; Jordan, M.I. Graphical Models, Exponential Families, and Variational Inference. Found. Trends Mach. Learn. 2008, 1, 1–305. [Google Scholar] [CrossRef]
  102. Lafferty, J. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the ICML, Williamstown, MA, USA, 28 June–1 July 2001; Morgan Kaufmann: Burlington, MA, USA, 2001; pp. 282–289. [Google Scholar]
  103. Bauer, A.; Görnitz, N.; Biegler, F.; Müller, K.R.; Kloft, M. Efficient Algorithms for Exact Inference in Sequence Labeling SVMs. IEEE Trans. Neural Netw. Learn. Syst. 2014, 25, 870–881. [Google Scholar] [CrossRef] [PubMed]
  104. Bauer, A.; Braun, M.L.; Müller, K.R. Accurate Maximum-Margin Training for Parsing with Context-Free Grammars. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 44–56. [Google Scholar] [CrossRef]
  105. Bauer, A.; Nakajima, S.; Müller, K.R. Efficient Exact Inference with Loss Augmented Objective in Structured Learning. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 2566–2579. [Google Scholar] [CrossRef]
107. Bauer, A.; Nakajima, S.; Görnitz, N.; Müller, K.R. Partial Optimality of Dual Decomposition for MAP Inference in Pairwise MRFs. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, Naha, Japan, 16–18 April 2019; Proceedings of Machine Learning Research; Chaudhuri, K., Sugiyama, M., Eds.; Volume 89, pp. 1696–1703. [Google Scholar]
  107. Bauer, A.; Nakajima, S.; Görnitz, N.; Müller, K.R. Partial Optimality of Dual Decomposition for MAP Inference in Pairwise MRFs. In Proceedings of the Machine Learning Research, Naha, Japan, 16–18 April 2019; Proceedings of Machine Learning Research. Chaudhuri, K., Sugiyama, M., Eds.; Volume 89, pp. 1696–1703. [Google Scholar]
  108. Bauer, A.; Nakajima, S.; Müller, K.R. Polynomial-Time Constrained Message Passing for Exact MAP Inference on Discrete Models with Global Dependencies. Mathematics 2023, 11, 2628. [Google Scholar] [CrossRef]
  109. Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD—A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 9592–9600. [Google Scholar]
  110. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015; Bengio, Y., LeCun, Y., Eds.; Conference Track Proceedings. [Google Scholar]
  111. Zhou, Y.; Xu, X.; Song, J.; Shen, F.; Shen, H.T. MSFlow: Multi-Scale Flow-based Framework for Unsupervised Anomaly Detection. arXiv 2023, arXiv:2308.15300. [Google Scholar]
Figure 1. Anomaly detection results of our approach on a few images from the MVTec AD dataset. The first row shows the input images and the second row an overlay with the predicted anomaly heatmap.
Figure 2. Illustration of the reconstruction effect of our model trained either on the wood, carpet, or grid images (without defects) from the MVTec AD dataset.
Figure 3. Illustration of our anomaly detection process after training. Given an input $\hat{x}$, we first (see (1)) compute an output $f_{\theta}(\hat{x})$ by replicating normal regions and replacing irregularities with locally consistent patterns. Then (see (2)) we compute the pixel-wise squared difference $(\hat{x} - f_{\theta}(\hat{x}))^2$, which is subsequently averaged over the color channels to produce the difference map $\mathrm{Diff}[\hat{x}, f_{\theta}(\hat{x})] \in \mathbb{R}^{h \times w}$. In the last step (see (3)), we apply a series of averaging convolutions $G_k$ to the difference map to produce our final anomaly heatmap $\mathrm{anomap}^{n,k}_{f_{\theta}}(\hat{x})$.
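For concreteness, this post-training scoring step can be written in a few lines of NumPy/SciPy. The sketch below is illustrative rather than the exact implementation: the number of smoothing passes n and the box-filter size k are placeholder values, and the averaging convolutions $G_k$ are realized with a uniform (box) filter.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def anomaly_heatmap(x, recon, n=4, k=16):
    """Sketch of the Figure 3 pipeline: pixel-wise squared difference,
    averaged over color channels, then n passes of a k x k box filter
    (the averaging convolutions G_k). n and k are illustrative values."""
    diff = ((x - recon) ** 2).mean(axis=-1)  # Diff[x_hat, f(x_hat)], shape (h, w)
    heat = diff
    for _ in range(n):
        heat = uniform_filter(heat, size=k, mode="reflect")
    return heat
```

The resulting heatmap can then be thresholded for segmentation or, for instance, reduced to its maximum to obtain an image-level anomaly score.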
Figure 4. Illustration of data generation for training. After randomly choosing the number and locations of the patches to be modified, we create new content by gluing the extracted patches together with the corresponding replacements. Given a real-valued mask $M \in [0,1]^{\tilde{h} \times \tilde{w} \times 3}$ marking corrupted regions within a patch, an original image patch $x$, and a corresponding replacement $y$, we create the corrupted patch by merging the two patches according to the formula $\hat{x} := M \odot y + \bar{M} \odot x$, where $\bar{M} := 1 - M$. All mask shapes $M$ are created by applying Gaussian distortion to the same (static) mask, representing a filled disk at the center of the patch with a smoothly fading boundary toward the exterior of the disk.
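A minimal sketch of this merging step, assuming NumPy arrays, is given below. The disk radius and fading width are placeholder parameters, and the random Gaussian distortion of the mask shape is omitted for brevity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def base_disk_mask(h, w, radius_frac=0.3, fade=5.0):
    """Static mask of Figure 4: a filled disk at the patch center with a
    smoothly fading boundary (via Gaussian blur). radius_frac and fade
    are illustrative, not tuned, values."""
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2.0, xx - w / 2.0)
    disk = (r <= radius_frac * min(h, w)).astype(float)
    return np.clip(gaussian_filter(disk, sigma=fade), 0.0, 1.0)

def corrupt_patch(x, y, mask):
    """Merge original patch x with replacement y: x_hat = M*y + (1 - M)*x."""
    m = mask[..., None] if mask.ndim == 2 else mask  # broadcast over color channels
    return m * y + (1.0 - m) * x
```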
Figure 5. Illustration of our network architecture SDC-CCM including the convex combination module (CCM) marked in brown and the skip-connections represented by the horizontal arrows. Without these additional elements, we obtain our baseline architecture SDC-AE.
Figure 6. Illustration of the CCM module. The module receives two inputs: $x_s$ along the skip connection and $x_c$ from the current layer below. In the first step (image on the left), we compute the squared difference of the two inputs and stack it together with the original values, $[x_s, x_c, (x_s - x_c)^2]$. This combined feature map is processed by two convolutional layers. The first layer uses batch normalization with ReLU activation. The second layer uses batch normalization and a sigmoid activation function to produce a coefficient matrix $\beta$. In the second step (image on the right), we compute the output of the module as a (component-wise) convex combination $x_o = \beta \cdot x_s + (\mathbf{1} - \beta) \cdot x_c$, where $\mathbf{1}$ is a tensor of ones.
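Read literally, the caption translates into a small PyTorch module. The following sketch assumes 3 × 3 kernels and equal channel widths for $x_s$ and $x_c$, neither of which is fixed by the figure.

```python
import torch
import torch.nn as nn

class CCM(nn.Module):
    """Convex combination module (Figure 6), a minimal sketch.
    Kernel size and channel widths are assumptions."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x_s, x_c):
        z = torch.cat([x_s, x_c, (x_s - x_c) ** 2], dim=1)  # stack [x_s, x_c, (x_s - x_c)^2]
        z = torch.relu(self.bn1(self.conv1(z)))             # conv + BN + ReLU
        beta = torch.sigmoid(self.bn2(self.conv2(z)))       # conv + BN + sigmoid -> gate
        return beta * x_s + (1.0 - beta) * x_c              # component-wise convex combination
```

Dropping these modules together with the skip connections recovers the SDC-AE baseline, as noted in Figure 5.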
Figure 7. Illustration of the general concept of the orthogonal projection $f$ onto a data manifold $D$. Here, anomalous samples $\hat{x} \in \mathbb{R}^n$ (red dots) are projected to points $x := f(\hat{x}) \in D$ (blue dots) in a way that minimizes the distance $d(\hat{x}, x) = \inf_{y \in D} d(\hat{x}, y)$.
Figure 8. Illustration of the connections between the different types of regularized autoencoders. For a small variance of the corruption noise, the DAE becomes similar to the CAE. This, in turn, gives rise to the RCAE, where the contraction is imposed explicitly on the whole reconstruction mapping. A special instance of the PAE given by the orthogonal projection yields an optimal solution for the optimization problem of the RCAE. On the other hand, the training objective for the PAE can be seen as an extension of the DAE to more complex input modifications beyond additive noise. Finally, a common variant of the sparse autoencoder (SAE) applies an $l_1$ penalty to the hidden units, resulting in saturation toward zero similar to the CAE.
Figure 9. Illustration of the conservation effect of orthogonal projections with respect to different $l_p$-norms. Here, the anomalous sample $\hat{x}$ is orthogonally projected onto the manifold $D$ (depicted by a red ellipsoid) according to $\|\hat{x} - y_p^*\|_p = \inf_{y \in D} \|\hat{x} - y\|_p$ for $p \in \{1, 2, \infty\}$. The remaining three colors (green, blue, and yellow) represent rescaled unit circles around $\hat{x}$ with respect to the $l_1$, $l_2$, and $l_\infty$ norms. The intersection point of each circle with $D$ marks the orthogonal projection of $\hat{x}$ onto $D$ for the corresponding norm. We can see that projections $y_p^*$ for lower p-values better preserve the content of $\hat{x}$, owing to the higher sparsity of the difference $\hat{x} - y_p^*$, which results in smaller modified regions $S(\hat{x}, y_p^*)$.
Figure 10. Illustration of the concept of a transition set. Consider a 2D image tensor identified with a column vector $x \in \mathbb{R}^n$, $n = 20^2$, which is partitioned according to $S \subset \{1, \dots, n\}$ (gray area) and $\bar{S} := \{1, \dots, n\} \setminus S$ (union of light blue and dark blue areas). The transition set $B$ (dark blue area) glues the two disconnected sets $S$ and $\bar{S} \setminus B$ together such that $x \in D$ is feasible.
Figure 11. Illustration of our anomaly segmentation results (with SDC-CCM) as an overlay of the original image and the anomaly heatmap. Each row shows three random examples from a category (carpet, grid, leather, transistor, and cable) in the MVTec AD dataset. In each pair, the first image represents the input to the model and the second image a corresponding anomaly heatmap.
Table 1. Our network architecture SDC-AE.

Layer Type                         | Number of Filters | Filter Size     | Output Size
Conv                               | 64                | 3 × 3           | 512 × 512 × 64
Conv                               | 64                | 3 × 3           | 512 × 512 × 64
MaxPool                            | –                 | 2 × 2, stride 2 | 256 × 256 × 64
Conv                               | 128               | 3 × 3           | 256 × 256 × 128
Conv                               | 128               | 3 × 3           | 256 × 256 × 128
MaxPool                            | –                 | 2 × 2, stride 2 | 128 × 128 × 128
Conv                               | 256               | 3 × 3           | 128 × 128 × 256
Conv                               | 256               | 3 × 3           | 128 × 128 × 256
MaxPool                            | –                 | 2 × 2, stride 2 | 64 × 64 × 256
SDC (dilations 1, 2, 4, 8, 16, 32) | 64 × 6            | 3 × 3           | 64 × 64 × 384
SDC (dilations 1, 2, 4, 8, 16, 32) | 64 × 6            | 3 × 3           | 64 × 64 × 384
SDC (dilations 1, 2, 4, 8, 16, 32) | 64 × 6            | 3 × 3           | 64 × 64 × 384
SDC (dilations 1, 2, 4, 8, 16, 32) | 64 × 6            | 3 × 3           | 64 × 64 × 384
TranConv                           | 256               | 3 × 3, stride 2 | 128 × 128 × 256
Conv                               | 256               | 3 × 3           | 128 × 128 × 256
Conv                               | 256               | 3 × 3           | 128 × 128 × 256
TranConv                           | 128               | 3 × 3, stride 2 | 256 × 256 × 128
Conv                               | 128               | 3 × 3           | 256 × 256 × 128
Conv                               | 128               | 3 × 3           | 256 × 256 × 128
TranConv                           | 64                | 3 × 3, stride 2 | 512 × 512 × 64
Conv                               | 64                | 3 × 3           | 512 × 512 × 64
Conv                               | 64                | 3 × 3           | 512 × 512 × 64
Conv                               | 3                 | 1 × 1           | 512 × 512 × 3
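The SDC rows in Table 1 list six dilation rates and 64 × 6 filters, which is consistent with six parallel dilated 3 × 3 convolutions whose outputs are concatenated into 384 channels, in the spirit of stacked dilated convolutions [99]. The following PyTorch sketch makes that reading explicit; the parallel-and-concatenate layout and the ReLU activation are assumptions not fixed by the table.

```python
import torch
import torch.nn as nn

class SDCBlock(nn.Module):
    """Stacked dilated convolution layer (Table 1), a minimal sketch:
    six parallel 3x3 convolutions with dilations 1..32 and 64 filters
    each, concatenated to 64 x 6 = 384 output channels."""
    def __init__(self, in_channels, filters=64, rates=(1, 2, 4, 8, 16, 32)):
        super().__init__()
        # padding == dilation preserves the spatial size for 3x3 kernels
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, filters, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )

    def forward(self, x):
        return torch.relu(torch.cat([b(x) for b in self.branches], dim=1))
```

Under this reading, the first block in the bottleneck maps the 64 × 64 × 256 feature map to 64 × 64 × 384, and the remaining three blocks operate on 384 input channels.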
Table 2. Experimental results for anomaly segmentation measured with pixel-level AUROC on the MVTec AD dataset.

Category   | AnoGAN | VAE  | LSR  | RIAD | CutPaste | InTra | DRAEM | SimpleNet | PatchCore | MSFlow | PNI  | SDC-AE | SDC-CCM
carpet     | 54     | 78   | 94   | 96.3 | 98.3     | 99.2  | 95.5  | 98.2      | 98.7      | 99.4   | 99.4 | 99.7   | 99.8
grid       | 58     | 73   | 99   | 98.8 | 97.5     | 98.8  | 99.7  | 98.8      | 98.8      | 99.4   | 99.2 | 99.7   | 99.8
leather    | 64     | 95   | 99   | 99.4 | 99.5     | 99.5  | 98.6  | 99.2      | 99.3      | 99.7   | 99.6 | 99.7   | 99.7
tile       | 50     | 80   | 88   | 89.1 | 90.5     | 94.4  | 99.2  | 97.0      | 96.3      | 98.2   | 98.4 | 99.2   | 99.2
wood       | 62     | 77   | 87   | 85.8 | 95.5     | 88.7  | 96.4  | 94.5      | 95.2      | 97.1   | 97.0 | 98.4   | 98.4
avg. tex.  | 57.6   | 80.6 | 93.4 | 93.9 | 96.3     | 96.1  | 97.9  | 97.5      | 97.7      | 98.8   | 98.7 | 99.3   | 99.4
bottle     | 86     | 87   | 95   | 98.4 | 97.6     | 97.1  | 99.1  | 98.0      | 98.6      | 99.0   | 98.9 | 98.6   | 98.9
cable      | 86     | 87   | 95   | 94.2 | 90.0     | 91.0  | 94.7  | 97.6      | 98.7      | 98.5   | 99.1 | 98.2   | 98.5
capsule    | 84     | 74   | 93   | 92.8 | 97.4     | 97.7  | 94.3  | 98.9      | 99.1      | 99.1   | 99.3 | 99.1   | 99.1
hazelnut   | 87     | 98   | 95   | 96.1 | 97.3     | 98.3  | 99.7  | 97.9      | 98.8      | 98.7   | 99.4 | 98.9   | 99.1
metal nut  | 76     | 94   | 91   | 92.5 | 93.1     | 93.3  | 99.5  | 98.8      | 99.0      | 99.3   | 99.3 | 98.5   | 98.5
pill       | 87     | 83   | 91   | 95.7 | 95.7     | 98.3  | 97.6  | 98.6      | 98.6      | 98.8   | 99.0 | 99.3   | 99.3
screw      | 80     | 97   | 96   | 98.8 | 96.7     | 99.5  | 97.6  | 99.3      | 99.5      | 99.1   | 99.6 | 99.7   | 99.7
toothbrush | 90     | 94   | 97   | 98.9 | 98.1     | 98.9  | 98.1  | 98.5      | 98.9      | 98.5   | 99.1 | 99.1   | 99.4
transistor | 80     | 93   | 91   | 87.7 | 93.0     | 96.1  | 90.9  | 97.6      | 97.1      | 98.3   | 98.0 | 98.6   | 98.9
zipper     | 78     | 78   | 98   | 97.8 | 99.3     | 99.2  | 98.8  | 98.9      | 99.0      | 99.2   | 99.4 | 99.5   | 99.6
avg. obj.  | 83.4   | 88.5 | 94.6 | 95.3 | 95.8     | 96.9  | 97.0  | 98.4      | 98.7      | 98.8   | 99.1 | 99.0   | 99.1
avg. all   | 74.8   | 85.9 | 94.2 | 94.8 | 96.0     | 96.7  | 97.3  | 98.1      | 98.4      | 98.8   | 99.0 | 99.1   | 99.2
Table 3. Experimental results for anomaly recognition measured with image-level AUROC on the MVTec AD dataset.

Category      | AnoGAN | VAE  | LSR  | RIAD | CutPaste | InTra | DRAEM | SimpleNet | PatchCore | MSFlow | PNI  | SDC-AE | SDC-CCM
carpet        | 49     | 78   | 71   | 84.2 | 93.1     | 98.8  | 97.0  | 99.7      | 98.2      | 100    | 100  | 100    | 100
grid          | 51     | 73   | 91   | 99.6 | 99.9     | 100   | 99.9  | 99.7      | 98.3      | 99.8   | 98.4 | 100    | 100
leather       | 52     | 95   | 96   | 100  | 100      | 100   | 100   | 100       | 100       | 100    | 100  | 100    | 100
tile          | 51     | 80   | 95   | 93.4 | 93.4     | 98.2  | 99.6  | 99.8      | 98.9      | 100    | 100  | 100    | 100
wood          | 68     | 77   | 96   | 93.0 | 98.6     | 97.5  | 99.1  | 100       | 99.9      | 100    | 99.6 | 100    | 100
avg. textures | 54.2   | 80.6 | 89.8 | 94.0 | 97.0     | 98.9  | 99.1  | 99.8      | 99.0      | 99.9   | 99.6 | 100    | 100
bottle        | 69     | 87   | 99   | 99.9 | 98.3     | 100   | 99.2  | 100       | 100       | 100    | 100  | 100    | 100
cable         | 53     | 90   | 72   | 81.9 | 80.6     | 70.3  | 91.8  | 99.9      | 99.7      | 99.5   | 99.8 | 98.0   | 99.9
capsule       | 58     | 74   | 68   | 88.4 | 96.2     | 86.5  | 98.5  | 97.7      | 98.1      | 99.2   | 99.7 | 98.2   | 98.8
hazelnut      | 50     | 98   | 94   | 83.3 | 97.3     | 95.7  | 100   | 100       | 100       | 100    | 100  | 100    | 100
metal nut     | 50     | 94   | 83   | 88.5 | 99.3     | 96.9  | 98.7  | 100       | 100       | 100    | 100  | 100    | 100
pill          | 62     | 83   | 68   | 83.8 | 92.4     | 90.2  | 98.9  | 99.0      | 97.1      | 99.6   | 96.9 | 99.5   | 99.6
screw         | 35     | 97   | 80   | 84.5 | 86.3     | 95.7  | 93.9  | 98.2      | 99.0      | 97.8   | 99.5 | 98.9   | 98.9
toothbrush    | 57     | 94   | 92   | 100  | 98.3     | 100   | 100   | 99.7      | 98.9      | 100    | 99.7 | 100    | 100
transistor    | 67     | 93   | 73   | 90.9 | 95.5     | 95.8  | 93.1  | 100       | 99.7      | 100    | 100  | 98.8   | 100
zipper        | 59     | 78   | 97   | 98.1 | 99.4     | 99.4  | 100   | 99.9      | 99.7      | 100    | 99.9 | 100    | 100
avg. objects  | 56.0   | 88.8 | 82.6 | 89.9 | 94.4     | 93.1  | 97.4  | 99.5      | 99.2      | 99.6   | 99.5 | 99.3   | 99.7
avg. all      | 55.4   | 86.1 | 85.0 | 91.3 | 95.2     | 95.0  | 98.0  | 99.6      | 99.2      | 99.7   | 99.6 | 99.6   | 99.8