1. Introduction
SAR automatic target recognition (ATR) refers to the use of a computerized tool to predict the class of a target in SAR images, or to describe attributes of interest such as its geometric and physical properties, without direct manual intervention. A standard SAR ATR architecture proposed by the MIT Lincoln Laboratory comprises three stages: detection, discrimination, and classification [1,2]. Detection extracts candidate targets from SAR images using a constant false alarm rate (CFAR) detector; the output may include not only targets of interest but also false-alarm clutter. At the subsequent discrimination stage, several features are selected to train a discriminator that solves the two-class (target versus clutter) problem and eliminates false alarms. Finally, the classifier categorizes each input as a specific target type. In this paper, 'recognition' means the third stage, that is, classification into different types. Some applications involve a further, more advanced identification process, which is not discussed in this paper. Factors such as imaging angles, target configuration and background conditions have significant impacts on SAR image classification. Therefore, extracting good feature representations that are insensitive to these factors is particularly important for developing an effective SAR ATR system.
The development of feature extraction can be divided into three stages: feature definition, feature expression and feature learning. Underlying features are initially defined based on researchers' empirical knowledge or their understanding of the obtained images; for example, an image is usually described by color, texture, shape or pixel statistical distribution characteristics. On the basis of these underlying features, better feature expressions can be extracted by vector quantization, coding or kernel descriptions; dictionary learning and sparse coding are frequently used algorithms. However, features extracted in these two stages are designed or selected for specific tasks, which requires a wealth of empirical knowledge and a deep understanding of the imagery; the process is also time consuming and generalizes poorly.
Extracted features are then used for target classification, for which there are three mainstream paradigms: template matching, model-based methods, and machine learning. In template matching, a distance is measured between the target image and a template database. The semi-automated image intelligence processing (SAIP) system [3] uses a mean-square-error classifier to find the best match between the target data and the template database. The classification accuracy of SAIP is satisfactory when the target configuration is similar to those in the template database, but the method has two main disadvantages: distance calculation and target search are inefficient, and performance degrades significantly when the target changes. In model-based methods, a target CAD (computer-aided design) model and electromagnetic simulation software are used to predict SAR images for the categories and poses hypothesized for the SAR image chip to be identified; a set of predicted features is then compared with the features extracted from the actual SAR image chips [4].
Manually designed features are also used in target classification. For example, Lance designed an extended fractal (EF) feature [5], computed at different scales of the SAR image with an ordered-statistics filter, which achieved high detection accuracy. Unlike traditional SAR detection features, which distinguish target pixels from the background solely on the basis of contrast, the EF feature is sensitive to both the contrast and the size of objects.
With the rapid development of machine learning, popular methods such as the support vector machine (SVM) [6], AdaBoost [7] and convolutional neural networks (CNN) [8] have all been adopted for SAR ATR, yielding substantial performance gains. In 2006, Hinton [9] proposed an effective training method for deep belief networks (DBN), which are stacked from several Restricted Boltzmann Machines (RBM), and achieved an accuracy of 98.8% on a handwritten digit dataset. Since then, the machine learning field has seen a surge of studies on deep learning. Feature extraction has accordingly entered the stage of feature learning: network models of different sizes are built, features are learned automatically from a large number of target samples with different learning methods, and a classifier then performs classification or identification without manual intervention. Feature learning eliminates task-specific feature engineering, and the resulting methods are versatile across different tasks.
However, deep learning techniques require a large amount of training data to achieve reliable feature extraction. This is sometimes not feasible for SAR ATR, where data resources are scarce, in particular for some targets of interest [10]. Researchers are working on SAR image dataset development, such as the ship images of the GF-3 satellite [11]; nevertheless, we have to seek alternative, non-data-intensive approaches. One way is to make good use of a priori knowledge when designing the neural network, for example through designed feature filters, SDDLRC [12], and shape prior models [13]. For image classification, the ideal feature representation should be invariant to translation, rotation, and scale transformation, and stable to perturbations and minor deformations. At the same time, the designed features should be similar within a target category and distinct across categories. The Fourier transform is translation invariant, but it is not Lipschitz continuous to deformation and, in particular, is unstable to local deformation of high-frequency components. The wavelet transform overcomes this instability, but it is covariant, not invariant, to translation. In order to extract features that are invariant to translation, rotation and scale transformation and also insensitive to perturbations and minor deformations, Mallat et al. [14] proposed a scattering operator based on the wavelet transform in 2012. Features extracted by this operator are invariant to affine transformation and elastic deformation, and insensitive to illumination. In 2013, they proposed the wavelet-scattering convolution network (WSCN) based on the wavelet scattering operator [15]. This network has a multilayer structure similar to a deep convolutional neural network (CNN): each layer performs both a linear and a nonlinear operation. Each layer first convolves the input signal with predefined complex wavelet filters, then applies a modulus nonlinearity to the result, and finally computes a local average with a low-pass filter. The WSCN achieves very good classification results on handwritten digit recognition and texture classification [15]. In 2015, they proposed a deep roto-translation scattering network that is invariant to both local translation and rotation, and achieved competitive classification results on the complex object image databases Caltech and CIFAR [16].
The roto-translation scattering network employs Morlet wavelets as convolutional filters to detect invariant features over spatial and angular variables. The hierarchical cascade of Morlet wavelets is computed with a filter bank of dilated and rotated wavelets with no orthogonality property; the feature vectors extracted in this way are stable and invertible as the filters are rotated and scaled over the frequency plane [10]. As mentioned above, good feature representations greatly improve the performance of SAR image classification. Features extracted by the deep roto-translation scattering network are invariant to local translation, linearize variations along rotation angles, and are stable to perturbations and minor deformations. Most importantly, the WSCN structure exploits a priori knowledge to reduce the number of unknown network parameters and thus its dependence on the volume and variety of training data.
In this paper, we report a study applying the deep roto-translation WSCN algorithm to SAR ATR with the MSTAR benchmark dataset. The major objective is to demonstrate the superiority of WSCN as applied to SAR ATR through extensive experiments. The method uses the roto-translation scattering convolution network to extract the target scattering energy characteristics of the SAR image, and then uses the extracted features to train a Gaussian-kernel support vector machine (SVM) for classification. The major contribution of this paper is threefold:
It adapts the roto-translation invariant WSCN, for the first time, to SAR ATR tasks and tests its performance on the benchmark dataset;
It conducts extensive data experiments with the designed algorithm and evaluates the merits of WSCN under both standard and extended operating conditions;
It reveals that employing roto-translation invariant features can increase the robustness of ATR and reduce its dependency on the amount of training data, which is one of the major hindrances in deep learning-based SAR ATR.
The remainder of this paper is organized as follows. Section 2 introduces the architecture of the proposed ATR network. In Section 3, experimental results on the MSTAR dataset are presented and discussed. Section 4 discusses the proposed method by comparing it with several state-of-the-art methods. Section 5 concludes the paper.
2. Wavelet-Scattering Convolution Network for SAR ATR
2.1. Scattering Wavelet
The wavelet transform is a type of multiscale filter. A multiresolution wavelet family can be obtained by applying scaling and rotation to a band-pass filter $\psi$:

$$\psi_{\lambda}(u) = 2^{-2j}\,\psi\!\left(2^{-j}r^{-1}u\right), \qquad \lambda = 2^{-j}r, \quad (1)$$

where $r \in G$ ($G$ is a finite rotation group), $j$ characterizes the change in scale, and $r$ represents the change in direction. The wavelet transform of a signal $x$ can be expressed as:

$$x \star \psi_{\lambda}(u) = \int x(v)\,\psi_{\lambda}(u - v)\,dv. \quad (2)$$

If the Fourier transform $\hat{\psi}(\omega)$ is centered at a frequency $\xi$, then $\hat{\psi}_{\lambda}(\omega) = \hat{\psi}(2^{j}r^{-1}\omega)$, which is centered at $\lambda\xi = 2^{-j}r\xi$ with its bandwidth proportional to $2^{-j}$. The frequency support of the Fourier transform $\hat{\psi}_{\lambda}$ is shown in Figure 1.
The wavelet transform maps local information and represents the local features of the image, but the convolution operation is covariant to translations, so these local features are not translation invariant. To build a translation-invariant representation, it is necessary to introduce a nonlinearity $M$: the transformed coefficients $M(x \star \psi_{\lambda})$ should be stable to deformation. At the same time, the nonlinear operator $M$ must be nonexpansive, so as to ensure stability to additive noise. While satisfying these conditions, it is also necessary to retain the energy information of the signal; the resulting translation-invariant coefficients are the $L^1$ norms:

$$\|x \star \psi_{\lambda}\|_{1} = \int |x \star \psi_{\lambda}(u)|\,du. \quad (3)$$
The $L^1$ norms $\{\|x \star \psi_{\lambda}\|_{1}\}_{\lambda}$ are a rough signal representation that reflects the sparsity of the wavelet coefficients. Although taking the modulus discards the phase of the wavelet transform, the loss of information does not come from this step: it has been proved that $x$ can be reconstructed from the modulus of its wavelet coefficients $\{|x \star \psi_{\lambda}|\}_{\lambda}$ [17]. The loss of information actually comes from the integration of $|x \star \psi_{\lambda}(u)|$, which removes all nonzero frequencies. These can be recovered by calculating the wavelet coefficients $|x \star \psi_{\lambda_1}| \star \psi_{\lambda_2}$ of $|x \star \psi_{\lambda_1}|$. The $L^1$ norms over pairs $\lambda_1$, $\lambda_2$ define a deeper representation of the translation invariance:

$$\big\||x \star \psi_{\lambda_1}| \star \psi_{\lambda_2}\big\|_{1} = \int \big||x \star \psi_{\lambda_1}| \star \psi_{\lambda_2}(u)\big|\,du. \quad (4)$$
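To make Equation (3) concrete, the following minimal numpy sketch (our illustration, not code from the paper) computes $\|x \star \psi\|_1$ for a single Gabor-type filter by FFT convolution and verifies that the coefficient is unchanged under a circular shift of the input; the filter width is an arbitrary choice.

```python
import numpy as np

def gabor(N, sigma=4.0, xi=3 * np.pi / 4, theta=0.0):
    """Gaussian-windowed plane wave (a Morlet without the DC correction)."""
    half = N // 2
    y, x = np.mgrid[-half:half, -half:half].astype(float)
    u = np.cos(theta) * x + np.sin(theta) * y      # oscillation direction
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.exp(1j * xi * u)

N = 64
psi_hat = np.fft.fft2(np.fft.ifftshift(gabor(N)))  # filter spectrum, origin at (0, 0)

def l1_coefficient(img):
    """||x * psi||_1: integrate the modulus of the wavelet coefficients."""
    return np.abs(np.fft.ifft2(np.fft.fft2(img) * psi_hat)).sum()

x = np.random.default_rng(0).random((N, N))
x_shifted = np.roll(x, shift=(5, 9), axis=(0, 1))  # circular translation
print(l1_coefficient(x), l1_coefficient(x_shifted))  # identical values
```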
By further iterating on the wavelet transform and modulus operators, more translation-invariant coefficients can be computed. Let $U[\lambda]x = |x \star \psi_{\lambda}|$; along a path sequence $p = (\lambda_1, \lambda_2, \ldots, \lambda_m)$, an ordered product of nonlinear and noncommuting operators is computed:

$$U[p]x = U[\lambda_m]\cdots U[\lambda_2]U[\lambda_1]x = \big|\cdots\big||x \star \psi_{\lambda_1}| \star \psi_{\lambda_2}\big|\cdots \star \psi_{\lambda_m}\big|, \quad (5)$$

with $U[\varnothing]x = x$. The scattering transformation along the path $p$ is defined as follows:

$$\bar{S}x(p) = \mu_p^{-1}\int U[p]x(u)\,du, \qquad \mu_p = \int U[p]\delta(u)\,du. \quad (6)$$
The scattering coefficient $\bar{S}x(p)$ is translation invariant in $x$. It can be seen from Equation (6) that the transform has many similarities with the Fourier transform modulus, but the wavelet scattering coefficients are Lipschitz continuous to deformation, as opposed to the Fourier transform modulus.
For classification, the extracted local features are usually required to be translation invariant only at scales smaller than a predefined scale $2^J$, while maintaining spatial variability at scales larger than $2^J$. This requires a spatial window $\phi_{2^J}(u) = 2^{-2J}\phi(2^{-J}u)$ that localizes the scattering integral, thus defining a windowed scattering transform:

$$S[p]x(u) = U[p]x \star \phi_{2^J}(u) = \int U[p]x(v)\,\phi_{2^J}(u - v)\,dv, \quad (7)$$

and hence

$$S[p]x(u) = \big|\cdots\big||x \star \psi_{\lambda_1}| \star \psi_{\lambda_2}\big|\cdots \star \psi_{\lambda_m}\big| \star \phi_{2^J}(u), \quad (8)$$

with $S[\varnothing]x = x \star \phi_{2^J}$. The convolution with $\phi_{2^J}$ is essentially an average down-sampling at the scale $2^J$. The windowed scattering operator has local translation invariance and is stable to deformation.
This paper uses the Morlet wavelet as an example of a complex wavelet, which is given by

$$\psi(u) = \alpha\,\big(e^{iu\cdot\xi} - \beta\big)\,e^{-|u|^2/(2\sigma^2)}, \quad (9)$$

where $\beta \ll 1$ is adjusted so that $\int \psi(u)\,du = 0$. The averaging filter $\phi$ is a scaled Gaussian. Figure 2 shows the Morlet wavelet with $\sigma = 0.85$ and $\xi = 3\pi/4$.
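As an illustration of Equations (1) and (9), a sampled Morlet filter bank can be generated as below. This is a sketch under our own sampling conventions; the grid size and the discrete choice of $\beta$ are assumptions, with $\sigma = 0.85$ and $\xi = 3\pi/4$ taken from Figure 2.

```python
import numpy as np

def morlet_2d(N, sigma=0.85, xi=3 * np.pi / 4, theta=0.0, j=0):
    """Sampled Morlet wavelet psi_lambda(u) = 2^{-2j} psi(2^{-j} r^{-1} u),
    with psi(u) = alpha*(exp(i u.xi) - beta)*exp(-|u|^2/(2 sigma^2)).
    alpha (the normalization constant) is set to 1 in this sketch."""
    half = N // 2
    y, x = np.mgrid[-half:half, -half:half].astype(float)
    c, s = np.cos(theta), np.sin(theta)
    u1 = (c * x + s * y) / 2**j                    # 2^{-j} r^{-1} u
    u2 = (-s * x + c * y) / 2**j
    env = np.exp(-(u1**2 + u2**2) / (2 * sigma**2))
    wave = np.exp(1j * xi * u1)
    beta = (env * wave).sum() / env.sum()          # enforces zero mean
    return (wave - beta) * env / 2**(2 * j)

# Filter bank over J scales and L orientations, as in Equation (1)
J, L, N = 3, 6, 64
bank = {(j, k): morlet_2d(N, theta=k * np.pi / L, j=j)
        for j in range(1, J + 1) for k in range(L)}
```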
2.2. Scattering Convolution Network
If $p = (\lambda_1, \ldots, \lambda_m)$ is a path of length $m$, then $S[p]x$ is the $m$-order windowed scattering coefficient, calculated at the $m$-th layer of the network. By further iterating on the wavelet transform and modulus operators, the scattering transform can compute higher-order coefficients. Images are real-valued signals, so it is sufficient to consider "positive" rotations $r \in G^{+}$ with angles in $[0, \pi)$:

$$\Lambda_J = \left\{\lambda = 2^{-j}r \;:\; r \in G^{+},\; j \leq J\right\}, \quad (10)$$

with $G^{+}$ the rotations of $G$ with angles in $[0, \pi)$. It should be noted that $2^J$ and $2^j$ are spatial scale variables, while $\lambda = 2^{-j}r$ is a frequency index giving the location of the frequency support of $\hat{\psi}_{\lambda}$. The following wavelet modulus propagator can then be obtained:

$$U_J x = \big\{x \star \phi_{2^J},\; |x \star \psi_{\lambda}|\big\}_{\lambda \in \Lambda_J}. \quad (11)$$
A wavelet modulus propagator keeps the low-frequency averaging and computes the modulus of the complex wavelet coefficients. High-frequency information is lost by the average pooling, but it can be recovered at the next layer as wavelet coefficients [9]. It is therefore important to build a multilayer network structure. Iterating on $U_J$ constructs a multilayer wavelet-scattering convolution network: applying $U_J$ to all propagated signals $U[p]x$ of the $m$-th layer $P_m$ outputs all scattering signals $S[p]x$ and computes all propagated signals $U[p + \lambda]x$ on the next layer $P_{m+1}$:

$$U_J\,U[p]x = \big\{S[p]x,\; U[p + \lambda]x\big\}_{\lambda \in \Lambda_J}, \quad (12)$$

where $p + \lambda = (\lambda_1, \ldots, \lambda_m, \lambda)$ extends the path $p$.
The wavelet-scattering convolution network is very different from a conventional convolutional network. A conventional convolutional network outputs results only at the last layer, and the parameters of its filter banks must be learned from a large number of data samples, whereas the scattering coefficients of the wavelet-scattering convolution network are output at every layer and the parameters of the filter banks are predefined [18,19]. The wavelet-scattering convolution network only needs to learn the parameters of the final supervised classifier. The literature has shown that the energy of the scattering convolution network is concentrated in a few paths and approaches zero as the path length increases; in addition, the first three layers of the scattering convolution network concentrate most of the image energy [20]. The structure of the scattering convolution network for $m = 3$ is shown in Figure 3. The downward arrows represent the scattering propagation, and the upward arrows output the extracted scattering coefficients.
Approximating the scattering process by a cosine basis along the scale and rotation variables, paths can be parameterized by their scale and rotation indices.
The following example uses a texture image to explain the wavelet scattering network. The input signal is a 2-D texture image, shown in Figure 4a. On layer 0, the scattering coefficient is $S[\varnothing]x = x \star \phi_{2^J}$, as shown in Figure 4b. The scattering coefficients output on layer 1 and layer 2 are shown in Figure 5 and Figure 6, respectively. In this example, $J = 5$ and $L = 6$: the scaling factor of the wavelet function satisfies $2^{j}$, $0 \leq j < J$, and the rotation angle is $\theta = k\pi/L$, $0 \leq k < L$.
The final output $S_Jx$ of the wavelet-scattering convolution network, which is useful for classification, can be expressed as:

$$S_Jx = \big\{S[p]x\big\}_{p \in P_J} = \big\{U[p]x \star \phi_{2^J}\big\}_{p \in P_J}. \quad (13)$$
Mallat et al. have shown in the literature [15] that the wavelet scattering coefficients have the following properties:
Preservation of energy: $\|S_Jx\|^2 = \sum_{p}\|S[p]x\|^2 = \|x\|^2$;
Stability to additive noise: $\|S_Jx' - S_Jx\| \leq \|x' - x\|$;
Translation covariance: the wavelet scattering coefficients translate the same distance as the signal; for $x_c(u) = x(u - c)$, $S_Jx_c(u) = S_Jx(u - c)$;
Local translation invariance: $\lim_{J \to \infty}\|S_Jx_c - S_Jx\| = 0$;
Sensitivity to rotation: for a rotation $r$, $S_J(x \circ r) \neq S_Jx$ in general;
Stability to slight deformation: for $x_\tau(u) = x(u - \tau(u))$, $\|S_Jx_\tau - S_Jx\| \leq C\,\|\nabla\tau\|_{\infty}\,\|x\|$ up to terms controlled by the deformation size.
The scattering coefficients are insensitive to local translation, noise, and slight deformation, eliminating some of the factors that interfere with signal classification. In summary, the wavelet scattering coefficients are a good choice of feature representation: they require no training but preserve a hierarchical structure.
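A quick numerical check of the noise-stability and local-translation-invariance properties, using the hypothetical scattering() helper from the sketch above (ours, not the paper's code), could look as follows. Note that the nonexpansive bound holds exactly only for a properly normalized Littlewood-Paley filter bank, which our illustrative bank only approximates.

```python
import numpy as np  # scattering() as in the sketch of Section 2.2

rng = np.random.default_rng(0)
x = rng.random((64, 64))

# Stability to additive noise (theory: ||S x' - S x|| <= ||x' - x||)
noise = 0.05 * rng.standard_normal((64, 64))
print(np.linalg.norm(scattering(x + noise) - scattering(x)),
      np.linalg.norm(noise))

# Local translation invariance (small shift vs. window 2^J = 8)
Sx = scattering(x)
d = np.linalg.norm(scattering(np.roll(x, 2, axis=1)) - Sx) / np.linalg.norm(Sx)
print(f"relative change under a 2-pixel shift: {d:.3f}")  # small
```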
2.3. Deep Roto-Translation Scattering Network
The wavelet coefficients $S_Jx$ of the previous subsection satisfy only local translation invariance; they cannot reduce the interference that rotation changes cause in signal classification. The wavelet-scattering convolution network (WSCN) can flexibly set the wavelet basis functions so that the final output is also insensitive to rotation changes. In 2015, Mallat proposed a deep roto-translation scattering network [16] that is insensitive to both local translation and rotation changes. The main idea, for a two-layer wavelet scattering network, is that the first layer computes a 2-D wavelet transform $|x \star \psi_{\theta,j}(u)|$ along the spatial variable $u$ to realize local translation invariance, while the second layer computes a 3-D wavelet transform of the first-layer coefficients along both the spatial variable $u$ and the angle variable $\theta$ to realize local rotation invariance.
The specific process is described in detail as follows:
For the first layer of the wavelet-scattering convolution network, the wavelet function is a rotated and scaled transform of the band-pass filter $\psi$:

$$\psi_{\theta,j}(u) = 2^{-2j}\,\psi\!\left(2^{-j}r_{\theta}^{-1}u\right). \quad (14)$$

The Morlet wavelet is still chosen here. The original input signal is convolved with $\psi_{\theta,j}$, the modulus is taken, and the result is subsampled at intervals of $2^{j}$, where $\theta = k\pi/L$ ($0 \leq k < L$) and $1 \leq j \leq J$. The intermediate result for the first layer of the network is:

$$x_1(u, \theta, j) = \big|x \star \psi_{\theta,j}\big|(2^{j}u). \quad (15)$$
For the second layer of the wavelet-scattering convolution network, a 3-D wavelet function is selected:

$$\Psi_{j,\beta,k}(u, \theta) = \psi_{\beta,j}(u)\,\overline{\psi}_{k}(\theta), \quad (16)$$

where $\psi_{\beta,j}$ is a 2-D Morlet wavelet, $\beta$ is the rotation angle parameter, and $\overline{\psi}_{k}$ is a 1-D wavelet function of the variable $\theta$, with scale $2^{k}$ ($1 \leq k \leq K < \log_2 L$).
For any fixed scale $j_1$, the intermediate result $x_1(u, \theta, j_1)$ is convolved with the 3-D wavelet function $\Psi_{j_2,\beta,k}$ along the spatial variable $u$ and the rotation angle variable $\theta$, the modulus is taken, and the result is subsampled along both variables. The final intermediate result for the second layer is:

$$x_2(u, \theta, j_1, j_2, \beta, k) = \big|x_1(\cdot, \cdot, j_1) \star \Psi_{j_2,\beta,k}\big|\!\left(2^{j_2}u,\, 2^{k}\theta\right). \quad (17)$$
The final output $Sx$ is obtained by averaging the input $x$, the first-layer intermediate result $x_1$, and the second-layer intermediate result $x_2$ with a spatial convolution with $\phi_{2^J}$:

$$Sx = \big\{x \star \phi_{2^J},\; x_1 \star \phi_{2^J},\; x_2 \star \phi_{2^J}\big\}. \quad (18)$$
The wavelet scattering coefficients now have local translation and rotation invariance, and are insensitive to perturbations and slight deformations. The filtering and averaging along $\theta$ reduce the adverse effects of rotation changes on signal classification, and help to improve the accuracy of complex signal classification.
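The new ingredient relative to Section 2.2 is the 1-D filtering along the angle variable $\theta$ in Equations (16) and (17). The sketch below is our illustration: the Haar-style angular wavelet is an arbitrary stand-in for $\overline{\psi}_k$, and it applies a periodic convolution across the orientation axis of the first-layer coefficients.

```python
import numpy as np

def angular_wavelet(L, k):
    """Illustrative 1-D periodic wavelet along theta at scale 2**k."""
    h = np.zeros(L)
    w = 2**k
    h[:w], h[w:2 * w] = 1.0 / w, -1.0 / w   # Haar-style profile
    return h

def filter_along_theta(x1, k):
    """Periodic convolution along the angle axis: the angular part of the
    separable 3-D wavelet Psi(u, theta) = psi(u) * psibar_k(theta)."""
    h_hat = np.fft.fft(angular_wavelet(x1.shape[0], k))
    return np.fft.ifft(np.fft.fft(x1, axis=0) * h_hat[:, None, None], axis=0)

# x1: first-layer modulus coefficients for one scale j1, shape (L, N, N)
L, N = 6, 64
x1 = np.random.default_rng(1).random((L, N, N))
x2 = np.abs(filter_along_theta(x1, k=1))  # spatial filtering/subsampling follows
print(x2.shape)                           # (6, 64, 64)
```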
This paper then trains a Gaussian-kernel support vector machine on the wavelet scattering coefficients $Sx$ to realize SAR automatic target recognition. The overall architecture is depicted in Figure 7.
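A sketch of the final supervised stage in scikit-learn is shown below; the feature matrices, the PCA dimension-reduction step (the paper mentions a dimension-reduction stage in Section 5), and all hyperparameter values are placeholders, not the authors' settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: flattened scattering coefficients per SAR chip
rng = np.random.default_rng(0)
X_train = rng.random((200, 127 * 8 * 8)); y_train = rng.integers(0, 10, 200)
X_test = rng.random((50, 127 * 8 * 8));   y_test = rng.integers(0, 10, 50)

clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=100),                      # feature dimension reduction
    SVC(kernel="rbf", C=10.0, gamma="scale"),   # Gaussian-kernel SVM
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```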
3. Experiments on the MSTAR Benchmark Dataset
The experimental data used in this paper were collected by Sandia National Laboratory (SNL) SAR sensors under the moving and stationary target acquisition and recognition (MSTAR) project [4], jointly sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL). The project collected hundreds of thousands of SAR images of ground military targets, covering different target classes, aspect and depression angles, barrel steering, configuration changes and variants, but only a small portion is available on the website for open access [21]. The released MSTAR dataset contains the 10 classes of ground military targets listed in Table 1. The images were collected by an X-band SAR sensor in a 0.3 m resolution spotlight mode with full aspect coverage (0° to 360°), against a relatively flat grass or exposed-soil background. It should also be noted that the released data contain only stationary targets. Figure 8 shows optical images of the 10 classes of military targets and the corresponding SAR images at the same aspect angle.
The algorithm is tested under both standard operating conditions (SOC) and extended operating conditions (EOC) in order to fully assess its robustness. Under standard operating conditions, the configurations and serial numbers of the training and testing SAR images are the same, but the depression and aspect angles differ. Extended operating conditions involve significant differences between the training and testing SAR images, mainly due to large changes of depression angle and configuration, as well as version variants. Configuration changes mean the addition or removal of discrete components on the target, such as auxiliary fuel barrels.
3.1. SOC Results
Under standard operating conditions, the method is tested on the classification of 10 classes. The serial numbers of the training and test sets, the depression angles, and the number of samples per class are shown in Table 2. Each target class has the same serial number in the training and test sets, but the depression and aspect angles differ: the training SAR images were collected at a 17° depression angle, while the test SAR images were collected at a 15° depression angle. No image preprocessing was applied to the SAR images.
Table 3 shows the correct classification coefficients and confusion matrix for the classification of the 10 target classes under SOC. Each row of the confusion matrix represents the actual target class, and each column the class predicted by the network. The percentage correctly classified ($P_{cc}$) is used to assess the performance of the ATR, defined as $P_{cc} = N_c/N \times 100\%$, where $N_c$ is the number of correctly classified samples and $N$ is the total number of samples. The proposed method achieves state-of-the-art performance for the classification of the 10 MSTAR target classes under standard operating conditions, with an overall accuracy of 97.63% on the SOC dataset. The kappa coefficient is 0.97, which indicates that the method is stable across the 10 target classes. The correct classification coefficients all exceed 96%, except for the 2S1, part of whose samples were misclassified as T-62 and BTR-70. As seen in Figure 8, the structures of the T-62 and 2S1 are similar; moreover, the barrel is hardly visible in SAR images, so the SAR images of the 2S1 and BTR-70 are also similar.
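For reference, the reported quantities ($P_{cc}$, kappa coefficient, confusion matrix) can be computed from a vector of predictions as in the short sketch below; the labels are placeholders, and cohen_kappa_score and confusion_matrix are standard scikit-learn calls.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

y_true = np.array([0, 0, 1, 1, 2, 2, 2])   # placeholder ground-truth labels
y_pred = np.array([0, 0, 1, 2, 2, 2, 2])   # placeholder predictions

pcc = 100.0 * np.mean(y_true == y_pred)    # percentage correctly classified
kappa = cohen_kappa_score(y_true, y_pred)  # chance-corrected agreement
cm = confusion_matrix(y_true, y_pred)      # rows: actual, columns: predicted
print(f"Pcc = {pcc:.2f}%, kappa = {kappa:.3f}")
print(cm)
```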
3.2. EOC Results
A SAR image is quite sensitive to changes of depression angle; even a slight change can result in a very different image. As shown in Table 4, only four target classes in the MSTAR dataset contain SAR images at a 30° depression angle: 2S1, BRDM-2, T-72, and ZSU-234. Therefore, the SAR images of these four classes at a 17° depression angle are used for training, and those at a 30° depression angle for testing. The correct classification coefficients and confusion matrix for this significant change of depression angle, denoted EOC-1, are shown in Table 5. The overall accuracy under EOC-1 is 82.46%, with a kappa coefficient of 0.766. SAR is sensitive to the incidence angle, and EOC-1 involves a significant change of depression angle; the features of the SAR images therefore change, which degrades the correct classification coefficient.
The extended operating conditions also include configuration variants and version variants, denoted EOC-2. Configuration variants mainly concern whether protective plates are installed on both sides of the tank tracks, whether fuel barrels are installed at the tank tail, and the rotation of the turret and barrel, while version variants refer to different versions of a target. The algorithm is tested under these conditions to evaluate its classification performance. SAR images of four target classes, namely BMP-2, BRDM-2, BTR-70, and T-72, at a 17° depression angle are used as the training set, as shown in Table 2. Two version variants of BMP-2 and ten version variants of T-72, collected at 17° and 15° depression angles, are listed in Table 6 and Table 7, respectively, as two groups of test sets. It is worth mentioning that the serial numbers in the test sets are not included in the training set. The correct classification coefficients and confusion matrices are listed in Table 8 and Table 9. WSCN shows stable performance on the configuration variants of T-72 and BMP-2: the correct classification coefficient is 94.14% for five version variants of T-72, and 89.76% for five versions of T-72 together with two version variants of BMP-2.
It can be seen that a significant change of depression angle has a great influence on the classification result. Details of the EOC-1 data are shown in Table 10, and the correct classification coefficients and confusion matrix obtained when training the network on the 10-class set and testing on the 4-class set are shown in Table 11. Due to the large difference between the training and test data, the accuracy decreases to 74.37% from the 82.46% of the original EOC-1 experiment. As shown in Table 11, some ZSU-234 samples are misclassified as D7, leading to the lower accuracy.
4. Discussion
The performance of WSCN is compared with several widely cited and recently proposed methods, as well as our previous work [1], in Table 12. The methods include the conditional Gaussian model (Cond Gauss) [22], monogenic scale space (MSS) [23], the modified polar mapping classifier (M-PMC) [24], and information-decoupled representation (IDR) [25]. Note that the testing samples used by MSS and IDR under EOC-1 contain only three classes, whereas ours contain four, and that the testing samples used by M-PMC under EOC-2 include samples at both 15° and 17° depression angles. The classification performance of A-ConvNets [1] is slightly better, which is reasonable because A-ConvNets is a fully trainable network including the feature extraction part, while our approach employs a fixed feature extraction network. Fully trainable approaches such as A-ConvNets have some inherent shortcomings. First, a large number of training samples is needed to avoid overfitting. Second, many hyperparameters must be optimized through repeated manual trials. Finally, a deep neural network is a black box that is difficult to understand and diagnose: its parameters are typically initialized randomly and then optimized depending only on the training samples, so the network's internal procedure and final state are unknown and unpredictable. The proposed WSCN, by contrast, is fully based on a rational design backed by mathematical theory. In these regards, the proposed WSCN is preferable despite its slightly worse performance. An additional experiment with A-ConvNets was conducted on the same dataset as the WSCN; the results indicate that WSCN can efficiently recognize targets with configuration changes, but remains sensitive to angle changes. As opposed to deep neural networks, the filters of each layer of the wavelet-scattering convolution network are predefined, and only the final supervised classifier is learned. The parameters to be learned from the training samples are therefore greatly reduced, which lowers both the probability of overfitting and the number of training samples required. Moreover, the number of trials is reduced because very few hyperparameters require manual adjustment. In addition, it can be proved mathematically that, by constructing specific wavelet functions, the output scattering coefficients of the wavelet-scattering convolution network are invariant to local translation and rotation, as well as insensitive to perturbations and slight deformations.
5. Conclusions
This paper presents a SAR automatic target classification method based on a wavelet-scattering convolution network. By introducing a deep roto-translation scattering network with complex wavelet filters over spatial and angular variables, robust feature representations can be extracted across multiple scales and angles. The parameters of the WSCN are predefined rather than randomly initialized as in deep neural networks, so the feature extractor itself requires no training samples. A CNN is trained with the back-propagation algorithm, which optimizes the parameters according to the training samples; the parameters therefore end up in an unknown and unpredictable state, and the optimization depends only on the input samples at each training step. Unlike a CNN, the design of the WSCN is based purely on a priori knowledge and mathematical principles. The proposed algorithm was verified on the MSTAR benchmark dataset under both SOC and EOC cases. Experimental results show that an accuracy of 97.63% was obtained under SOC, 82.46% for a significant change of depression angle from 17° to 30°, 94.14% for configuration variants of the T-72 tank, and 89.76% for version variants of T-72 and BMP-2. The proposed method is robust to configuration variants and achieves acceptable accuracy under significant variation of depression angle. The results indicate that the proposed method yields results comparable to a state-of-the-art deep neural network method which, on the other hand, requires a significantly larger amount of training samples: in this paper, the training samples of the proposed WSCN are fewer than 1/10 of those used by the previous A-ConvNets.
The time consumption of the proposed method mainly comprises three parts: feature extraction, feature dimension reduction, and classification. The experiments were conducted in MATLAB 2015b on an Ubuntu 14.04 operating system, on a computer with an Intel Core i7-5930K CPU and 128 GB of memory. The experiment under SOC can be finished in 23 min. The computing time is 0.062 s per image for scattering feature extraction and 0.207 s per image for dimension reduction; classifying all 2425 test images costs only 0.172 s. It should be noted that the classifier can be trained offline, which significantly reduces the time cost. Furthermore, the roto-translation scattering of the SAR images and the feature dimension reduction were implemented in MATLAB code, which could be further optimized in more efficient programming languages.