1. Introduction
With the expansion of the modern construction industry and advances in material technology, many metal components are used in buildings and domestic appliances. In the event of a fire, investigators hope to find important clues at the scene, and metal components are a valuable source of evidence: they survive the fire owing to their incombustibility, and distinctive marks are left on their surfaces by the physical and chemical changes that occur during heating. The conditions of a fire scene are complicated, and the marks on metal components are influenced by heating temperature, heating duration, cooling mode, etc. These attributes are important in fire science because they help to analyze the location, source, and development of a fire: different attribute conditions result in different oxidation reactions on the metal surface. Since every fire scene is distinctive and very hard to reconstruct, it is sensible to recognize the attributes of heated metal from its mark image.
Traditionally, human experts analyze heated metal attributes using knowledge of physics and chemistry.
Table 1 summarizes the inspection standard for traces and physical evidence from fire scenes (a national standard of the People's Republic of China) [1]. It essentially relates the color of heated metal to its heating temperature. Changes in metallographic structure caused by heating temperature were studied by Wu et al. [2,3]. Macro-inspection and micro-analysis were used to record attribute values on object surfaces by Xu et al. [4]. Stereo microscopes and electron microscopes have been used to detect changes in the chemical composition and structure of Zn-Fe when heated. However, these methods have two main drawbacks: (1) they rely on human experts for qualitative analysis; and (2) they are impractical to implement because they offer little automation. To solve these problems, this paper constructs the correlation between heated metal mark images and their attributes using machine learning, a completely data-driven approach.
Image recognition is a classical problem that has been studied for decades in the computer vision and machine learning fields. An image is first represented by feature vectors, and a classifier is then learned in the feature space from training data.
Feature extraction and representation play key roles, and many works have been published. To deal with image translation, scale variation, rotation, illumination change, and distortion, many expert-designed features have been proposed.
The SIFT (scale-invariant feature transform) descriptor was introduced by Lowe [5,6]. It expresses the gradient directions of a local image region and encodes an image patch as a 128-D feature vector.
HOG (histograms of oriented gradients) was proposed by Dalal and Triggs [7]. It is computed from a group of gradient orientation histograms over image sub-regions, and the block and cell sizes determine the dimension of the HOG descriptor. To improve the efficiency of SIFT, SURF (speeded-up robust features) was proposed by Bay et al. [8]. Scale-space extrema of the determinant of the Hessian matrix are used to detect interest points, and the SURF feature is computed with Haar wavelets.
The LBP (local binary patterns) descriptor was introduced by Wang et al. [9]. It defines an 8-bit number recording the difference between a pixel and its eight neighbors, and the frequencies of all 8-bit numbers are counted to represent the feature vector of a local image region. Based on these basic feature extraction and representation methods, various improvements have been proposed, including global feature organization methods. The BoVW (bag of visual words) model, proposed by Li et al. [10], is one of the most widely adopted: each local patch of an image is mapped to a clustered visual word, and the histogram of visual word frequencies represents the whole image.
SPM (spatial pyramid matching) was proposed by Grauman and Darrell [11]. It improves BoVW by dividing an image into multiple resolution levels.
Since 2012, with the advent of large-scale labeled datasets and GPUs, deep learning, especially convolutional neural networks (CNNs), has achieved great success. A CNN is essentially a multi-layered neural network with cascaded nonlinear processing units for feature extraction and representation. Its excellent performance derives from: (1) a complex model representation with millions of parameters; and (2) completely automatic model optimization and adjustment. LeCun et al. [12,13] designed LeNet, a successful small-scale CNN model used in handwritten mail zip code recognition. A medium-scale CNN, AlexNet, proposed by Krizhevsky et al. [14], won the ImageNet 2012 competition with a significant improvement over non-deep-learning methods. More powerful models, e.g., ZFNet, VGGNet, GoogLeNet, and ResNet, were designed successively [15,16,17,18]. They improve CNNs by using more layers, small convolutional filters, flexible convolutional filter sizes, combined model width and depth, and optimized and robust training methods. ResNet achieved an excellent Top-5 error (3.57%) and outperformed humans for the first time in the ImageNet 2015 classification task.
Some relevant works have been reported. A rail surface defect recognition method based on a CNN model was introduced by Faghih-Roohi et al. [19]. The model contains three convolutional layers, three max-pooling layers, and two fully connected layers; 24,408 object images were collected and labeled. A bearing fault diagnosis method for 10-type fault classification based on CNNs and an improved Dempster–Shafer algorithm was proposed by Li et al. [20]. The CNN model in this work contains only three convolutional layers and one fully connected layer; through model ensembling, the final result combines various pieces of evidence. A steel defect area characterization method utilizing a magnetic multi-sensor matrix transducer was proposed by Psuj [21]. The basic model has three convolutional layers, three max-pooling layers, and one fully connected layer, and three combined models are adopted for classification; in total, 35,000 simulated images were generated. A hot-rolled steel sheet surface defect classification method was designed by Zhou et al. [22], with eight surface defect types defined and 14,400 sample images used. A civil infrastructure damage detection method was designed by Cha et al. [23]; the model structure is the same as in Ref. [21], small patches are cut and manually annotated as cracked or intact, and the dataset contains 40,000 sample images. A structural surface damage detection method based on Faster R-CNN is proposed in Ref. [24]; five types of surface damage are defined, a pretrained CNN is selected as the backbone model, and 2366 images are collected as the dataset. These works are similar to ours. However, they only address specific applications, and simple CNN structures are used: models with fewer than eight layers are used in Refs. [19,20,21,22,23], and Faster R-CNN is adopted in Ref. [24]. The CNN structures are relatively small, and state-of-the-art models are not adopted. Moreover, complete experimental evaluation and analysis are needed, covering training parameters, various CNN structures, etc.
Our previous work performed a case study on heated metal attribute recognition based on CNNs [25]. We analyzed and selected seven heated metal attributes. A raw image set was generated with special capture devices (vacuum resistance furnace, muffle furnace, and gasoline burner; a test chamber with constant temperature and humidity; and a microscope), and a benchmark dataset was organized (900 image samples, each labeled with the seven attributes). The relationship between attributes and mark images was trained with several state-of-the-art CNN models, and experimental evaluations were conducted over various model structures, batch sizes, data augmentation settings, and training algorithms. This paper continues that research. In this study, the benchmark dataset was first expanded with a further 900 images. Then, compressed CNN models were analyzed to increase model efficiency. Because a heated metal mark image carries seven attributes, a multi-label training based model was devised to accomplish the recognition task in a single pass. Moreover, the compressed CNN models were deployed on Android platforms. Finally, experiments were evaluated from various aspects.
The main contributions of this paper are threefold: (1) the benchmark image dataset was further expanded (doubled); (2) compressed CNN models were adopted, and a new multi-label based training method was proposed; and (3) the models were deployed and tested on Android platforms.
4. Methodology
In our study, deep learning models were deployed on mobile or embedded products. The importance of this lies in two facts. (1) It is practical for an expert to investigate a fire scene using mobile intelligent equipment instead of a bulky server in the lab, and on-site investigation is often the better choice: we want to recognize the attributes of heated metal marks while disturbing the fire scene as little as possible, since the marks may change if the metal is taken back to the lab. (2) Many applications are very sensitive to program response time; even a small delay in service response has a significant impact on users. As more and more applications build their core functions on deep learning models, low-latency inference becomes increasingly important, whether models are deployed in the cloud or on the mobile side.
One way to address this is to perform model inference on high-performance cloud servers and transfer model inputs and outputs between clients and servers. However, this solution poses many problems, such as high computing costs, massive data migration over mobile networks, user privacy risks, and increased latency. Model compression technology offers an alternative for these scenarios, requiring fewer resources to perform inference. This was the focus of our research. Key technologies, top compressed CNN models, and the proposed multi-label classification method are described in this section.
4.1. Technologies for Model Compression
4.1.1. Weight Pruning
Network weight pruning based methods explore the redundancy in model parameters and try to remove noncritical ones. Weight pruning removes redundant parameters from the network entirely, so that computations for pruned weights can even be skipped.
Srinivas and Babu [26] explored a data-free pruning method. Han et al. [27] proposed a method to reduce the total number of parameters and operations. In Ref. [28], all convolutional filters are ranked by $\ell_1$-norm at each pruning iteration, and the m filters with the smallest values are deleted. Anwar et al. [29] adopted N particle filters for N convolutional layers: each convolutional unit is scored according to its accuracy on a small validation dataset, and the lowest-scoring ones are removed. Pruning is treated as a combinatorial optimization problem in Ref. [30]. In Ref. [31], each sparse convolutional layer is computed with a few convolution kernels followed by a sparse matrix multiplication. Lebedev and Lempitsky [32] imposed group sparsity constraints on convolutional filters to prune entries of the convolution kernels in a group-wise fashion. In Ref. [33], a group-sparse regularizer on neurons is introduced during the training stage to learn compact CNNs with fewer filters. The method in Ref. [34] adds a structured sparsity regularizer on each layer to reduce trivial filters, channels, or even layers. In filter-level pruning, all of the aforementioned works use $\ell_{2,1}$-norm regularizers.
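To make the filter-level idea concrete, the following minimal sketch (our own illustration, not code from the cited works) performs $\ell_1$-norm filter ranking in the spirit of Ref. [28]; the TensorFlow-style weight layout (kh, kw, in, out) is an assumption:

```python
import numpy as np

def prune_filters_l1(conv_weights, m):
    """Rank the filters of one conv layer by l1-norm and drop the m weakest.

    conv_weights has shape (kh, kw, in_ch, out_ch), one 3-D filter per
    output channel (TensorFlow layout). Returns the pruned weights and
    the indices of the filters that were kept.
    """
    norms = np.abs(conv_weights).sum(axis=(0, 1, 2))  # l1-norm per filter
    keep = np.sort(np.argsort(norms)[m:])  # drop m smallest, keep original order
    return conv_weights[..., keep], keep

# Toy usage: 16 random 3x3 filters over 8 input channels, prune 4 of them.
w = np.random.randn(3, 3, 8, 16).astype(np.float32)
w_pruned, kept = prune_filters_l1(w, m=4)
print(w.shape, "->", w_pruned.shape)  # (3, 3, 8, 16) -> (3, 3, 8, 12)
```

In a full pipeline, the pruned model would then be fine-tuned for a few epochs to recover accuracy.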
4.1.2. Quantization and Sharing
Network weight quantization compresses the model by reducing the number of bits required to represent each weight. It divides continuously varying data into discrete values and assigns each datum to a fixed value. For example, if weights are stored as 32-bit floating-point numbers and we want to represent each weight with one of 100 quantized values, then a 7-bit representation is sufficient.
Generally, K-means clustering is a simple and convenient solution for quantizing CNN weights [35], as shown in Equation (2). $C = \{c_1, c_2, \ldots, c_k\}$ denotes the cluster centers to be computed, and $w_i$ denotes an original weight. The objective function minimizes the squared error between each weight and the center it is assigned to:

$$\min_{C} \sum_{i=1}^{n} \min_{1 \le j \le k} (w_i - c_j)^2 \qquad (2)$$

As a result, each weight $w_i$ is quantized to one cluster center. If the number of cluster centers is set to $k$, then $\lceil \log_2 k \rceil$ bits are sufficient to represent each weight value. Vanhoucke et al. [36,37] proposed 8-bit quantization and 16-bit fixed-point representation, which brought significant speedup and reduced memory usage with only a small loss in accuracy. There are also many methods that directly train CNNs with binary weights, e.g., BinaryConnect [38], BinaryNet [39], and XNOR-Networks [40]; the main idea is to learn binary weights or activations directly during model training. The method in Ref. [41] reduces the precision of weights to ternary values. A HashedNets model was proposed in which a low-cost hash function groups weights into hash buckets for sharing [42]. In Ref. [43], a simple regularization method based on soft weight-sharing was proposed.
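A minimal sketch of the K-means weight quantization of Equation (2), using plain Lloyd iterations over scalar weights (our own illustration; a real pipeline would typically use an optimized K-means implementation):

```python
import numpy as np

def quantize_weights_kmeans(weights, k, iters=20):
    """Cluster scalar weights into k centers (Lloyd's algorithm) and
    replace each weight by its nearest center, following Equation (2)."""
    w = weights.ravel()
    centers = np.linspace(w.min(), w.max(), k)  # simple linear initialization
    for _ in range(iters):
        assign = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = w[assign == j].mean()
    assign = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
    # Only the k-entry codebook plus ceil(log2(k))-bit indices need storing.
    return centers[assign].reshape(weights.shape), centers, assign

w = np.random.randn(3, 3, 8, 16).astype(np.float32)
w_q, codebook, idx = quantize_weights_kmeans(w, k=100)
print("distinct values after quantization:", np.unique(w_q).size)  # <= 100
```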
4.1.3. Matrix Factorization
Tensor factorization is a commonly used method to reduce time complexity. It is usually based on low-rank approximation theory: a high-dimensional tensor can be approximated by combinations of low-rank (e.g., one-dimensional) tensor products.
Lebedev et al. [44] proposed a canonical polyadic (CP) decomposition based method that decomposes one network layer into five layers of low complexity; the optimal solution was hard to compute with SGD (stochastic gradient descent) based weight fine-tuning. Denton et al. [45] exploited the redundancy of convolutional layers and devised a tensor decomposition method: two-dimensional tensor decomposition is treated as singular value decomposition, and three-dimensional tensor decomposition is reduced to the two-dimensional case.
Zhang et al. [46] used singular value decomposition (SVD) for the parameter matrix and proposed a nonlinear, non-SGD optimization method; the cumulative reconstruction error of previous layers is considered in the asymmetric reconstruction. Jaderberg et al. [47] used rank-1 convolutional filters to generate M independent basic feature maps, so that K × K convolutional filters can be decomposed into 1 × K and K × 1 filters; the output is linearly reconstructed with learned weights. Tai et al. [48] proposed a method for training low-rank constrained networks, in which a global optimizer performs the matrix factorization and the redundancy of convolutional filters is reduced. Kim et al. [49] proposed a model with one or more tensor-train layers; tensors are trained for compression, and the filters are generated based on SVD approximation. Exploiting redundancy within and among channels, sparse decomposition was conducted on channels [31]: the costly convolutional operation is transformed into matrix multiplication, and the matrix is then sparsified with a regularization term.
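As a minimal illustration of the low-rank idea shared by these methods (our own sketch, not the exact procedure of any cited work), a weight matrix can be replaced by two thin factors obtained from a truncated SVD:

```python
import numpy as np

def factorize_dense(W, rank):
    """Approximate an (out, in) weight matrix by two rank-r factors.
    Storage and multiply cost drop from out*in to r*(out + in)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # (out, r), singular values folded in
    B = Vt[:rank, :]            # (r, in)
    return A, B

W = np.random.randn(1024, 512).astype(np.float32)
A, B = factorize_dense(W, rank=64)
x = np.random.randn(512).astype(np.float32)
rel_err = np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x)
print(A.shape, B.shape, f"relative error {rel_err:.3f}")
```

In practice, the factorized layers are fine-tuned after decomposition to recover most of the lost accuracy.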
Among these model compression methods, matrix factorization based methods are the most widely used. Many top compressed CNN models build on this idea, as illustrated in the next subsection.
4.2. Top Compressed CNNs Models
In this subsection, the state-of-the-art compressed CNN models used in our study are introduced.
4.2.1. MobileNet
MobileNet was first designed for mobile and embedded vision applications. It is built primarily from depthwise separable convolution operations [50], which factorize a standard convolution into a depthwise convolution and a pointwise convolution. The depthwise convolution applies a single filter to each input channel, and the pointwise convolution then combines the outputs through a linear combination.
A standard convolution operation has the following computational cost:

$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F \qquad (3)$$

where $M$ and $N$ are the numbers of input and output channels, $D_K$ is the size of the filters, and $D_F$ is the size of the feature map. MobileNet splits this into two separate operations, one for filtering and one for combining; batch normalization and ReLU nonlinearity are used in each layer. The depthwise convolution operation has the following computational cost:

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F \qquad (4)$$

A linear combination of the depthwise convolution outputs via 1 × 1 convolution is needed to generate new features, so the total computational cost of a depthwise separable convolution is:

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F \qquad (5)$$

The resulting reduction in computational cost is:

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2} \qquad (6)$$

Moreover, to make the model smaller and faster, two hyper-parameters are introduced: a width multiplier $\alpha$ and a resolution multiplier $\rho$, which control the ratio of reduced channels and the size of reduced feature maps, respectively. The computational cost of the depthwise separable convolution with parameters $\alpha$ and $\rho$ can be further expressed as:

$$D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F \qquad (7)$$
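For reference, a depthwise separable block of this kind takes only a few lines in Keras (the library used in our experiments); this is a minimal sketch following Equations (4) and (5), with plain ReLU standing in for the ReLU6 used in the original paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

def depthwise_separable_block(x, n_filters, stride=1):
    """MobileNet-style block: 3x3 depthwise conv, then 1x1 pointwise conv,
    each followed by batch normalization and a ReLU nonlinearity."""
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same",
                               use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(n_filters, 1, use_bias=False)(x)  # pointwise combine
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return x

inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.Conv2D(32, 3, strides=2, padding="same")(inputs)
x = depthwise_separable_block(x, 64)
model = tf.keras.Model(inputs, x)
model.summary()
```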
MobileNetV2 was proposed as a further improvement. It is constructed with inverted residuals and linear bottlenecks, which reduce the number of parameters and the information loss in activation operations. Combined with the single shot detector lite (SSDLite) for object detection, MobileNetV2 is reported to be faster than MobileNetV1, requiring 20× less computation and 10× fewer parameters than YOLOv2.
4.2.2. SqueezeNet
SqueezeNet was proposed to preserve accuracy with few parameters [51]. A novel building block, the Fire module, is used as its core structure.
Three main strategies are adopted to construct the network. First, 3 × 3 filters are replaced with 1 × 1 filters, which makes the number of parameters 9× smaller. Second, the number of input channels to 3 × 3 filters is decreased: the number of parameters in one standard layer can be represented as $M \cdot N \cdot K \cdot K$, where $M$ is the number of input channels, $N$ is the number of filters, and $K$ is the filter size, so a squeeze layer is introduced to reduce $M$ and further decrease the total number of parameters. Third, downsampling is performed late in the network: layers have small activation feature maps if their stride is larger than 1, and larger activation feature maps can lead to higher accuracy.
To implement these strategies, the Fire module was designed, consisting of a squeeze layer and an expand layer. The squeeze layer has only 1 × 1 convolution filters (Strategy 1), while the expand layer has a mix of 1 × 1 and 3 × 3 convolution filters. Three hyperparameters are set: $s_{1\times1}$, $e_{1\times1}$, and $e_{3\times3}$, which represent the number of 1 × 1 filters in the squeeze layer and the numbers of 1 × 1 and 3 × 3 filters in the expand layer, respectively. $s_{1\times1}$ is set to be less than $e_{1\times1} + e_{3\times3}$, so the squeeze layer limits the number of input channels to the expand layer (Strategy 2). The model is constructed by stacking many Fire modules; the number of filters per module increases gradually, and max-pooling with stride 2 is performed at certain intervals (Strategy 3).
The evaluation demonstrates that the SqueezeNet architecture has 50× fewer parameters than the original AlexNet while maintaining AlexNet-level accuracy on ImageNet. Building on SqueezeNet, some works implement it on field programmable gate arrays (FPGAs), where the model parameters can be stored entirely on-chip with no need to access off-chip storage.
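A Fire module is likewise compact to express; the following minimal Keras sketch (ours) uses a squeeze/expand setting from the paper's first Fire module ($s_{1\times1}=16$, $e_{1\times1}=e_{3\times3}=64$):

```python
import tensorflow as tf
from tensorflow.keras import layers

def fire_module(x, s1x1, e1x1, e3x3):
    """SqueezeNet Fire module: a 1x1 squeeze layer feeding parallel
    1x1 and 3x3 expand layers whose outputs are concatenated.
    Expect s1x1 < e1x1 + e3x3 so the squeeze layer limits input channels."""
    squeeze = layers.Conv2D(s1x1, 1, activation="relu", padding="same")(x)
    expand_1 = layers.Conv2D(e1x1, 1, activation="relu", padding="same")(squeeze)
    expand_3 = layers.Conv2D(e3x3, 3, activation="relu", padding="same")(squeeze)
    return layers.Concatenate()([expand_1, expand_3])

inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.Conv2D(96, 7, strides=2, activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D(3, strides=2)(x)
x = fire_module(x, s1x1=16, e1x1=64, e3x3=64)
model = tf.keras.Model(inputs, x)
```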
4.2.3. ShuffleNet
ShuffleNet was proposed by Zhang et al. [52]. In this method, pointwise group convolutions are first used to reduce the cost of dense 1 × 1 convolutions. Then, a novel channel shuffle operation is designed to overcome the side effect of group convolution, helping information flow across different feature channels.
Group convolution is an effective way to significantly reduce computation cost. However, each output is derived only from certain input channels; this blocks feature exchange among channel groups, and the optimal representation cannot be obtained. The authors therefore proposed a channel shuffle operation to connect input and output channels: for a convolutional layer with g groups whose output has g × n channels, the output is reshaped into (g, n), transposed, and flattened as the input of the next layer.
A ShuffleNet unit is formed from a 1 × 1 pointwise group convolution layer followed by a channel shuffle operation layer, and the ShuffleNet architecture is mainly built by stacking such units. This structure has lower computational cost under the same settings: given an input of size c × h × w with m bottleneck channels, a ResNet unit requires hw(2cm + 9m²) floating-point operations (FLOPs) and a ResNeXt unit requires hw(2cm + 9m²/g) FLOPs, while a ShuffleNet unit needs only hw(2cm/g + 9m) FLOPs.
It is reported that, compared with the MobileNet architecture, ShuffleNet obtains superior performance, an absolute 7.8% reduction in ImageNet Top-1 error at a budget of about 40 million FLOPs (MFLOPs). The speedup on hardware has also been tested: with comparable accuracy, ShuffleNet achieves an actual 13× speedup over AlexNet on an off-the-shelf ARM-based device.
In the latest version, ShuffleNetV2, a channel split operation is proposed. The input feature channels are first split into two branches. One branch is kept as identity, while the other passes through a 1 × 1 convolution, a 3 × 3 depthwise separable convolution, and another 1 × 1 convolution. The two branches are then concatenated and a channel shuffle operation is applied, after which the next unit repeats the pattern.
The report demonstrates that ShuffleNetV2 is about 40% faster than ShuffleNetV1 and about 16% faster than MobileNetV2. At 500 MFLOPs, ShuffleNetV2 is 58% faster than MobileNetV2 and 63% faster than ShuffleNetV1.
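The channel shuffle operation itself is just a reshape, transpose, and reshape; a minimal TensorFlow sketch (ours, assuming channels-last layout):

```python
import tensorflow as tf

def channel_shuffle(x, groups):
    """ShuffleNet channel shuffle: view channels as (groups, n), swap the
    two axes, and flatten back so information flows across groups."""
    _, h, w, c = x.shape
    assert c % groups == 0
    x = tf.reshape(x, [-1, h, w, groups, c // groups])
    x = tf.transpose(x, [0, 1, 2, 4, 3])  # swap group and per-group channel axes
    return tf.reshape(x, [-1, h, w, c])

x = tf.random.normal([2, 28, 28, 12])
y = channel_shuffle(x, groups=3)
print(y.shape)  # (2, 28, 28, 12)
```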
4.3. Multi-Label Classification
For an input heated metal mark image, we aimed to recognize its attributes of metal type, heating mode, heating temperature, heating duration, cooling mode, placing duration, and relative humidity. Each attribute could be trained with its own model, giving seven separate models in total; however, this is inefficient in both computation time and storage space, even with compressed models. In this study, a multi-label classification method was adopted: the seven attributes are recognized in a single pass with one unified CNN model.
Figure 2 gives the basic procedure. Each attribute $a_i$ was represented in one-hot encoding. Two-, four-, four-, four-, two-, three-, and two-dimensional feature vectors were encoded for attributes $a_1$–$a_7$, respectively. All attributes shared the same backbone network model. All outputs were tiled into one vector, and the ground truth labels were concatenated into the same pattern. Finally, the objective function was formulated as follows:

$$L(x) = \sum_{i=1}^{7} \lambda_i L_i(x) \qquad (8)$$

As shown in Equation (8), $x$ represents a training image sample. The total loss is composed of seven sub-losses $L_i$, each corresponding to one attribute; $\lambda_i$ is a weight parameter, and cross-entropy is adopted for each $L_i$.
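A sketch of this multi-head training setup in Keras follows (our own minimal illustration; the MobileNet backbone, equal loss weights $\lambda_i = 1$, and head names are assumptions for concreteness):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Class counts for attributes a1..a7 as described above.
ATTR_CLASSES = [2, 4, 4, 4, 2, 3, 2]

def build_multilabel_model():
    """Shared compressed backbone with one softmax head per attribute;
    the total loss is the weighted sum of seven cross-entropy terms,
    as in Equation (8)."""
    backbone = tf.keras.applications.MobileNet(
        input_shape=(224, 224, 3), include_top=False, pooling="avg")
    features = backbone.output
    heads = [layers.Dense(n, activation="softmax", name=f"a{i+1}")(features)
             for i, n in enumerate(ATTR_CLASSES)]
    return tf.keras.Model(backbone.input, heads)

model = build_multilabel_model()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=["categorical_crossentropy"] * 7,   # one cross-entropy per head
    loss_weights=[1.0] * 7,                  # the lambda_i weights (assumed equal)
    metrics=["accuracy"])
```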
5. Experimental Evaluation
5.1. Experiment Setup
The performance of heated metal mark attribute recognition with compressed CNN models was evaluated on the generated benchmark dataset. Python was used as the programming language, TensorFlow was adopted as the deep learning framework, and Keras was used as the model library. All experiments were run on a PC with an Intel i5 8-series CPU, 32 GB RAM, an Nvidia GTX Titan Xp 12 GB GPU, and Ubuntu OS.
5.2. Recognition Accuracy Evaluation
Recognition accuracy was used to evaluate the recognition performance on each attribute:

$$R_{a_i} = \frac{N_{a_i}^{c}}{N_{a_i}} \qquad (9)$$

As shown in Equation (9), $R_{a_i}$ is the recognition accuracy for attribute $a_i$, $N_{a_i}$ is the number of testing samples containing $a_i$, and $N_{a_i}^{c}$ is the number of samples whose attribute $a_i$ is correctly recognized. We divided the dataset into six subgroups with attribute values equally distributed. Five randomly chosen subgroups (1500 image samples) were used for training, and the remaining subgroup (300 image samples) was used for testing. The results were obtained by averaging five independent tests.
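In code, Equation (9) is simply the fraction of correct predictions per attribute; a minimal sketch (ours):

```python
import numpy as np

def attribute_accuracy(y_true, y_pred):
    """Equation (9): fraction of test samples whose attribute value
    is predicted correctly. y_true, y_pred: integer class ids."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true == y_pred).mean()

print(attribute_accuracy([0, 1, 1, 2], [0, 1, 0, 2]))  # 0.75
```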
MobileNet, SqueezeNet, and ShuffleNet were used as the backbone compressed CNN models for evaluation. For model input, the sample image size was set to 224 × 224 × 3 pixels. The number of epochs was set to 50 and the batch size to 32. Adam was used as the optimization method, with an initial learning rate of 0.001 and a momentum of 0.9. Dropout was set to 0.2.
The average recognition accuracies are shown in Table 3. The CNN model structure and the data augmentation setting are listed in the first and second columns, respectively. For data augmentation, commonly used transformations were adopted, including random cropping, vertical and horizontal flipping, and perturbations of brightness, saturation, hue, and contrast. When a model was trained with data augmentation, a fixed proportion of the training images in each batch was augmented; otherwise, no augmentation was applied. With data augmentation, the best accuracies were 0.803 for metal type, 0.837 for heating mode, 0.825 for heating temperature, 0.812 for heating duration, 0.883 for cooling mode, 0.817 for placing duration (achieved by two models), and 0.894 for relative humidity. The overall best performer among the three compressed models is indicated in Table 3.
We found that models trained with data augmentation performed better than those without it, by about 2% accuracy. It can be concluded that data augmentation is an effective way to train better CNN models, especially for large-scale CNNs with huge numbers of parameters and insufficient training data.
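A sketch of the augmentation pipeline described above, written with TensorFlow image ops (our own illustration, assuming eager execution; the jitter ranges and per-image probability p are assumptions, not the exact values used in training):

```python
import tensorflow as tf

def augment(image, p=0.5):
    """Apply random crop, flips, and photometric jitter to a fraction p
    of training images; the rest pass through unchanged."""
    if tf.random.uniform([]) > p:
        return image
    image = tf.image.resize(image, [256, 256])
    image = tf.image.random_crop(image, [224, 224, 3])
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_brightness(image, 0.2)
    image = tf.image.random_saturation(image, 0.8, 1.2)
    image = tf.image.random_hue(image, 0.05)
    image = tf.image.random_contrast(image, 0.8, 1.2)
    return image
```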
Figure 3 shows misclassified sample images; each row corresponds to an attribute. The red text gives the ground truth label, while the yellow text gives the predicted result. Galvanized steel and cold rolled steel normally show different degrees of corrosion under the same experimental conditions, yet the misclassified metal type samples showed similar corrosion degrees. For heating temperature, higher temperatures lead to more corrosion and rougher texture, and the misclassified samples came from adjacent temperature intervals. Such situations are general causes of error for several of the attributes. For the remaining attributes, the reason for misclassification is hard to describe, even for a field professional.
Commonly used large-scale datasets mainly contain natural scenes, animals, etc. These are easy for humans to distinguish, and the differences are easy to explain visually. The heated metal mark image we studied is a special kind of object whose marks originate from complex physical and chemical reactions. Moreover, the benchmark dataset we generated inevitably contains noise, which may influence model performance. Further research with the help of other professionals is needed to explore the underlying principles.
5.3. Batch Size Evaluation
Training with different batch sizes (8, 16, 32, and 48) was evaluated.
Figure 4 shows model accuracy versus training epoch, using the average accuracy over the seven attributes. Data augmentation was used, and the other parameters were set as in the experiments of Section 5.2.
As shown in the figure, all models converged after about 40 training epochs. Models trained with batch size 32 obtained the best performance, outperforming the other settings by a small margin. Training curves with larger batch sizes were more stable and smooth, while those with smaller batch sizes fluctuated more. This is reasonable: with larger batch sizes, the gradient descent direction is computed more accurately and training is gentler, whereas smaller batch sizes introduce more randomness, making it harder to achieve optimal performance.
5.4. Single Label Model vs. Multi-Label Model
The plain way to train models is to train an independent CNN model for each attribute, referred to here as the single label model. Comparisons between the single label and multi-label models were evaluated: a single label model was trained separately for each attribute $a_i$. The results are shown in Table 4.
The results show that models with single label training performed better than those with multi-label training, by roughly 1% or slightly more. There were some divergences across attributes, but the overall trends were consistent.
With multi-label training, model parameters are shared, and the total model size is reduced by about 7× at the cost of some performance loss. Multi-label training is not a trivial task, as there are conflicts among the training signals for different attributes. The loss scales of different attributes may differ greatly, so training cannot be equally coherent across the seven attributes, and the learning of the shared parameters is unavoidably influenced.
5.5. Compressed Model vs. Heavy Model
Different CNN models have various layer depths and widths, numbers of filters, and filter sizes and shapes, which lead to different structures, parameter counts, and complexity. Comparisons between compressed models and heavy models were evaluated: three representative heavy CNN models were selected, and the results are shown in Table 5.
It can be seen that the heavy CNN models obtained better performance than the compressed models on all attributes, with the best heavy model reaching an average accuracy of 0.854. The main reason is that heavy models contain more complex structures and more parameters, which is advantageous for feature extraction and representation. However, the performance differences between compressed and heavy models were not large, at only a few percentage points.
5.6. Running Time Evaluation
The running times of different CNN models were evaluated: the training and testing times of the three compressed models and a heavy baseline were measured for various batch sizes (8, 16, 32, and 48). Table 6 gives the experimental results.
The slowest of the three compressed CNN models cost 0.192 s, 0.368 s, 0.736 s, and 1.104 s per training iteration for batch sizes of 8, 16, 32, and 48, respectively, while the fastest took only a fraction of that time. For testing, the minimal cost was 0.0026 s per run. Compared with the heavy model, running efficiency was greatly improved with the compressed CNN models. All execution times were measured on the PC. For storage, 9.6 MB, 3.1 MB, and 5 MB were required for the three compressed models, while 94.7 MB was needed for the heavy model, which further demonstrates the space efficiency of compressed CNN models: the compressed models ran up to 10× faster than the heavy model and reduced storage space by up to 30×.
5.7. Android Devices Deployment
The models were trained on a PC server and ran properly on Linux with the TensorFlow framework. However, they could not be run directly on mobile devices; some transformation and deployment steps were needed. The compressed CNN models were deployed on Android platforms, and the corresponding performance was tested.
The file format of the CNN models on Linux was *.h5; each model was first converted into the *.pb format for deployment on Android devices. After format conversion, the file sizes of the three compressed models were 2.82 MB, 4.02 MB, and 9.06 MB, similar to their PC formats.
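The conversion follows the standard freeze-graph recipe of TensorFlow 1.x; a minimal sketch (ours; file names are illustrative):

```python
# Minimal sketch of the Keras *.h5 -> frozen *.pb conversion
# (TensorFlow 1.x style, as used at the time of this work).
import tensorflow as tf
from tensorflow.python.framework import graph_util

tf.keras.backend.set_learning_phase(0)  # inference mode, no dropout/BN updates
model = tf.keras.models.load_model("model.h5")

sess = tf.keras.backend.get_session()
frozen = graph_util.convert_variables_to_constants(
    sess, sess.graph_def,
    [out.op.name for out in model.outputs])  # bake weights into the graph
tf.train.write_graph(frozen, ".", "model.pb", as_text=False)
```

The resulting *.pb file can then be loaded on Android through the TensorFlow inference interface.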
Table 7 gives the results of model testing on the selected Android platforms; devices based on the Snapdragon 626, Snapdragon 845, and Kirin 970 chipsets were used. As can be seen from the results, the mobile devices showed good efficiency and could execute an inference within tens of milliseconds, and thus could support real-time applications. The Kirin 970 obtained the best performance, executing one model in as little as 0.00076 s; this may derive from the Neural Network Processing Unit it contains.