1 Introduction

The ability to recognize materials and their texture from their visual appearance is crucial in several applications, from robot manipulation to industrial production and food recognition. While the topic has historically been widely researched in computer vision, the generalization abilities obtained so far are still not up to what would be required to move from research labs to commercial applications at large [18, 32].

The generalization problem, i.e. the experimental fact that classifiers trained on a given dataset do not perform well when tested on a new database, has received renewed attention in the visual learning community since 2012, when it was cast into the domain adaptation framework [14, 31]. Here, the key assumption is that images depicting the same visual classes, but acquired in different settings, at different times and with different devices, are generated by two related but different probability distributions. Hence, domain adaptation approaches attempt to reduce the shift between the two distributions. Although domain adaptation by its very nature is pervasive in visual recognition, to the best of our knowledge the problem has not been investigated so far in the texture classification scenario.

This paper aims at filling this gap, presenting a domain adaptation setting for material recognition and studying how different state-of-the-art non-deep domain adaptation algorithms perform in this scenario. We test all methods using shallow as well as deep features, and we compare our results with off-the-shelf classifiers that do not explicitly address the domain shift between training and test data. Our results clearly show that domain adaptation is a very real problem for the classification of textures in the wild, and that the use of domain adaptive classifiers leads to an increase in performance of up to 6.87%.

The rest of the paper is organized as follows: Sect. 2 describes the data, features and classifiers used in our benchmark evaluation. Section 3 reports our experimental findings, clearly demonstrating the presence of a domain shift in this setting and the ability of existing domain adaptation algorithms to alleviate it. We conclude the paper with an overall discussion and propose possible future research directions.

2 Materials and Methods

2.1 Databases

The effectiveness of a domain adaptation technique is evaluated by measuring the classification accuracy when training on a given database and testing on another one that contains the same texture classes. To this end, we analyzed most of the existing texture databases in order to identify those that share the highest number of texture classes. As a result of this process we found 23 classes in common between the ALOT [3] and RawFooT [11] databases, and about ten classes in common between CUReT [12] and ALOT, CUReT [12] and KTH-TIPS2b [4], and STex [16] and CUReT. For the evaluation presented in this paper we considered the 23 classes in common between ALOT and RawFooT. Examples of these 23 texture classes from both databases are displayed in Fig. 1.

Fig. 1. Examples of the 23 classes in common between ALOT (left) and RawFooT (right)

The Raw Food Texture database (RawFooT) has been specially designed to investigate the robustness of descriptors and classification methods with respect to variations in the lighting conditions [8,9,10,11]. Classes correspond to 68 samples of raw food, including various kinds of meat, fish, cereals, fruit, etc. The samples taken under D65 at light direction \(\theta =24^{\circ }\) are shown in Fig. 2. The database includes images of the 68 texture samples acquired under 46 lighting conditions, which may differ in:

  1. light direction: 24, 30, 36, 42, 48, 54, 60, 66, and 90\(^\circ \);

  2. illuminant color: 9 outdoor illuminants (D40, D45, ..., D95) and 6 indoor illuminants (2700 K, 3000 K, 4000 K, 5000 K, 5700 K and 6500 K), which we will refer to as L27, L30, ..., L65;

  3. intensity: 100%, 75%, 50% and 25% of the maximum achievable level;

  4. combinations of these factors.

Fig. 2. Overview of the 68 classes included in the Raw Food Texture database. For each class, the image taken under D65 at direction \(\theta = 24^{\circ }\) is shown.

For each of the 23 classes in common with ALOT we considered 16 patches, obtained by dividing the original texture image, which is of size 800 \(\times \) 800 pixels, into 16 non-overlapping squares of size 200 \(\times \) 200 pixels. We selected the images taken under half of the imaging conditions for training (indicated as set1, a total of 3496 images) and the remaining ones for testing (set2, a total of 3496 images). For each class we selected eight patches for training and eight for testing by following a chessboard pattern (white positions are indicated as W, black positions as B).
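
For illustration, the following Python sketch shows one possible implementation of this patch extraction and chessboard split; the texture image used here is only a placeholder array, and the function name is ours.

```python
import numpy as np

def extract_patches(image, patch_size=200):
    """Divide a square texture image into non-overlapping patch_size x patch_size patches
    and label each patch with its chessboard position ('W' or 'B')."""
    h, w = image.shape[:2]
    patches, positions = [], []
    for r in range(0, h - patch_size + 1, patch_size):
        for c in range(0, w - patch_size + 1, patch_size):
            patches.append(image[r:r + patch_size, c:c + patch_size])
            # Chessboard label: 'W' if the grid row+column index is even, 'B' otherwise
            positions.append('W' if ((r // patch_size) + (c // patch_size)) % 2 == 0 else 'B')
    return patches, positions

# An 800x800 RawFooT texture yields 16 patches, 8 at W positions and 8 at B positions
image = np.zeros((800, 800, 3), dtype=np.uint8)  # placeholder for a loaded texture image
patches, positions = extract_patches(image)
train_patches = [p for p, pos in zip(patches, positions) if pos == 'W']
test_patches = [p for p, pos in zip(patches, positions) if pos == 'B']
assert len(train_patches) == 8 and len(test_patches) == 8
```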

Fig. 3. The 250 classes of the ALOT database. (Color figure online)

The Amsterdam Library of Textures (ALOT) is a color image collection of 250 rough textures. In order to capture the sensory variation in object recordings, the authors systematically varied the viewing angle, illumination angle, and illumination color for each material. This collection is similar in spirit to the CUReT collection [3]. Examples from the 250 classes are displayed in Fig. 3.

The textures were placed on a turntable, and recordings were made for aspects of 0, 60, 120, and 180\(^\circ \). Four cameras were used: three perpendicular to the light bow, at 0\(^\circ \) azimuth and 80, 60, and 40\(^\circ \) altitude, and one mounted at 60\(^\circ \) azimuth and 60\(^\circ \) altitude. Combined with five illumination directions and one semi-hemispherical illumination, a sparse sampling of the BTF is obtained.

Each object was recorded with only one out of five lights turned on, yielding five different illumination angles. Furthermore, turning on all lights yields a sort of hemispherical illumination, although restricted to a narrower illumination sector than a true hemisphere. Each texture was recorded with a 3075 K illumination color temperature, at which the cameras were white balanced. One additional image per camera was recorded with all lights turned on, at a reddish spectrum of 2175 K color temperature.

For each of the 23 classes shared with RawFooT, we considered 6 patches obtained by dividing the original texture image into 6 non-overlapping squares of size 200 \(\times \) 200 pixels. For each class we have 100 textures acquired under different imaging conditions. For each texture we selected three patches for training and three for testing by following a chessboard pattern (white positions are indicated as W, black positions as B). We obtained a training set of 6900 images (W positions) and a test set of 6900 images (B positions).

The evaluation is performed on each pair DB1 \(\rightarrow \) DB2:

  1. R \(\rightarrow \) A: RawFooT used for training and ALOT used for testing;

  2. A \(\rightarrow \) R: ALOT used for training and RawFooT used for testing.

For each pair DB1 \(\rightarrow \) DB2 we have 4 subsets:

  1. training using DB1: set1 at positions W; test using DB2: set2 at positions B;

  2. training using DB1: set1 at positions B; test using DB2: set2 at positions W;

  3. training using DB1: set2 at positions W; test using DB2: set1 at positions B;

  4. training using DB1: set2 at positions B; test using DB2: set1 at positions W.

This setup, even though not strictly required for this work, makes it possible to design unbiased inter-dataset experiments by excluding the possibility that the same portion of the texture samples or the same acquisition conditions are included in both the training and the test set.
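
Purely for illustration, the protocol can be summarized by enumerating the two database pairs and the four train/test subset combinations; the identifiers used in the sketch below are our own shorthand.

```python
# Enumerate the inter-dataset experiments: two pairs (R->A, A->R), each with four
# train/test combinations that never share an acquisition set (set1 vs. set2) or
# patch positions (W vs. B). Identifiers are illustrative shorthand.
pairs = [('RawFooT', 'ALOT'), ('ALOT', 'RawFooT')]
subset_combinations = [
    (('set1', 'W'), ('set2', 'B')),
    (('set1', 'B'), ('set2', 'W')),
    (('set2', 'W'), ('set1', 'B')),
    (('set2', 'B'), ('set1', 'W')),
]

experiments = []
for source_db, target_db in pairs:
    for (train_set, train_pos), (test_set, test_pos) in subset_combinations:
        experiments.append({
            'train': (source_db, train_set, train_pos),
            'test': (target_db, test_set, test_pos),
        })

assert len(experiments) == 8  # 2 pairs x 4 subsets
```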

2.2 Features

The majority of texture analysis methods entail the computation of numerical representations, called features, that capture the distinctive properties of texture images. Many features have been proposed in the literature; they have traditionally been divided into statistical, spectral, structural and hybrid [23]. Among traditional features, the most widely known are probably those based on histograms, Gabor filters [2], co-occurrence matrices [15], and Local Binary Patterns [24].

More recent works approached the problem of texture classification by using features originally designed for scene and object recognition. For instance, Sharan et al. [27] used SIFT and HOG descriptors for material classification, while Sharma et al. [29] used a variation of the Fisher Vector approach for texture and face classification. Cimpoi et al. [5] showed how SIFT descriptors aggregated with the improved Fisher Vector method greatly outperform previous state-of-the-art descriptors on a variety of texture classification tasks. This direction of research further progressed with the replacement of explicitly designed image features by features automatically learned from data with deep learning methods [17]. Cimpoi et al., for instance, used Fisher Vectors to pool features computed by a convolutional neural network (CNN) originally trained for object recognition [6]. Lin and Maji used the same underlying CNN features and summarized them as Gram matrices [19]. In this work we considered three different image features: (i) Local Binary Patterns, (ii) bag of SIFT descriptors, (iii) features computed by a CNN.

Local Binary Patterns (LBP) represent one of the most widely used methods for the representation of textures [22]. LBPs are computed by thresholding the gray values in a circular neighborhood of pixels with the gray value of the central pixel. The resulting bits are arranged to form a binary representation that can be interpreted as a numeric code. The final descriptor is a histogram of the numeric codes. More in detail, we considered a neighborhood of 16 pixels at a distance of two pixels from the central one. Moreover, in forming the final histogram we considered only the “uniform” patterns, i.e. those that include at most two 0/1 transitions between adjacent bits and that, therefore, correspond to simple patterns. With this configuration, the feature vector is a histogram of 243 bins.
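
A minimal sketch of this descriptor, using scikit-image's local_binary_pattern as a stand-in for our implementation (its 'nri_uniform' mode with P=16 and R=2 yields exactly the 243 codes mentioned above):

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_image, n_points=16, radius=2):
    """Uniform (non rotation-invariant) LBP histogram: P=16, R=2 gives
    P*(P-1)+3 = 243 distinct codes, hence a 243-bin descriptor."""
    codes = local_binary_pattern(gray_image, P=n_points, R=radius, method='nri_uniform')
    n_bins = n_points * (n_points - 1) + 3
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
    return hist / hist.sum()  # L1-normalize so patches of different size are comparable

# Example on a random grayscale patch (stand-in for a 200x200 texture patch)
patch = np.random.randint(0, 256, size=(200, 200), dtype=np.uint8)
descriptor = lbp_histogram(patch)
assert descriptor.shape == (243,)
```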

One of the most successful approaches for image recognition is the bag of visual words model [7]. Within this approach, local descriptors extracted from an image are aggregated to form a histogram representing their distribution. More precisely, a codebook of visual words is formed by clustering the descriptors extracted from a set of training images. Then, given a new image, its descriptors are assigned to the closest visual word in the codebook, and the counts of descriptors assigned to each word form the final descriptor. In this work, we built a codebook of 1024 visual words by clustering the SIFT descriptors [21] extracted from a set of 20000 images from Flickr with varied content, such as sunsets, countryside, etc. The final feature vector is therefore given by the 1024 bins of the normalized histogram.
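
The encoding step can be sketched as follows, with OpenCV's SIFT and scikit-learn's MiniBatchKMeans as stand-ins for the tools actually used; the function names and the mini-batch clustering variant are illustrative choices, not a description of our exact pipeline.

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

sift = cv2.SIFT_create()

def sift_descriptors(gray_image):
    """Return the 128-D SIFT descriptors of an image (possibly empty)."""
    _, desc = sift.detectAndCompute(gray_image, None)
    return desc if desc is not None else np.empty((0, 128), dtype=np.float32)

def build_codebook(images, n_words=1024):
    """Cluster descriptors from a generic image collection into n_words visual words."""
    all_desc = np.vstack([sift_descriptors(img) for img in images])
    return MiniBatchKMeans(n_clusters=n_words, random_state=0, n_init=3).fit(all_desc)

def bow_histogram(gray_image, codebook, n_words=1024):
    """Encode an image as the normalized histogram of its visual-word assignments."""
    desc = sift_descriptors(gray_image)
    if len(desc) == 0:
        return np.zeros(n_words)
    words = codebook.predict(desc.astype(np.float32))
    hist = np.bincount(words, minlength=n_words).astype(float)
    return hist / hist.sum()

# Usage (images are grayscale uint8 NumPy arrays loaded elsewhere):
#   codebook = build_codebook(flickr_images, n_words=1024)
#   feature = bow_histogram(texture_patch, codebook)   # 1024-D descriptor
```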

For the third feature vector, we followed the approach explored by Sharif Razavian et al. [28] that consists in using the intermediate representation computed by a convolutional neural network trained for image recognition. We used the VGG-16 network model [30] trained to identify the 1000 categories of the ILSVRC image recognition challenge [26]. As a feature vector we used the activations of the 4096 units forming the last layer before the computation of the final probability estimates.
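
One possible way to extract this descriptor uses the ImageNet-pretrained VGG-16 shipped with a recent version of torchvision as a stand-in for the network referenced above (older torchvision versions use pretrained=True instead of the weights enum); this is a sketch, not our exact extraction code.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# VGG-16 pretrained on ImageNet (ILSVRC)
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Keep everything up to the last fully connected layer before the 1000-way class
# scores, so the output is the 4096-D activation vector used as the feature
feature_extractor = nn.Sequential(
    vgg.features,
    vgg.avgpool,
    nn.Flatten(),
    *list(vgg.classifier.children())[:-1],
)
feature_extractor.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def cnn_feature(pil_image):
    """Return the 4096-D CNN descriptor of a PIL image."""
    x = preprocess(pil_image).unsqueeze(0)   # add a batch dimension
    return feature_extractor(x).squeeze(0).numpy()
```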

2.3 Domain Adaptation Classifiers

We considered several domain adaptation methods:

Geodesic Flow Kernel (GFK): this method embeds the source and target datasets in a Grassmann manifold, modeling the data of each domain with a linear subspace, and then constructs a geodesic flow between the two points, integrating an infinite number of subspaces along the flow. The geodesic flow represents incremental changes in the geometric and statistical properties between the two domains. The features are projected onto these subspaces to form infinite-dimensional feature vectors, and the inner product between these feature vectors defines a kernel function that can be computed over the original feature space [14]. GFK is one of the most widely used domain adaptation methods in the literature; recent work showed that, when used on top of deep features, it is competitive with several deep domain adaptation approaches.

Subspace Alignment (SA): here, using PCA, we select for each domain the d eigenvectors corresponding to the d largest eigenvalues. These eigenvectors are used as bases of the source and target subspaces, and the source and target data are projected onto their respective subspaces. A transformation matrix mapping the source subspace to the target one is then learned. This allows comparing the source domain data directly to the target domain data, and building classifiers on source data and applying them to the target domain. The advantages of Subspace Alignment are the robustness of the classifier, which is not affected by local perturbations, and the absence of regularization parameters [13].
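
The core of SA reduces to a few lines of linear algebra; the sketch below, built on scikit-learn's PCA and a linear SVM, is only an illustration of the method and not the exact pipeline used in our experiments.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def subspace_alignment(X_source, y_source, X_target, d=150):
    """Minimal Subspace Alignment: align the d-dimensional PCA basis of the source
    to that of the target, train a linear classifier on the aligned source data,
    and predict the target labels."""
    Xs = PCA(n_components=d).fit(X_source).components_.T   # D x d source basis
    Xt = PCA(n_components=d).fit(X_target).components_.T   # D x d target basis
    M = Xs.T @ Xt                                          # d x d alignment matrix
    source_aligned = X_source @ Xs @ M                     # project source, then align
    target_projected = X_target @ Xt                       # project target onto its own basis
    clf = LinearSVC(C=1.0).fit(source_aligned, y_source)
    return clf.predict(target_projected)
```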

Landmark-based Kernelized Subspace Alignment (LSSA): both methods described above also have some limitations. In the GFK algorithm, the search for the subspaces that lie on the geodesic flow is computationally costly and subject to local perturbations. The SA algorithm assumes that the shift between the two distributions can be corrected by a linear transformation, while in most cases only a subset of the source data is distributed similarly to the target domain. The LSSA algorithm therefore proposes: (i) selecting landmarks extracted from both domains so as to reduce the discrepancy between the source and target distributions, (ii) projecting the source and the target data onto a shared space using a Gaussian kernel with respect to the selected landmarks, (iii) learning a linear mapping function to align the source and target subspaces. This is done by simply computing inner products between source and target eigenvectors [1].

Transfer Component Analysis (TCA): this method tries to learn a set of transfer components across domains in a Reproducing Kernel Hilbert Space (RKHS) using the Maximum Mean Discrepancy (MMD). TCA is a dimensionality reduction method for domain adaptation such that, in the latent space spanned by the learned components, the variance of the data is preserved as much as possible while the distance between the distributions of the two domains is reduced [25].
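
For reference, a minimal linear-kernel TCA can be sketched as follows; the eigen-decomposition follows the closed-form solution described in [25], while the variable names and the absence of any efficiency tricks are our own choices.

```python
import numpy as np

def tca(X_source, X_target, n_components=150, mu=0.3):
    """Minimal linear-kernel TCA: learn transfer components that reduce the MMD
    between domains while preserving data variance, and return the embedded
    source and target data."""
    ns, nt = len(X_source), len(X_target)
    n = ns + nt
    X = np.vstack([X_source, X_target])

    K = X @ X.T                                    # linear kernel on the stacked data
    e = np.vstack([np.full((ns, 1), 1.0 / ns),     # MMD coefficient vector
                   np.full((nt, 1), -1.0 / nt)])
    L = e @ e.T                                    # MMD matrix
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix

    # Leading eigenvectors of (K L K + mu I)^{-1} K H K give the transfer components
    A = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    eigvals, eigvecs = np.linalg.eig(A)
    order = np.argsort(-eigvals.real)[:n_components]
    W = eigvecs[:, order].real

    Z = K @ W                                      # n x n_components embedding
    return Z[:ns], Z[ns:]
```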

Transfer Joint Matching (TJM): this method aims at reducing the domain difference by jointly using two learning strategies for domain adaptation: feature matching and instance re-weighting. Feature matching discovers a shared feature representation by jointly reducing the distribution difference and preserving the important properties of the input data. Matching the feature distributions based on MMD minimization alone is not enough for domain adaptation, since it can only match first- and high-order statistics, and the distribution matching is far from perfect. An instance re-weighting procedure is therefore incorporated to further minimize the distribution difference by re-weighting the source data and then training a classifier on the re-weighted source data [20].

To fully assess the effect of each of the domain adaptation methods described above, we also trained a linear SVM on the source data and tested it on the target data. In the following we refer to these experiments as the “NA” (no adaptation) results. The C parameter of the SVM was set by cross-validation on the source domain over the values C \(\in \) {0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100, 1000, 10000}, using the LIBSVM library.
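
An equivalent baseline can be sketched with scikit-learn in place of LIBSVM (the solver differs slightly, so this is only an approximation of our setup, not the code used for the reported numbers).

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def no_adaptation_baseline(X_source, y_source, X_target):
    """'NA' baseline: a linear SVM trained on the source domain, with C selected
    by cross-validation on the source data only, then applied to the target."""
    grid = {'C': [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10, 100, 1000, 10000]}
    search = GridSearchCV(LinearSVC(), grid, cv=5).fit(X_source, y_source)
    return search.predict(X_target)
```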

2.4 Experimental Setup

As described before, we evaluated the different DA methods by comparing their performance with that of a linear SVM classifier for the no-adaptation results, where we use the original input space without learning a new representation. Z-normalization is the first important step for all domain adaptation algorithms, and PCA is the method used for dimensionality reduction (a preprocessing sketch is given after the list below). For each type of feature we set different parameters for each domain adaptation algorithm:

  • In GFK the dimensionality of the subspaces was set to 120 for the LBP features, 300 for the SIFT features and 200 for the CNN features. We evaluated the accuracy of this method on the target domain over 5 random trials for each type of feature.

  • In SA we set the dimensionality of the subspaces to 150 for the LBP and CNN features and to 300 for the SIFT features. Also for the SA algorithm the evaluation was performed over 5 random trials for each type of feature.

  • In the LSSA algorithm an important parameter is the threshold measuring the quality of a candidate landmark, which we set to 0.5: if the quality measure of a candidate is above this threshold, it is kept as a landmark. The dimensionality of the subspaces was set to the number of matched landmarks.

  • In TCA we used a linear kernel on the inputs and fixed the tradeoff parameter \(\mu \) = 0.3 to construct the transformation matrix. The dimensionalities of the latent spaces were fixed to 150 for the LBP and SIFT features and to 200 for the CNN features.

  • The TJM approach involves two model parameters: the number of subspace bases k and the regularization parameter \(\lambda \). We set \(\lambda \) by searching over \(\lambda \) \(\in \) {0.01, 0.1, 1, 10, 100}. The k parameter was set to 100 for each type of feature. The evaluation was performed over 5 random trials.
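
The preprocessing mentioned above can be sketched as follows; since the exact normalization protocol is not spelled out here, the per-domain standardization and source-fitted PCA in this sketch are assumptions, and other choices (joint statistics, per-domain PCA) are equally plausible.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def preprocess(X_source, X_target, n_components=150):
    """Z-normalize the features and reduce their dimensionality with PCA before
    running a domain adaptation method. ASSUMPTION: each domain is standardized
    with its own statistics and PCA is fit on the source data only."""
    Xs = StandardScaler().fit_transform(X_source)
    Xt = StandardScaler().fit_transform(X_target)
    pca = PCA(n_components=n_components).fit(Xs)
    return pca.transform(Xs), pca.transform(Xt)
```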

Table 1. Domain adaptation results with the LBP features. A: ALOT database, R: RawFooT database
Table 2. Domain adaptation results with the SIFT features. A: ALOT database, R: RawFooT database

3 Results

The results obtained for each domain adaptation method are reported in Table 1 for the LBP features, in Table 2 for the SIFT features and in Table 3 for the CNN features. The GFK method outperforms on average the other approaches with all types of features when the ALOT database is the source domain and the RawFooT database is the target domain. In the opposite direction, the best result is obtained with the TJM algorithm on the CNN features. We can also note that the type of feature plays an important role in the evaluation of the DA methods: when using the CNN features, we achieve the greatest improvement for all methods.

Figure 4 shows the confusion matrices, where the element of a matrix at position (i,j) is the count of observations known to be in group i (true label) but predicted to be in group j (predicted label), for NA and GFK (top row) and for NA and TJM (bottom row), using deep features. These are the cases where we see the greatest advantage in using DA approaches for the ALOT \(\rightarrow \) RawFooT and RawFooT \(\rightarrow \) ALOT settings, respectively. Both domain adaptation algorithms significantly reduce the domain shift, alleviating the misclassifications compared to the case where the domain shift is not taken into consideration.

Table 3. Domain adaptation results with the CNN features. A: ALOT database, R: RawFooT database
Fig. 4. Confusion matrices for No Adapt (a) and GFK (b) with the CNN features when the ALOT database is the source domain, and for No Adapt (c) and TJM (d) when the RawFooT database is the source domain.

4 Conclusion

This paper addressed the issue of generalization in texture classification in the context of domain adaptation. We presented a new benchmark setting that permits studying the problem in this domain, and a benchmark evaluation of shallow algorithms using handcrafted as well as deep features. Our results confirm the existence of the domain shift, as well as the superior generalization abilities of deep features and the effectiveness of domain adaptation algorithms in increasing the generalization across datasets. Future work will extend this study by adding deep domain adaptation approaches and by designing larger experimental setups.