1 Introduction

Personal identification based on ear recognition is an active research area in biometrics. Ear images can be captured from a distance, which makes the technology an appealing choice for surveillance and security applications as well as other application domains. Unlike faces, ears are relatively constant over a person’s life and are unaffected by expressions, which makes them a particularly appealing approach to noncontact biometrics [8].

The ear structure is rich, stable, and permanent over a human lifetime, and it is quite distinctive across individuals. It is also invariant to changes in pose and facial expression. Furthermore, it is relatively free from the anxiety, privacy, and hygiene concerns associated with several other biometric candidates. Therefore, automated personal identification systems using ear images have been studied intensively for possible commercial applications [24]. The ear as a biometric has some advantages over other biometrics such as the iris, fingerprints, face and retinal scans: it is large compared to the iris and fingerprint, and image acquisition of the human ear is very simple and can be performed from a distance.

The anatomy of the human ear is given in Fig. 1. The human ear is an extremely arched 3D surface with 3D discriminant features suitable for human identification and recognition. Figure 1 shows various parts of the outer ear such as the helix, fossa, crus-antihelix, anti-helical fold, lower antihelix, antitragus, tragus and upper and lower concha.

Fig. 1 Ear image

It has been found that no two ears are exactly the same, even those of identical twins [4, 26].

For all the previous reasons, ear biometrics can be used for computerized human identification and verification systems. One of the major applications of this technology is recognition in crime investigation and forensic science.

In this paper, two scenarios for ear classification are implemented. The first uses the discrete curvelet transform (DCT), which was especially designed to link scale with orientation. The DCT performs a multiscale and multidirectional expansion that provides a good representation of objects with edges, especially objects that are smooth except for discontinuities along general curves with bounded curvature. The DCT was previously used for ear recognition in [1] with the IIT Delhi ear database. No segmentation method was used, as the ear images were already cropped. A feature vector was generated from the coarsest-level image and the maximum coefficient of each image at the second coarsest level at eight different angles. The k-nearest neighbor classifier was used for classification, and the recognition rate reached 97%.

In this paper, features are extracted from the DCT in a different manner, which will be explained in later sections, from the AMI ear database, which is not previously segmented, and ensemble classifiers are used for classification.

The second scenario employs deep learning. Here, three pretrained networks, namely AlexNet, GoogleNet and ResNet50, perform end-to-end ear classification. Two databases are investigated: the AMI ear database with 100 classes and the IIT Delhi ear database with 125 classes. Classification accuracies are measured and compared.

Features from selected layers in the three networks are extracted and again passed to the ensemble classifiers for classification. Principal component analysis (PCA) with different variance levels is used for feature reduction. Classification results at the different levels are compared. A block diagram of the proposed system is shown in Fig. 2.

Fig. 2 Block diagram of the proposed system

The rest of the paper is organized as follows: In Section 2, a background on different methods for ear recognition is introduced. The details of the first scenario for ear classification are given in Section 3, followed by its results in Section 4. Deep learning is briefly introduced in Section 5, where the second scenario is explained. Results of the second scenario are given in Section 6. A discussion of the results is provided in Section 7, and finally the paper is concluded in Section 8.

Research contributions can be summarized as follows:

  1. Designing two tracks for personal identification systems based on ear recognition.

  2. The first track implements a classical machine learning method which goes through segmentation, feature extraction using the DCT and classification using ensemble classifiers.

  3. The other track investigates deep learning methods using different CNNs.

  4. The results are compared with the latest state-of-the-art methods using the same datasets, which proved the superiority of the proposed algorithms.

2 Background

The field of biometric identification systems usually explores two tracks: one uses different feature extraction and classification methods, which are the traditional machine learning methods, while the other track involves deep learning methods.

We start by reviewing research based on machine learning techniques, which began with Burge and Burger in 1998 with the first automatic ear recognition technique, based on an adjacency graph built from Voronoi regions of ear curves [7]. In 1999, Moreno et al. [27] presented the first fully automated ear recognition procedure using geometric characteristics of the ear and a compression network. In 2000, an ear recognition technique based on the force field transform was suggested in [16].

The Forensic Ear Identification (FEARID) project was launched in 2001, marking the first large-scale project in the field of ear recognition [19]. In this project, ear prints were studied to investigate their strength as evidence at crime scenes. Three left and three right ear prints were collected from three countries. The equal error rate was used for evaluation and was 4% for lab-quality images, but increased to 9% for print-versus-mark comparisons.

Later, several ear recognition techniques were implemented. Victor et al. [5] applied principal component analysis (PCA) to ear images, which gave promising results, although the results indicated that the face is a more reliable biometric than the ear.

The force field transform was used in [17]. The method was implemented on 252 ear images taken from 63 subjects from the XM2VTS face database. The accuracy reached 99.2% for poorly registered and extracted ear images, dropped to 62.4% when using PCA, but then increased to 98.4% with accurate extraction and registration. In 2006, a method based on non-negative matrix factorization (NMF) was developed by Yuan et al. [39] and was applied to occluded and non-occluded ear images from the USTB ear database. Ears were manually extracted, and three ears were used for training and the fourth for testing. The best recognition rate reached 91%; the drawback here is the manual extraction. A method based on the 2D wavelet transform was introduced by Nosrati et al. [28] in 2007, followed by a technique based on log-Gabor wavelets in the same year [23]. In [28], the authors used the 2D wavelet transform for feature extraction on two databases, USTB and Carreira-Perpinan. Accuracies reached 90.5% for the USTB database with two images out of four used for training; for the Carreira-Perpinan dataset, accuracies reached 95.05% with three images out of four used for training and 97.05% with all four images used for training. The accuracy increased when more images were used for training, and in the case of four images, all images are training images.

In 2011, local binary patterns (LBP) were used for ear image description in [38]. Binarized statistical image features (BSIF) and local phase quantization (LPQ) features were also used, and their results are given in [6, 30, 31]. In [6], the authors used BSIF to extract features from three databases, IIT Delhi 1 and 2 and the USTB database. The results reached 97.26%, 97.34% and 98.46%, respectively, with KNN classifiers. The authors in [30] used a database combined from three datasets, with 2432 images from 555 subjects: 363 subjects from UND-J2, 67 subjects from AMI and 125 subjects from IIT Delhi, with at least two samples per subject. The hit rate reached 96.89% with PCA used for dimensionality reduction and improved to 99.01% when a multi-cluster search strategy was used. The authors continued their work in [31], where the best results on three different datasets were achieved when the LPQ and BSIF descriptors were combined for feature extraction, LDA was used for dimensionality reduction and the cosine distance was used for classification. The authors in [24] proposed automated human identification using 2D ear imaging. They presented a segmentation method based on morphological operations and Fourier descriptors. They extracted ear features using localized orientation information and examined local gray-level phase information using complex Gabor filters. The rank-one recognition accuracy reached 96.27% and 95.93%, respectively, on the 125- and 221-subject sets of the IIT-Delhi database.

In 2014, the first ear recognition system based on curvelet features was published in [1]. The feature vector of each image is composed of the approximate curvelet coefficients and the maximum coefficients of the second coarsest level curvelet coefficients at eight different angles. The k-nearest neighbor (k-NN) classifier is utilized. The accuracy reached 97% for the IIT-Delhi database; here the authors used the pre-segmented ear images of the IIT-Delhi database.

The second track, based on deep learning methods, is investigated in the following research papers.

In [18], the authors fused deep features from different layers using discriminant correlation analysis and used pairwise SVM and KNN for classification. They applied their work to the USTB I and II and IIT-Delhi I and II ear databases and achieved accuracy rates above 99%. In [40], the authors proposed a new ear database acquired under uncontrolled conditions and tested the classification accuracy with a CNN. They replaced the last pooling layers with spatial pyramid pooling (SPP) layers in order to fit arbitrary data sizes and obtain multi-level features. They achieved a maximum accuracy of about 97%.

The authors in [11] proposed a deep learning model for unconstrained ear recognition. They passed the features to a shallow classifier for ear recognition and suggested a deep learning–based averaging ensemble to limit overfitting, with the best results achieved by an ensemble of ResNet18 models, which provided consistent performance across the tested datasets. In [12], the authors created a new ear dataset called the multi-PIE ear dataset. The classification accuracy was improved by combining the outputs of different CNN models. In [13], the authors proposed a fusion of learned deep features with handcrafted features for unconstrained ear recognition. They reached the conclusion that handcrafted features are not dead and that they improve performance.

ScoreNet, a deep cascade-level fusion approach, is proposed in [20]. Here, the authors fuse deep features from different levels of different CNN networks with handcrafted features for unconstrained ear recognition.

In [29], the authors created an earcode from the first principal component obtained by kernel PCA. The authors built their own dataset from 103 persons. The performance rates were comparatively high, with an EER of 0.13 and a TPR of 0.85. The authors also applied the algorithm to standard databases, the IIT1 and USTB1 databases, and achieved comparable results.

The authors in [2] created a human recognition algorithm based on the fusion of the ear and tragus in a single image to overcome the challenges of partial occlusion, pose variation and weak illumination. The local binary pattern is used for feature extraction, with score-level fusion and KNN used for classification. The experiments were implemented on the USTB 1, 2 and 3 datasets and gave comparable results.

The authors in [21] discussed the lack of color information in ear images and its effect on accuracy. They suggested a framework for colorizing grayscale and dark images followed by a classification task. The algorithm was implemented on two databases, the constrained AMI and the unconstrained AWE ear datasets, and provided accuracies of 96% and 50.53%, respectively.

In [3], the authors presented an ear recognition system based on CNNs, especially VGG networks. The best models were used to build ensembles of models with varying depth. The work was implemented on the AMI and WPUT ear datasets and also the AMIC database, which is the original AMI database but with the background critically cropped. The rank-1 classification accuracy reached 97.5% for AMI with VGG ensembles of depths 13-16-19, 93.21% for AMIC with VGG ensembles of depths 11-13-16-19 and 79.08% for the WPUT database with the same VGG ensembles.

It is noticeable here that the recognition accuracy is reduced on AMIC, which is a segmented version of the AMI database; this is because cropping profile images may result in losing important information.

The local binary pattern (LBP) and its use for extracting features from ear images are discussed by the authors in [14]. They investigated its performance over five benchmark databases: IIT Delhi I and II, AMI, WPUT and AWE. The results showed good performance for constrained images, while the accuracies decreased considerably with increased distortions.

A six-layer deep CNN was proposed by the authors in [33] and was tested on the IIT Delhi II and AMI ear datasets, with recognition accuracies that reached 97.36% and 96.99%, respectively, for 1000 epochs. The experiments were repeated on the AMI dataset with the ear images rotated by different angles under different illumination conditions and with added random noise. The recognition accuracy decreased, reaching 91.99% for the combined variation conditions.

3 Ear classification with DCT features

In this section, the framework of the first scenario is introduced. The ear recognition algorithm starts with a simple segmentation method based on filtering and morphological operations. The segmentation method crops the ear images, leaving a part of the background. The discrete curvelet transform via wrapping is employed for feature extraction. Statistical features are extracted from the curvelet images at different decomposition levels (three, four and five levels). The coarsest-level image is divided into blocks, and the mean and standard deviation are calculated for each block and concatenated with the same features extracted from the images at the fine levels, forming the ear feature vector. The entropy is then added using the same technique, forming a new feature vector. The ensemble classifier using subspace discriminant analysis is used for classification.

The proposed algorithm is investigated in the following sections.

3.1 Ear segmentation

The AMI Ear Database used in this first scenario was created by Esther Gonzalez during her PhD work in Computer Science. The database contains images collected from students, teachers and staff of the Computer Science department at Universidad de Las Palmas de Gran Canaria (ULPGC), Las Palmas, Spain. The images were captured in an indoor environment. The database was collected from 100 different subjects in the age group of 19–65 years. Seven images (six right ear images and one left ear image) were taken of each individual.

Samples of the used database are shown in Fig. 3.

Fig. 3 Sample images from the used database

A Nikon D100 camera was used to capture all the images under the same lighting conditions, with the subject seated at a distance of about 2 meters from the camera and looking at previously fixed marks. Six of the seven images used a 135 mm focal length, while a 200 mm focal length was used for the image called ZOOM. Five of the captured images were right-side profile images (right ear) with the subject facing forward (FRONT), looking up and down (UP, DOWN) and looking left and right (LEFT, RIGHT). The sixth right-profile image was taken with the subject also facing forward but with a different camera focal length (ZOOM). The last image, called BACK, was a left-side profile (left ear); the subject in this case faced forward and the same camera focal length was used as in the previous five images.

The database consists of 700 images, sequentially numbered for every subject with an integer identification number. Images have a resolution of 492 × 702 pixels and are available in JPEG format.

In this paper, 94 subjects, each having six images (excluding the BACK ear image), are used, comprising 564 images. Six subjects and the BACK ear images are removed from the database in the first scenario because they were not correctly segmented.

All images are converted from color to grayscale, where the grayscale value is a weighted sum of the R (red), G (green), and B (blue) components as follows:

$$0.2989\,R + 0.5870\,G + 0.1140\,B$$

Image segmentation starts with low-pass filtering the ear image, followed by applying a grey-level threshold to convert it to a binary image; in-between holes are then filled. The threshold is a global threshold chosen using Otsu’s method, which selects the threshold that minimizes the intraclass variance of the thresholded black and white pixels and is used for binarization. Otsu’s method iterates through all possible threshold values and calculates a measure of spread for the pixel levels on each side of the threshold, i.e., for the pixels that fall in the foreground or the background.

All connected components (objects) that have fewer than P pixels are removed, producing another binary image. The value of P is experimentally chosen to be 150.

Mapping is performed between the processed binary image and the original image to produce the final segmented ear image.
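The segmentation steps above can be summarized in the following sketch. This is a minimal Python/scikit-image stand-in for the described pipeline, not the original MATLAB code; the Gaussian smoothing width is an assumption, while the threshold P = 150 is taken from the text.

```python
import numpy as np
from scipy import ndimage
from skimage import filters, morphology

def segment_ear(rgb):
    """Sketch of the segmentation pipeline described above."""
    # Weighted grayscale conversion (0.2989 R + 0.5870 G + 0.1140 B).
    gray = 0.2989 * rgb[..., 0] + 0.5870 * rgb[..., 1] + 0.1140 * rgb[..., 2]
    # Low-pass filtering (the smoothing width is an assumption).
    smooth = ndimage.gaussian_filter(gray, sigma=2)
    # Global Otsu threshold, then binarization and hole filling.
    binary = smooth > filters.threshold_otsu(smooth)
    binary = ndimage.binary_fill_holes(binary)
    # Remove connected components with fewer than P = 150 pixels.
    binary = morphology.remove_small_objects(binary, min_size=150)
    # Map the binary mask back onto the original grayscale image.
    return gray * binary
```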

The ear segmentation steps are demonstrated in Fig. 4.

Fig. 4 The segmentation process

It can be noticed that a very simple segmentation method is used and a part of the background is included.

Samples of other segmented ear images are shown in Fig. 5.

Fig. 5 Samples of segmented ear images: original image 1, segmented image 1, original image 2, segmented image 2

3.2 The discrete curvelet transform

The curvelet transform was suggested by E. J. Candès and D. L. Donoho in [9]. It is a geometric transform created to overcome the limitations of wavelet-like transforms. The curvelet transform is a multi-scale and multi-directional transform with needle-shaped basis functions. The basis functions of the wavelet transform are isotropic; therefore, a large number of coefficients is required to represent curve singularities. The basis functions of the curvelet transform, in contrast, are needle shaped and have high directional sensitivity and anisotropy. They also obey parabolic scaling, and therefore the curvelet transform allows an almost optimal sparse representation of curve singularities.

The Curvelet transform was designed to represent edges and other singularities along curves much more efficiently than traditional transforms by using fewer coefficients for a given accuracy of reconstruction.

The origin of the curvelet transform is the ridgelet transform combined with a binary square window; however, this first-generation transform contains considerable data redundancy. It was therefore improved into the second-generation curvelet transform, which features faster computation and less redundancy. Curvelets as functions of \(x = (x_1, x_2)\) at scale \(2^{-j}\), orientation \(\theta_l\) and position \(x_k^{(j,l)} = R_{\theta_l}^{-1}\left(k_1 \cdot 2^{-j},\, k_2 \cdot 2^{-j/2}\right)\) are defined as [10]

$$\varphi_{j,l,k}\left(x\right) = \varphi_j\left(R_{\theta_l}\left(x - x_k^{(j,l)}\right)\right)$$
(1)

\(R_\theta\) is the rotation by \(\theta\) radians, given by

$$R_\theta = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}$$
(2)

and \(R_\theta^{-1}\) is its inverse.

The curvelet coefficient is calculated as the inner product between an element \(f \in L^2(\mathbb{R}^2)\) and a curvelet \(\varphi_{j,l,k}(x)\):

$$c\left(j,l,k\right) := \left\langle f, \varphi_{j,l,k} \right\rangle = \int_{\mathbb{R}^2} f\left(x\right) \overline{\varphi_{j,l,k}\left(x\right)}\, dx$$
(3)

In digital form, the curvelet transform is given by

$$C^D\left(j,l,k\right) = \sum_{0 \le t_1, t_2 < n} f\left(t_1, t_2\right) \overline{\varphi^D_{j,l,k}\left(t_1, t_2\right)}$$
(4)

where \(f(t_1, t_2)\) is an input Cartesian array with \(0 \le t_1, t_2 < n\).

The superscript D stands for digital, \(\varphi^D_{j,l,k}\) is the digital curvelet, and \(C^D(j,l,k)\) is the collection of coefficients.

The digital curvelet transform is implemented using the fast discrete curvelet transform. It is computed in the spectral domain to take advantage of the FFT: the image and the curvelet are both transformed into the Fourier domain, where the spatial-domain convolution of the curvelet with the image becomes a pointwise product. The curvelet coefficients are finally obtained by applying the inverse Fourier transform to this spectral product.

The frequency response of a curvelet is a non-rectangular wedge, which must therefore be wrapped into a rectangle before the inverse Fourier transform can be applied. The wrapping is performed by periodically tiling the spectrum with the wedge and then collecting the rectangular coefficient area at the center; the rectangular region collects the wedge’s corresponding portions from the surrounding periodic wedges [36]. The complete feature extraction process using a single curvelet is illustrated in Fig. 6.
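The Fourier-domain mechanism underlying this procedure can be sketched as follows: convolving the image with a curvelet filter in the spatial domain is realized as a pointwise product of their spectra. This toy sketch ignores the wedge wrapping itself and assumes a hypothetical `curvelet_filter` array of the same size as the image; it only illustrates the FFT-product step.

```python
import numpy as np

def fft_filter(image, curvelet_filter):
    """Pointwise spectral product standing in for spatial convolution.

    `curvelet_filter` is a hypothetical real-space filter of the same
    shape as `image`; the wedge wrapping of the real transform is
    omitted here.
    """
    spectrum = np.fft.fft2(image) * np.fft.fft2(curvelet_filter)
    return np.real(np.fft.ifft2(spectrum))
```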

Fig. 6 Feature extraction using the curvelet transform [2]

In this paper, the ear images are segmented and then transformed into the curvelet domain using the discrete curvelet transform (DCT) via wrapping. The DCT is a redundant transform. It is implemented at different decomposition levels (three, four and five levels), and feature vectors are created.

In the three-level decomposition, we have the coarsest level and one fine level at eight different angles. An example of a decomposed ear image is shown in Fig. 7.

Fig. 7 Three-level decomposition of the ear image: the coarsest level and the fine level at eight different angles

In the three-level decomposition, the coarsest-level image is of size 85×85 and is divided into blocks of size 7×7, giving 169 blocks. In each block, the mean (mn) and standard deviation (stnd) are calculated, resulting in 169 values for mn and the same for stnd. The mn and stnd are also calculated for each of the eight subimages in the fine level, giving 8 values for mn and the same for stnd. Finally, the last level contributes one value for mn and one for stnd. All these values are concatenated to obtain a feature vector of length [169 + 169 + 8 + 8 + 1 + 1 = 356].

The same procedure is applied for the four-level decomposition, with a coarsest-level image of size 43×43 (giving 49 blocks) and two fine levels: the first fine level has subimages at eight different angles and the second has subimages at sixteen different angles.

The feature vector is of length [49 + 49 + 8 + 8 + 16 + 16 + 1 + 1 = 148].

Again, for the five-level decomposition, the coarsest-level image is of size 21×21 (giving 9 blocks) and there are three fine levels with sixteen, thirty-two and thirty-two subimages at different angles, respectively.

The feature vector is of length [9 + 9 + 16 + 16 + 32 + 32 + 32 + 32 + 1 + 1 = 180].

The classification results using the ensemble classifier will be shown later.

Another statistical parameter, the entropy, is added and the same procedure is implemented. The entropy (E) is a statistical measure of randomness; it is used to estimate the texture of an input image and can also measure the distribution variation in a region. With the entropy added, the feature vector length becomes 534 for three levels, 222 for four levels and 270 for five levels. The mathematical definitions of the mean, standard deviation and entropy of each block Z are given below.

$$mn_z = \frac{1}{FG} \sum_{i=1}^{F} \sum_{j=1}^{G} p_z(i,j)$$
(5)

where \(p_z(i,j)\) are the pixels in block Z and F×G is the block size.

$$stnd_z = \sqrt{\frac{1}{FG} \sum_{i=1}^{F} \sum_{j=1}^{G} \left(p_z(i,j) - mn_z\right)^2}$$
(6)
$$E_z = -\sum_{i=0}^{n-1} pr_i \ln pr_i$$
(7)

where n is the number of grey levels and \(pr_i\) is the probability of a pixel having grey level i.
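As an illustration, the per-block statistics of Eqs. (5)–(7) could be computed as in the following sketch. This is a hypothetical helper rather than the code actually used; the handling of partial edge blocks and the histogram bin count are assumptions.

```python
import numpy as np

def block_features(img, block=7, n_levels=256):
    """Mean, standard deviation and entropy per block (Eqs. 5-7)."""
    mns, stnds, ents = [], [], []
    H, W = img.shape
    for r in range(0, H, block):
        for c in range(0, W, block):
            z = img[r:r + block, c:c + block]   # partial edge blocks kept
            mns.append(z.mean())
            stnds.append(z.std())
            # Entropy of the block's grey-level distribution.
            hist, _ = np.histogram(z, bins=n_levels)
            pr = hist[hist > 0] / z.size
            ents.append(-np.sum(pr * np.log(pr)))
    # Concatenate statistics, as done for the coarsest-level blocks;
    # an 85x85 image with block=7 yields the 169 blocks quoted above.
    return np.concatenate([mns, stnds, ents])
```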

Again, the classification results will be discussed later.

3.3 The ensemble classifier

Ensemble classifiers are a group of individual classifiers that are cooperatively trained on data sets to solve a supervised classification problem [34].

The base classifiers are trained separately on the data set to give a decision on a test pattern. The decisions are then combined by a suitable fusion method. A number of fusion methods are discussed in the literature, including majority voting, Borda count, algebraic combiners, etc. [32].

Several classifiers were investigated for ear classification using the features obtained in the first scenario, including decision trees, support vector machines (SVM) and k-nearest neighbors, as well as ensemble classifiers, which provided the best results, especially the subspace discriminant ensemble classifier. The number of learners is chosen to be 30, and the subspace dimension is set to 178. The classifier has 94 input classes and uses 5-fold cross validation.
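For reference, a random-subspace ensemble of discriminant learners with this configuration might be set up as follows. This is a scikit-learn sketch rather than the MATLAB Classification Learner model actually used; `X` and `y` stand for the curvelet feature matrix and the 94 subject labels.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# Subspace discriminant ensemble: 30 learners, each trained on a
# random 178-dimensional feature subspace, with 5-fold cross validation.
clf = BaggingClassifier(
    estimator=LinearDiscriminantAnalysis(),
    n_estimators=30,           # number of learners
    max_features=178,          # subspace dimension
    bootstrap=False,           # use all samples; subsample features only
    random_state=0,
)
scores = cross_val_score(clf, X, y, cv=5)
print("mean accuracy:", scores.mean())
```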

4 Results of the first scenario

As mentioned earlier, the AMI ear image database is used in this scenario: 94 subjects, each having six images, comprising a total of 564 images. The DCT via wrapping is implemented at different decomposition levels (3, 4 and 5 levels).

The curvelet transform is a highly redundant transform; therefore, features have to be carefully selected. Several experiments were implemented and tested. The coarsest-level image is divided into sub-blocks of size 7×7, and the mean and variance are calculated for each block and concatenated with the calculated mean and variance of the subimages at the fine levels, forming the feature vector.

For the three-level decomposition, the coarsest-level image is of size 85×85 and the second coarsest level has subimages at eight different angles. Similarly, for four levels, the coarsest-level image is of size 43×43 and the finer levels have eight and then sixteen different angles, respectively. For five levels, the coarsest-level image is of size 21×21 with sixteen, thirty-two and thirty-two angles at the second, third and fourth finest levels, respectively.

Several classifiers were used for training and testing, including support vector machines (SVM), k-nearest neighbors (KNN), decision trees and ensemble classifiers, which proved to produce the best accuracy results; the ensemble classifier is therefore used to measure the accuracy. The accuracy is given by:

$$accuracy=\frac{{t}_{p}+{t}_{n}}{{t}_{p}+{t}_{n}+{f}_{p}+{f}_{n}}$$
(8)

where tp is the number of true positives, tn the number of true negatives, fp the number of false positives and fn the number of false negatives.

The accuracy results are shown in Table 1.

Table 1 Accuracy results for feature vectors of mean and variance

As seen in Table 1, the best achieved accuracy, 77.8%, was obtained when using four levels.

Another feature is added in an attempt to improve the accuracy. The entropy is calculated in the same way as the mean and variance and concatenated with the previously calculated feature vector, resulting in a new feature vector with three components: mean, variance and entropy. The accuracy results using the ensemble classifier are tabulated in Table 2.

Table 2 Accuracy results for feature vectors of mean, variance and entropy

The block size in the coarsest-level image was initially selected to be 7×7. It is clear from the accuracy results that the best configuration is the 4-level one, so this configuration is taken as the model for all the coming experiments. Several other block sizes were investigated, and their results are summarized in Table 3.

Table 3 Accuracy results for different block sizes for the 4-levels configuration

As noticed in Table 3, dividing the coarsest-level image into 9×9 blocks achieves almost the same accuracy as 7×7 blocks but with a smaller feature vector (150 coefficients instead of 222), which reduces the computational complexity.

To evaluate the 9×9 block for all levels, the same procedure is done and the results are tabulated in Table 4.

Table 4 Accuracy results for feature vectors of mean, variance and entropy (9 × 9) block

Table 4 indicates that, even though the feature vector decreased in length, the accuracy increased.

The results for the 9 × 9 block are repeated 10 times and validated by performing a t test, which resulted in a p value < 0.05. The results are given in Table 5.

Table 5 t test results for the (9 × 9 block)

Although the segmentation algorithm used here is very simple, the ensemble classifiers are able to provide competitive results, which can be used for medium-security applications.

The question is whether the segmentation process is necessary; that is, if the curvelet features are extracted directly from the raw ear images, which reduces the processing time, how is the accuracy affected?

The accuracy results for the 7×7 coarsest level block division and 9×9 coarsest level block division for raw images without segmentation are tabulated in Tables 6 and 7 respectively.

Table 6 Accuracy results for feature vectors of mean, variance and entropy from raw images (7 × 7 block)
Table 7 Accuracy results for feature vectors of mean, variance and entropy from raw images (9 × 9 block)

Again, the best achieved accuracy result was for the four level decomposition.

The non-segmented ear recognition can thus be used for high-speed, medium-security applications.

The strength of the proposed technique lies in the simple segmentation method combined with the ensemble classifiers. A small feature vector of only 150 coefficients extracted from the curvelet coefficients is used, which reduces the computation time and yields reasonable results suitable for medium-security applications. Comparing the results with state-of-the-art models on the same database, the proposed method outperforms the results obtained in [14], which provided an accuracy of 73.73% using local binary patterns.

In the next section, we will introduce the deep learning methods which produced superior results.

5 Deep learning

Progress in convolutional neural networks (CNNs) has encouraged researchers to apply different architectures in many applications such as image classification, object detection, medical imaging and face recognition. Examples of some applications are given in [25] and [35]. Deep networks extract low-, middle- and high-level features and classifiers in an end-to-end multi-layer fashion, and the number of stacked layers can enrich the “levels” of features. Recent research on ear recognition using deep features was discussed earlier.

In this paper, deep learning is employed for ear recognition, and three well-known pre-trained networks are investigated. In general, the top layers of a CNN contain semantic information, while the intermediate layers describe local features; low-level features describing textures and edges reside in the bottom layers.

Different scenarios are implemented. First, end-to-end ear recognition is performed through AlexNet, GoogleNet and ResNet50.

AlexNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, classifying input images into 1000 classes [22]. The AlexNet architecture has five convolutional layers and three fully connected layers. The AlexNet layers are given in Table 8.

Table 8 AlexNet layers

GoogleNet [37] is the winner of the ILSVRC 2014 competition. Its architecture consists of a 22-layer deep CNN, but it reduced the number of parameters from 60 million in the case of AlexNet to 4 million.

As a network gets deeper and starts to converge, its performance sometimes degrades as the network saturates. This is not due to overfitting or to adding more layers per se, but to the fact that not all systems are equally easy to optimize.

Layers of GoogleNet are shown in Table 9.

Table 9 GoogleNet layers

ResNet [15] was able to overcome this problem by introducing shortcut connections that skip one or more layers. These connections perform identity mapping, adding their outputs to the outputs of the stacked layers. The ResNet50 layers are given in Table 10.

Table 10 ResNet50 layers

To overcome the limited data size which can cause overfitting, data augmentation is implemented. This process generates batches of new images from the original data with some preprocessing such as resizing, rotation, translation and reflection.

Transfer learning is a technique commonly used in deep learning, where a model trained on a certain task is reused for another task. Transfer learning is an optimization that allows rapid progress or improved performance when modelling the second task. The network is first trained on a dataset, and then the learned features are transferred to a second target network to be trained on a target dataset and task. This process works well if the features are general, meaning that they are suitable for both the base and the target tasks, rather than specific to the base task.
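A minimal transfer-learning sketch is shown below in PyTorch as a stand-in for the MATLAB workflow used in this paper: an ImageNet-pretrained network is loaded and its final fully connected layer is replaced by one sized for the ear classes. The optimizer settings echo those reported in Section 6; the momentum value is an assumption.

```python
import torch
import torch.nn as nn
from torchvision import models

n_classes = 100  # e.g. the AMI ear database

# Load an ImageNet-pretrained ResNet50 and swap the classification head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, n_classes)

# Stochastic gradient descent with momentum, initial learning rate 0.0001.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
```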

In the second scenario, deep learning is used for ear recognition. Two ear databases were investigated. The first is the AMI ear database from the first scenario, but here with 100 classes each having seven images, for a total of 700 images rather than the 94 subjects used before. The second is the IIT Delhi database, with 125 classes each having at least three images, for a total of 493 images. The IIT Delhi database comes in two versions, raw ear images and segmented ear images, and both are investigated; for the AMI ear database, only raw images are investigated. The IIT Delhi ear database was collected from students and staff at IIT Delhi, Delhi, India. Images were acquired at a distance with a simple setup, and each subject has at least three images, with a resolution of 272 × 204 pixels for raw images and 50 × 180 pixels for segmented images. Samples from the IIT-Delhi raw and segmented databases are given in Fig. 8, where the upper row shows raw images and the lower row segmented images.

End-to-end deep learning using three pre-trained deep nets, namely AlexNet, GoogleNet and ResNet50, is implemented. Features are extracted from a suitable feature level of each network, which is layer fc7 in AlexNet, the dropout layer pool5-drop_7x7_s1 in GoogleNet and avg_pool in ResNet50, and are passed to several classifiers, including decision trees, KNNs, SVMs and ensemble classifiers. PCA is used to reduce the feature vector at different variance levels. Again, the only classifiers that succeeded in giving good results were the ensemble classifiers, especially the subspace discriminant; all other classifiers failed to give acceptable results. Deep learning results are given in the next section.

Fig. 8 Sample images from the raw and segmented IIT-Delhi ear database

6 Results of the second scenario

As mentioned in the previous section, end-to-end deep learning with three deep nets is used for ear classification. Image reflection across the x-axis and image translation of up to 30 pixels along the x- and y-axes are used for image augmentation. Transfer learning is applied to fine-tune the networks to the required number of classes. All experiments are done with Matlab 2018 on an NVIDIA GeForce GTX 1050. All networks were trained with stochastic gradient descent with momentum, a mini-batch size of 10 observations per iteration, a maximum of 20 epochs and an initial learning rate of 0.0001.
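An equivalent augmentation pipeline might look as follows in torchvision. This is a sketch, not the MATLAB augmenter actually used; the flip direction and the 224-pixel input side used to express the 30-pixel shift as a fraction are assumptions.

```python
from torchvision import transforms

# Random reflection plus translations of up to 30 pixels in x and y
# (expressed as a fraction of an assumed 224-pixel input side).
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomAffine(degrees=0, translate=(30 / 224, 30 / 224)),
    transforms.ToTensor(),
])
```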

Table 11 gives the end-to-end classification accuracies for AlexNet, GoogleNet and ResNet50 for AMI and IIT Delhi raw and segmented ear images.

Table 11 End-to-end percentage classification accuracies

As is clear from Table 11, the best results for the AMI database were achieved by ResNet50, with a mean accuracy of 99%. For the IIT Delhi raw image database, AlexNet and ResNet50 perform almost the same, with 94.29% for AlexNet and 93.57% for ResNet50. A large degradation in performance was noticed for the segmented ear image database, supporting the earlier conclusion that the segmentation step may not be necessary.

To validate the achieved results, the best case, ResNet50 with the AMI database, is repeated 10 times and a t test is performed, giving a p value less than 0.05. Results are given in Table 12.

Table 12 t test results for the Resnet50 classification accuracies with the AMI database

Features are extracted from the different networks and passed to shallow classifiers for ear classification: 4096 features from AlexNet, 1024 features from GoogleNet and 2048 features from ResNet50. Several classifiers were investigated, including decision trees, SVMs, KNNs and ensemble classifiers. The only classifiers that succeeded in giving acceptable results were the ensemble classifiers; in fact, the subspace discriminant ensemble provided superior results in most cases. PCA at different variance levels (95%, 97% and 99%) was used for feature reduction. Again, these experiments are done on the AMI and IIT Delhi ear databases, for raw and segmented images.
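As a sketch of this step (PyTorch plus scikit-learn as stand-ins for the MATLAB workflow), deep features can be read from the layer preceding the classification head and then reduced with PCA at a chosen retained-variance level; the batch here is a random stand-in for preprocessed ear images.

```python
import torch
from torchvision import models
from sklearn.decomposition import PCA

# ResNet50 avg_pool features: drop the final fc layer and flatten.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.eval()
extractor = torch.nn.Sequential(*list(model.children())[:-1])

with torch.no_grad():
    batch = torch.randn(16, 3, 224, 224)   # stand-in preprocessed images
    feats = extractor(batch).flatten(1)    # shape (16, 2048)

# PCA keeping 95% of the variance; the component count is chosen
# automatically (compare the feature counts reported in Tables 13-15).
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(feats.numpy())
```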

Classification results for the different networks on the AMI database are shown in Tables 13, 14 and 15. The first column gives the number of features without PCA, followed by the counts with PCA at the different variance levels. As noticed, the AlexNet features were reduced to 136, 207 and 271 features at 95%, 97% and 99% variance, respectively. The same can be seen for the GoogleNet and ResNet features in Tables 14 and 15.

Table 13 Percentage classification results for features of AlexNet with ensemble classifiers on AMI database
Table 14 Percentage classification results for features of GoogleNet with ensemble classifiers on AMI database
Table 15 Percentage classification results for features of ResNet50 with ensemble classifiers on AMI database

The ResNet50 features provided the best accuracy for the AMI database when passed to the subspace discriminant ensemble classifier, which achieved a mean accuracy of 99.45%. Again, the best results are repeated 10 times and a t test is performed to validate them, resulting in a p value less than 0.05. The details are given in Table 16.

Table 16 t test results for the ResNet features passed to ensemble classifiers for AMI database

The same procedure is done for IIT Delhi ear database for raw and segmented images. The raw images results are shown in Tables 17, 18 and 19 while the segmented images results are shown in Tables 20, 21 and 22.

Table 17 Percentage classification results for features of AlexNet with ensemble classifiers on IIT Delhi raw database
Table 18 Percentage classification results for features of GoogleNet with ensemble classifiers on IIT Delhi raw database
Table 19 Percentage classification results for features of ResNet50 with ensemble classifiers on IIT Delhi raw database
Table 20 Percentage classification results for features of AlexNet with ensemble classifiers on IIT Delhi segmented database
Table 21 Percentage classification results for features of GoogleNet with ensemble classifiers on IIT Delhi segmented database
Table 22 Percentage classification results for features of ResNet50 with ensemble classifiers on IIT Delhi segmented database

Again, the best results are repeated 10 times and a t test is performed to validate them. The best result for the IIT Delhi database is 93.9% with 117 features, obtained from the AlexNet features reduced with PCA at 95% variance, which provided the best results with the least number of features; the results are shown in Table 23.

Table 23 t test results for the Alexnet features reduced with PCA and passed to ensemble classifiers for IIT Delhi database

A full discussion of the results is given in the next section.

7 Discussion

In this paper, two scenarios for ear classification are implemented and tested. In the first scenario, the ear image is segmented, and statistical features extracted from the different levels of the discrete curvelet transform are used to generate the feature vector. The statistical features are the mean, standard deviation and entropy. The DCT decomposes the ear image into a coarse level and fine levels. The coarsest-level image is divided into blocks, and the mean, standard deviation and entropy are extracted for each block; the same is done with the fine levels. Different block sizes and different numbers of decomposition levels (three, four and five, with different orientations) are investigated. The feature vector is then passed to different classifiers, and the subspace discriminant ensemble classifier was the only one that succeeded in giving competitive results, with a classification accuracy of 86.5% for the 4-level decomposition with a block size of 9×9; the length of the feature vector in this case was 150 coefficients. The segmentation process was then skipped and the raw ear images were passed directly to the DCT. The same process was followed to obtain the feature vector, which was passed to the subspace discriminant ensemble classifier, and the classification accuracy reached 83.3% and 82.8% for the 4-level decomposition with block sizes of 7×7 and 9×9, respectively. The AMI ear database was used in this scenario, with 94 subjects each having 6 images. Seventy percent of the data was used for training and the rest for testing, and the accuracy was obtained with five-fold cross validation. The achieved results are acceptable for medium-security applications, and the segmentation process proved to be unnecessary, as it may remove important parts of the image background. More statistical features could be added in future work with the aim of improving accuracy. To compare the achieved results with state-of-the-art models, this method is compared with the method used in [14] on the AMI database: the LBP in [14] achieved an accuracy of 73 ± 1.88%, which is lower than the accuracy presented in this work, demonstrating the efficiency of the proposed method using handcrafted features.

In the second scenario, deep learning is employed. First, end-to-end classification using three pretrained nets, AlexNet, GoogleNet and ResNet50, is performed. Two ear datasets were investigated: AMI, with 100 classes each having seven images, and IIT Delhi, with 125 classes for raw and segmented images, each subject having at least three images. To overcome the limited data size, data augmentation is used, and transfer learning is applied to adapt the final layers to the required task. The best achieved accuracy for the end-to-end experiments reached 99% with ResNet50 on the AMI database. AlexNet provided the best results for the IIT Delhi database for both raw and segmented images, with accuracies of 94.29% and 62.89%, respectively.

End-to-end classification is considered the best option nowadays as the accuracy is calculated in one step with no pre-processing.

Features are then extracted from the chosen CNNs and passed to a shallow classifier. Different classifiers were investigated, but the only ones that succeeded in giving superior results were the ensemble classifiers, especially the subspace discriminant. PCA is then applied to reduce the feature vector, which is again passed to the classifier; variance levels of 95%, 97% and 99% are investigated. For the AMI database, the best results were obtained with the full-length feature vector of 2048 ResNet50 coefficients, reaching 99.45%. The second best was 98.1% for the AlexNet features with 4096 coefficients, and the third best was 97.3% at 97% variance with 271 coefficients. For the IIT Delhi raw image database, the AlexNet features reduced with PCA at 95% variance gave the best result, 93.9% accuracy with 117 coefficients. The same accuracy was obtained with the 1024 GoogleNet coefficients, followed by 93.5% for the ResNet50 coefficients reduced with PCA at 95% variance.

The worst results were obtained for the IIT Delhi segmented database, with a maximum accuracy of 50.9% for the ResNet50 coefficients reduced with PCA at 95% variance; all other results were lower.

The achieved results for the segmented database confirm the idea that the segmentation process may remove important parts from the image which may degrade the accuracy results.

Comparing the results with state-of-the-art methods, the proposed method outperformed all deep learning methods on the AMI dataset, with a mean accuracy of 99% for end-to-end classification and 99.45% using ensemble classifiers on extracted deep features. Compared to the state-of-the-art models in [21], which used DCGAN + VGG16 and achieved an accuracy of 96%, and [3], which used ensembles of VGG 13-16-19 and achieved 97.5%, the proposed model produced superior results. This is not the case for the IIT Delhi database, where the achieved accuracy was lower than the state-of-the-art methods, as shown in Table 24. Some suggestions for improving the accuracy on the IIT Delhi database are using other deep nets such as DenseNet and DarkNet, combining deep features with handcrafted features, or combining deep features at different levels.

The performance of the proposed method is compared with previous techniques from the literature on the same databases, and the results are given in Table 24.

Table 24 Comparison with previous methods

The comparative table shows that the proposed method produced superior results for the AMI database.

8 Conclusions

Two tracks for ear recognition are investigated in this paper. In the first scenario, the ear images are segmented, and statistical features are extracted from the discrete curvelet transform at different levels to form the feature vector, which is then passed to the ensemble classifiers to obtain the recognition accuracy. Non-segmented ear images are also investigated. The classification accuracy for segmented ear images was higher than that for non-segmented images by only about 3%, which raises the question of whether the segmentation process is necessary. In the second scenario, deep learning methods are employed, first in an end-to-end procedure and then with features extracted and passed to a shallow classifier; the extracted features are reduced with PCA. This process was implemented on two databases with segmented and non-segmented ear images. The non-segmented images provided superior results over the segmented images for both methods, especially with the subspace discriminant ensemble classifiers. The proposed method produced superior results for the AMI database.