1 Introduction

There is a limited number of datasets available for facial micro-expression (henceforth micro-expression) analysis, and the ones that do exist vary in standards, especially in the frame rates and resolutions chosen for capturing the videos. Early datasets were created with low specifications, such as low resolution and frame rate. More recently, with new technologies for capturing and gathering data, researchers have started to create higher quality datasets. The non-publicly available datasets include USF-HD [23] and the Polikovsky dataset [21], which have frame rates of 29.7 and 200 fps respectively. The publicly available datasets include CASME [26] at 60 fps, SMIC [13] at 100 fps, CASME II [27] at 200 fps, SAMM [6] at 200 fps and CAS(ME)2 [22] at 30 fps. Researchers in this field have been collecting data using different settings for frame rate, resolution, experimental design, with or without stimuli, lighting conditions and camera model. While some suggested a high frame rate [6, 27], the most recent work [22] in this field still uses a low frame rate. The question that arises is whether these high quality datasets are needed to improve micro-expression analysis, and, among these different standards, which is the best?

To address the above question, we provide new insights into the implications of spatial and temporal changes on micro-expression recognition by conducting a comparative study using the most popular feature types on two popular high frame rate benchmark datasets, SMIC and CASME II. First, we review the relevant spatial-temporal work in micro-expression recognition and outline our method for generating various frame rates and resolutions. Then we summarise the three basic feature descriptors and the classifier used for this work. Finally, we present the results and discuss future work.

2 Related work

Changing the temporal settings using the temporal interpolation model (TIM), first introduced by Pfister et al. [18], has been tested in many micro-expression studies. To achieve more statistically stable histograms, Pfister et al. [19] used TIM to increase the number of frames of an earlier version of SMIC.

In 2013, Li et al. [13] carried out a micro-expression recognition experiment on the SMIC dataset to discriminate three classes of micro-expression (positive, negative and surprise) using LBP-TOP. Due to the limited number of frames in the VIS and NIR subsets, TIM was used to up-sample the frames and thereby avoid the problems that may arise when applying different parameters for LBP-TOP. A year later, Le et al. [12] used TIM to avoid the biases that can be caused by different frame lengths, equalising all frame lengths of the CASME II and SMIC samples (to 15 and 10 frames respectively).

In 2015, Huang et al. [10] proposed the Spatio-Temporal Local Binary Pattern with Integral Projection (STLBP-IP). They used integral projection to boost the capability of LBP-TOP, with experiments conducted on the CASME II and SMIC datasets using a Support Vector Machine (SVM) as the classifier. For a fair comparison with the method proposed by Li et al. [13], they used TIM to interpolate videos to 15 frames. In the same year, Li et al. [14] evaluated the performance of three feature types (LBP, HOG and histograms of image gradient orientation (HIGO)) on CASME II and SMIC. They used TIM on SMIC to test the effect of interpolation length, with frame lengths from 10 to 80 in fixed increments of 10 frames, and concluded that 10 frames achieved the best performance.

Using TIM to adjust the temporal settings is a well-established method in micro-expression analysis. Recently, studies have investigated image resolution for deep learning [9, 24]. However, there is a lack of thorough research further investigating the implications of spatial-temporal changes for micro-expression recognition.

3 Methodology

Due to the inconsistency of facial micro-expression dataset specifications, such as different resolutions and frame rates, we propose to test the effects of these variations on micro-expression recognition. CASME II [27] and SMIC [13] were chosen for these experiments because they have high frame rates, which makes it straightforward to test different frame rates, and because they contain a high number of micro-expression samples with higher intensity facial movements.

3.1 Frame rate subsampling

To subsample each micro-expression video clip to different frame rates, we use a Temporal Interpolation Model (TIM) [18]. This uses graph embedding to interpolate frames at random points within the micro-expression clips. This method allows for more statistically stable feature extraction when reducing the original frame rate of SMIC and CASME II.

A micro-expression video is seen as a set of images sampled along a curve, and a continuous function is created in a low-dimensional manifold by representing the video as a path graph \(P_{n}\) with \(n\) vertices. The vertices correspond to video frames and the edges to the adjacency matrix \(W \in \{0,1\}^{n \times n}\), with \(W_{i,j} = 1\) if \(|i-j| = 1\) and 0 otherwise. To complete the manifold embedding of the graph, \(P_{n}\) is mapped to a line that minimises the distance between connected vertices. If \(y = (y_{1}, y_{2}, \ldots, y_{n})^{T}\) is the map, \(y\) is obtained by minimising the following

$$ \sum\limits_{i,j}(y_{i}-y_{j})^{2} W_{i,j}, i,j = 1,2,\ldots,n $$
(1)

where this minimisation is equivalent to calculating the eigenvectors of the graph Laplacian of \(P_{n}\). The embedding is obtained from the eigenvectors \(\{y^{1}, \ldots, y^{n-1}\}\) of the graph Laplacian, which allows the \(k\)-th eigenvector \(y^{k}\) to be viewed as a set of points described by

$$ {f^{n}_{k}}(t) = \sin(\pi kt + \pi(n-k)/(2n)), t\in [1/n,1] $$
(2)

sampled at \(t = 1/n, 2/n, \ldots, 1\). The resulting curve is described by

$$ \mathcal{F}^{n}(t) = \left[\begin{array}{cc} {f^{n}_{1}}(t) \\ {f^{n}_{2}}(t) \\ {\vdots} \\ f^{n}_{n-1}(t) \end{array}\right] $$
(3)

This curve is then used to temporally interpolate images at random positions within a micro-expression. To find the correspondences for the curve \(\mathcal {F}^{n}\) within the image space, the image frames are mapped to points defined by \(\mathcal {F}^{n}(1/n),\mathcal {F}^{n}(2/n),\ldots ,\mathcal {F}^{n}(1)\). A linear extension of graph embedding [25] is then used to learn a transformation vector w that minimises

$$ \sum\limits_{i,j}(w^{T}x_{i} - w^{T} x_{j})^{2} W_{i,j}, i,j = 1,2,\ldots,n $$
(4)

where \(x_{i} = \xi_{i} - \bar{\xi}\) is a mean-removed vector and \(\xi_{i}\) is the vectorised image. The resulting eigenvalue problem was solved by He et al. [8]

$$ XLX^{T}w = \lambda^{\prime}XX^{T}w $$
(5)

by using the singular value decomposition with \(X = U{\Sigma}V^{T}\). A new image \(\xi\) can then be created using interpolation by

$$ \xi = UM\mathcal{F}^{n}(t) + \bar{\xi} $$
(6)

where M is a square matrix. There is an assumption that the vectorised images \(\xi_{i}\) are linearly independent, and the validity of the TIM method depends on this.

The interpolated frames of a micro-expression clip preserve the characteristics of the original movement well, whilst smoothing out the temporal profile. For the proposed method, we chose to interpolate the original frame rate of 200 fps down to 100 and 50 fps. The number of frames chosen was determined by

$$ \alpha = \gamma(\theta \in {\Omega}) $$
(7)

where \(\alpha\) is the number of frames chosen for subsampling, \(\gamma\) is the scaling factor and \(\theta \in {\Omega}\) is the original number of frames within the movement \({\Omega}\). For instance, for CASME II a scaling factor of 0.5 corresponds to 100 fps and 0.25 to 50 fps.
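
As an illustration of this subsampling step, the minimal Python sketch below (assuming NumPy and SciPy are available) computes \(\alpha = \gamma\theta\) from (7) and resamples a clip along the temporal axis. Plain linear interpolation stands in here for the full graph-embedding TIM of (2)-(6); the frame array, its shape and the \(\gamma\) values are hypothetical placeholders.

import numpy as np
from scipy.interpolate import interp1d

def subsample_clip(frames, gamma):
    # frames: (theta, H, W) grey-scale clip; gamma: scaling factor from (7).
    theta = frames.shape[0]                    # original number of frames
    alpha = max(2, int(round(gamma * theta)))  # alpha = gamma * theta
    t_old = np.linspace(0.0, 1.0, theta)
    t_new = np.linspace(0.0, 1.0, alpha)
    # Linear interpolation along time; a stand-in for the TIM curve F^n(t).
    return interp1d(t_old, frames.astype(np.float32), axis=0)(t_new)

# Example: a 200 fps clip resampled with gamma = 0.5 (100 fps) and
# gamma = 0.25 (50 fps); the clip itself is a random placeholder.
clip = np.random.rand(40, 280, 340).astype(np.float32)
clip_100fps = subsample_clip(clip, gamma=0.5)
clip_50fps = subsample_clip(clip, gamma=0.25)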

3.2 Resolution down-scale

CASME II was captured at 640×480 pixels in the raw section of the dataset. The pre-processed part of the dataset has approximately 280×340 pixels for the cropped facial area. For SMIC, a high speed (HS) camera set to 100 fps and a resolution of 640×480 was used to capture the expressions. The facial resolution for SMIC is 190×230 pixels, which is lower than that of CASME II. In order to test the effects of resolution variations on micro-expression recognition, we scaled both datasets (SMIC and CASME II) down to 75% and 50% of the original resolution, as shown in Fig. 1; a short sketch of this step follows the figure.

Fig. 1 Down-scaled resolution: 100% (the original resolution), 75% and 50% of the original resolution
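
A minimal sketch of the down-scaling step, assuming OpenCV is available; the file name and the cropped face size are placeholders.

import cv2

def downscale(frame, scale):
    # Resize a frame to `scale` (e.g. 0.75 or 0.5) of its original size.
    h, w = frame.shape[:2]
    return cv2.resize(frame, (int(w * scale), int(h * scale)),
                      interpolation=cv2.INTER_AREA)

face = cv2.imread("cropped_face.png", cv2.IMREAD_GRAYSCALE)  # placeholder image
face_75 = downscale(face, 0.75)   # 75% of the original resolution
face_50 = downscale(face, 0.50)   # 50% of the original resolution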

3.3 Feature representation

3.3.1 Local binary pattern from three orthogonal planes (LBP-TOP)

The LBP operator forms labels for each pixel in an image by thresholding the 3×3 neighbourhood of each pixel with the centre value. The result is a binary number in which each outside pixel is assigned a 1 if it is equal to or greater than the centre pixel, and a 0 otherwise. The number of possible labels is therefore \(2^{8} = 256\).

This operator was extended to use neighbourhoods of different sizes. Using a circular neighbourhood and bilinearly interpolating values at non-integer pixel coordinates allows any radius and number of sampling points in the neighbourhood. The grey-scale variance of the local neighbourhood can be used as a complementary contrast measure. The notation (P,R) will be used for pixel neighbourhoods, where P is the number of sampling points on a circle of radius R. Figure 2 shows an example of the LBP computation.

Fig. 2 LBP code calculation using the difference of the neighbourhood pixels around the centre

Uniform patterns can be used to reduce the length of the overall feature vector and implement a simple rotation-invariant descriptor. An LBP is uniform when the binary pattern contains at most two bitwise transitions from 0 to 1 or vice versa when the bit pattern is traversed circularly. So 00000000 (0 transitions), 01110000 (2 transitions) and 11001111 (2 transitions) are uniform, whereas the patterns 11001001 (4 transitions) and 01010010 (6 transitions) are not. In the computation of the LBP labels, uniform patterns are used so that there is a separate label for each uniform pattern and all the non-uniform patterns are labelled with a single label. For example, when using the (8,R) neighbourhood, there are a total of 256 patterns, 58 of which are uniform, which yields 59 different labels.
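
A short sketch of this uniform LBP labelling, assuming scikit-image is available; its 'nri_uniform' mode gives one label per uniform pattern plus a single label for all non-uniform patterns (59 labels for P = 8). The input image is a random placeholder.

import numpy as np
from skimage.feature import local_binary_pattern

def uniform_lbp_histogram(image, P=8, R=1):
    # One label per uniform pattern plus one label for all non-uniform patterns.
    codes = local_binary_pattern(image, P, R, method="nri_uniform")
    n_labels = P * (P - 1) + 3            # 59 labels when P = 8
    hist, _ = np.histogram(codes, bins=n_labels, range=(0, n_labels))
    return hist / max(hist.sum(), 1)      # normalised histogram

image = np.random.randint(0, 256, (230, 190), dtype=np.uint8)  # placeholder face
feature = uniform_lbp_histogram(image)                          # 59-dimensional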

Based on the LBP operator, LBP-TOP was first described as a texture descriptor [29] that used the XT and YT temporal planes in addition to the 2D XY spatial plane. Yan et al. [27] used this method to report initial findings on the CASME II dataset, and Pfister et al. [19] and Davison et al. [4] used it as a feature descriptor in their work.

Each region has the standard LBP operator applied [17] with c being the centre pixel and P being neighbouring pixels with a radius of R

$$ LBP_{P,R}=\sum\limits_{p = 0}^{P-1}s(g_{p}-g_{c})2^{p} $$
(8)

where \(g_{c}\) is the grey value of the centre pixel and \(g_{p}\) is the grey value of the \(p\)-th neighbouring pixel on the circle of radius R. The factor \(2^{p}\) assigns a weight to each neighbouring pixel location and is used to obtain the decimal value. The sign function that determines the binary value assigned to the pattern is calculated as

$$ s(\mathbf{A})= \left\{\begin{array}{lll} 1, & \text{if}\ \mathbf{A}\geq 0 \\ 0, & \text{if}\ \mathbf{A} < 0 \end{array}\right. $$
(9)

If the grey value of a neighbouring pixel is larger than or equal to that of c, the binary value is 1, otherwise it is 0. Figure 2 illustrates the sign function on a neighbourhood of pixels. After the LBP labels have been assigned to the image, the histogram can be calculated by

$$ H_{i} = \sum\limits_{x,y} I\{LBP_{l}(x,y) = i\},i = 0,\ldots, n - 1 $$
(10)

where \(LBP_{l}(x,y)\) is the LBP-labelled image. As this method incorporates temporal data, the histogram can be extended to be calculated over all three planes

$$ H_{i,j} = \sum\limits_{x,y,t} I\{LBP_{j}(x,y,t) = i\},i = 0,\ldots, n_{j} - 1 $$
(11)

where \(n_{j}\) is the number of labels produced by the LBP operator in the \(j\)-th plane, with \(j = 0,1,2\) representing the XY, XT and YT planes respectively. \(LBP_{j}(x,y,t)\) expresses the LBP code of the central pixel \((x,y,t)\) in the \(j\)-th plane. The function \(I\{A\}\) in (10) and (11) equals 1 when A is true and 0 otherwise, analogous to the sign function in (9). An illustration of the LBP-TOP histogram concatenation process can be seen in Fig. 3.

Fig. 3 LBP is calculated on every block in all three planes. The histograms of each plane are then concatenated to obtain the final LBP-TOP feature histogram

The neighbouring points and radius parameters (P,R) can be defined as \(P_{XY}, P_{XT}, P_{YT}, R_{X}, R_{Y}, R_{T}\) for each plane and axis, with the overall feature descriptor defined as \(LBPTOP_{P_{XY}, P_{XT}, P_{YT}, R_{X}, R_{Y}, R_{T}}\). We chose to use the best-case results from [27] and set the neighbouring points and radii parameters to \(LBPTOP_{4,4,4,1,1,4}\).
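
The following sketch illustrates the LBP-TOP pipeline of (11) and Fig. 3, assuming NumPy and scikit-image are available. For brevity it applies a single (P,R) = (4,1) to every plane and to the whole clip, rather than per-block regions and the per-axis radii of \(LBPTOP_{4,4,4,1,1,4}\) used in the experiments; the clip is a random placeholder.

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top(video, P=4, R=1):
    # video: (T, H, W) grey-scale clip. LBP histograms are computed on the
    # XY, XT and YT planes and concatenated into one feature vector.
    T, H, W = video.shape
    n_bins = 2 ** P
    planes = {
        "XY": [video[t, :, :] for t in range(T)],  # spatial texture
        "XT": [video[:, y, :] for y in range(H)],  # horizontal-temporal texture
        "YT": [video[:, :, x] for x in range(W)],  # vertical-temporal texture
    }
    feats = []
    for slices in planes.values():
        hist = np.zeros(n_bins)
        for s in slices:
            codes = local_binary_pattern(s, P, R, method="default")
            h, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
            hist += h
        feats.append(hist / hist.sum())            # normalise per plane
    return np.concatenate(feats)                   # 3 * 2^P dimensions

clip = np.random.randint(0, 256, (20, 64, 64), dtype=np.uint8)  # placeholder clip
feature = lbp_top(clip)                                          # 48-dim for P = 4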

3.3.2 3D histograms of oriented gradient (3DHOG)

3D histograms of oriented gradients (3DHOG) [11] is an adapted version of histograms of oriented gradients (HOG), originally designed for static images, made suitable for the dynamic texture of micro-expressions. A 2D image can be represented as \(I(x,y)\), and its gradient magnitude and orientation can be calculated [5, 21] as follows

$$ \begin{array}{lll} &m_{2D}(x,y) = \sqrt{ \delta I_{x} (x,y)^{2} + \delta I_{y} (x,y)^{2} } \\ &\theta_{2D}(x,y) = \tan^{-1} (\delta I_{y} (x,y) / \delta I_{x} (x,y) ) \end{array} $$
(12)

where \(\delta I_{x}(x,y)\) and \(\delta I_{y}(x,y)\) denote the image partial derivatives. In the 3D case of a video \(v(x,y,t)\), where \(t\) refers to time, the partial derivatives are first calculated along \(x\), \(y\) and \(t\). The magnitudes \(m_{xy}(x,y,t)\), \(m_{xt}(x,y,t)\) and \(m_{yt}(x,y,t)\) and orientations \(\theta_{xy}(x,y,t)\), \(\theta_{xt}(x,y,t)\) and \(\theta_{yt}(x,y,t)\) are then computed for each pair \((\delta v_{x},\delta v_{y})\), \((\delta v_{x},\delta v_{t})\) and \((\delta v_{y},\delta v_{t})\) using (13) [21]

$$ \begin{array}{lll} &m_{xy}(x,y,t) = \sqrt{ \delta v_{x} (x,y,t)^{2} + \delta v_{y} (x,y,t)^{2} } \\ &\theta_{xy}(x,y,t) = \tan^{-1} \left( \frac{\delta v_{x}(x,y,t)}{\delta v_{y}(x,y,t)}\right) \\ &m_{yt}(x,y,t) = \sqrt{ \delta v_{y} (x,y,t)^{2} + \delta v_{t} (x,y,t)^{2} } \\ &\theta_{yt}(x,y,t) = \tan^{-1} \left( \frac{\delta v_{y}(x,y,t)}{\delta v_{t}(x,y,t)}\right) \\ &m_{xt}(x,y,t) = \sqrt{ \delta v_{x} (x,y,t)^{2} + \delta v_{t} (x,y,t)^{2} } \\ &\theta_{xt}(x,y,t) = \tan^{-1} \left( \frac{\delta v_{x}(x,y,t)}{\delta v_{t}(x,y,t)}\right) \end{array} $$
(13)

Gradient orientation histograms are computed for every frame: the histogram for the pair \((\delta v_{x},\delta v_{y})\) contains 8 bins, while those for \((\delta v_{x},\delta v_{t})\) and \((\delta v_{y},\delta v_{t})\) contain 12 bins each. After computing the histograms for \((\delta v_{x},\delta v_{y})\), \((\delta v_{x},\delta v_{t})\) and \((\delta v_{y},\delta v_{t})\) for every frame, all histograms corresponding to the same frame are concatenated into one feature vector and normalised.
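
A minimal per-frame sketch of this variant, assuming NumPy; it follows (13), using 8 orientation bins for the \((\delta v_{x},\delta v_{y})\) pair and 12 bins for the other two pairs, with arctan2 standing in for the signed orientation. The clip and its size are placeholders, and the block structure of the full descriptor is omitted.

import numpy as np

def hog3d_per_frame(video, bins_xy=8, bins_xt=12, bins_yt=12):
    # video: (T, H, W) grey-scale clip.
    v = video.astype(np.float32)
    dvt, dvy, dvx = np.gradient(v)                 # partial derivatives along t, y, x
    pairs = [(dvx, dvy, bins_xy), (dvx, dvt, bins_xt), (dvy, dvt, bins_yt)]
    features = []
    for t in range(v.shape[0]):
        hists = []
        for a, b, n_bins in pairs:
            mag = np.sqrt(a[t] ** 2 + b[t] ** 2)   # magnitude, as in (13)
            ang = np.arctan2(b[t], a[t])           # orientation of the pair
            h, _ = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi), weights=mag)
            hists.append(h)
        frame_vec = np.concatenate(hists)          # 8 + 12 + 12 = 32 bins per frame
        features.append(frame_vec / max(np.linalg.norm(frame_vec), 1e-8))
    return np.concatenate(features)

clip = np.random.rand(20, 64, 64).astype(np.float32)  # placeholder clip
feature = hog3d_per_frame(clip)                        # 20 frames * 32 bins each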

3.3.3 Histogram of oriented optical flow (HOOF)

HOOF [2] features are computed by calculating the optical flow for each frame and binning the flow vectors according to their orientation, weighted by their magnitude. Each optical flow vector consists of an angle and a magnitude and can be represented as v = [x,y]T with direction \(\theta = \tan^{-1}(\frac{y}{x})\). Figure 4 illustrates how a HOOF feature with 4 bins is built; a minimal sketch follows the figure.

Fig. 4 Building a HOOF feature with 4 bins
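
A minimal sketch of the binning in Fig. 4, assuming OpenCV and NumPy are available. Farneback dense optical flow is used here purely for illustration (the experiments below use Horn-Schunck), and the two frames are random placeholders.

import cv2
import numpy as np

def hoof(prev_frame, next_frame, n_bins=4):
    # Dense optical flow between two frames, binned by orientation and
    # weighted by magnitude, then normalised to sum to one.
    flow = cv2.calcOpticalFlowFarneback(prev_frame, next_frame, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])   # magnitude, angle
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi), weights=mag)
    return hist / max(hist.sum(), 1e-8)

frame_0 = np.random.randint(0, 256, (230, 190), dtype=np.uint8)  # placeholder
frame_1 = np.random.randint(0, 256, (230, 190), dtype=np.uint8)  # placeholder
h = hoof(frame_0, frame_1, n_bins=4)   # one 4-bin HOOF vector per frame pair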

One of the HOOF-based methods is the Main Directional Mean Optical Flow (MDMO) [15], an ROI-based normalised statistical feature. Discriminative Response Map Fitting (DRMF) [1] is used to locate 68 facial feature points, of which 66 are used (the two inner corner points of the lips are ignored) to normalise the faces based on the first frame. The normalised face is partitioned into 36 regions of interest (ROIs) determined by the 66 feature points and partially based on FACS. Using optical flow, the change in intensity of a pixel between two frames over time is detected as the motion of objects. The change in intensity can be represented by

$$ I(x,y,t) = I(x+ {\Delta} x,y+ {\Delta} y,t+ {\Delta} t) $$
(14)

the optical flow value of a pixel between two frames at time t in Euclidean coordinates can be represented as a two-dimensional vector

$$ [{V^{t}_{x}},{V^{t}_{y}}]^{T} $$
(15)

To compute the MDMO feature for micro-expression recognition, the Euclidean coordinates \([{V^{t}_{x}},{V^{t}_{y}}]^{T}\) are converted into polar coordinates (ρi,𝜃i), where ρi is the magnitude and 𝜃i the orientation of the optical flow vector. A histogram of oriented optical flow (HOOF) [2] is computed for each ROI \({R^{k}_{i}}\) in each frame, where i is the index of the frame and k the index of the ROI, and the optical flow is classified into 8 bins. A mean vector \(\overline {u}^{k}_{i} = (\overline {\rho }^{k}_{i},\overline {\theta }^{k}_{i})\) is computed from the optical flow vectors in the bin with the maximum count, and \(\overline {\theta }^{k}_{i}\) is called the main direction. A feature vector Ψi is then built as \({\Psi }_{i} = (\overline {u}^{1}_{i}, \overline {u}^{2}_{i},\ldots,\overline {u}^{36}_{i})\), giving a feature dimension of 36×2 = 72, where 36 is the number of ROIs. The micro-expression is then represented by the concatenated feature vectors of each frame, Γ = (Ψ12,...,Ψn), where n is the number of frames in the micro-expression. Finally, a normalisation is performed in Cartesian coordinates before converting back to polar coordinates, represented by

$$ \overline{\Psi} = [(\overline{ \rho_{1}},\overline{ \theta_{1}})^{T},(\overline{ \rho_{2}},\overline{ \theta_{2}})^{T},...,(\overline{ \rho_{36}},\overline{ \theta_{36}})^{T}] $$
(16)

In this experiment, block-based HOOF was used and the parameters were set to pRow = 6, pCol = 6 and pFrames = 6, where pRow, pCol and pFrames are the sub-block sizes in pixels for the row, column and frame dimensions respectively. The number of blocks was set to 3×3 spatial blocks and 2 temporal blocks, the Horn-Schunck method was used to compute the optical flow, and the final parameter is the quantisation, which was set to 8 orientations.

3.4 Classification

First proposed by Cortes and Vapnik [3], a Support Vector Machine (SVM) attempts to find a linear decision surface (hyperplane) that separates the classes with the largest distance to the support vectors (the data points of each class that lie closest to the decision surface). If a linear surface does not exist, an SVM can use kernel functions to map the data into a higher-dimensional space where a decision surface can be found.

We use the Sequential Minimal Optimization (SMO) [20] algorithm to train the SVMs. SMO breaks large quadratic programming problems down into a series of the smallest possible sub-problems, which are solved analytically, avoiding a time-consuming numerical quadratic programming optimisation as an inner loop. SMO can also handle large training sets and is one of the computationally fastest methods for evaluating linear SVMs.
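
For illustration, a minimal sketch of the classifier, assuming scikit-learn is available; its SVC is backed by libsvm, whose solver is an SMO-style decomposition method. The feature matrix, labels and dimensions are placeholders.

import numpy as np
from sklearn.svm import SVC

X = np.random.rand(100, 48)          # placeholder features, one row per clip
y = np.random.randint(0, 3, 100)     # placeholder labels (three classes)

# A linear kernel searches directly for a separating hyperplane; an RBF
# kernel maps the data to a higher-dimensional space first.
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)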

3.5 Evaluation method

We report the results as the F-Measure (or F1-Score) due to the imbalanced datasets [12]. Using the conventional accuracy measure may result in a bias towards classes with a large number of samples, hence overestimating the capability of the method. The F-Measure is micro-averaged across the whole dataset and is computed from the total true positives, false negatives and false positives across the 10-fold cross-validation and/or leave-one-subject-out (LOSO) folds. The next section discusses and compares the results of our comparative study (Table 1).
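
A sketch of how the micro-averaged F-Measure can be pooled over LOSO folds, assuming scikit-learn; the features, labels and subject identifiers are placeholders.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import f1_score

X = np.random.rand(100, 48)               # placeholder features
y = np.random.randint(0, 3, 100)          # placeholder labels
subjects = np.random.randint(0, 20, 100)  # placeholder subject ID per clip

y_true, y_pred = [], []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
    clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
    y_true.extend(y[test_idx])
    y_pred.extend(clf.predict(X[test_idx]))

# Micro-averaged F-Measure pooled over all folds, i.e. computed from the
# total true positives, false positives and false negatives.
print(f1_score(y_true, y_pred, average="micro"))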

Table 1 The performance of the state-of-the-art methods on SMIC and CASME II

4 Results and discussion

We have conducted a comprehensive evaluation of the performance of three popular feature representations for micro-expression recognition using different frame rates and resolutions. We observed that the top performers varied across the different categories.

Table 2 summarises the results of the two validation methods, 10-fold cross-validation and leave-one-subject-out (LOSO), on CASME II. For 10-fold cross-validation, the best result was LBP-TOP with 200 fps and 100% of the original resolution, which achieved an F-Measure of 0.637. However, when validated with LOSO, the best result was HOOF with 200 fps and 50% resolution, which achieved an F-Measure of 0.439.

Table 2 The results of CASME II for both 10-fold cross-validation and leave-one-subject-out for the 3DHOG, HOOF and LBP-TOP features with varying resolutions and frame rates. The best result for each metric is in bold

Table 3 summarises the results of 10-fold cross-validation and LOSO on SMIC. For 10-fold cross-validation, the best result was 3DHOG with 50 fps and 75% resolution, which achieved an F-Measure of 0.624. When validated with LOSO, the best result was HOOF with 100 fps and 75% resolution, which achieved an F-Measure of 0.614.

Table 3 The results of SMIC for both the 10-fold cross-validation and leave-one-subject-out for the 3DHOG, HOOF and LBP-TOP features with varying resolutions and frame rates. The best result for each metric is in bold

4.1 Comparison of the state-of-the-art methods

Table 1 compares the performance of the state-of-the-art methods on SMIC and CASME II. The majority of the state-of-the-art results were based on the original resolution and frame rate; therefore, this comparison was based on the original properties of the datasets. In addition, the majority of previous works reported their results using accuracy as the performance metric, so we compared these methods based on accuracy. From our observation, although Li et al. [14] and Liu et al. [15] achieved good accuracy on the SMIC dataset, their performance on CASME II was comparable to that of the basic HOOF, LBP-TOP and 3DHOG features.

4.2 Temporal analysis

Since LOSO is a better approach to performance measurement, we analyse its performance further for the temporal analysis. As shown in Figs. 5 and 6, we observed that LBP-TOP and HOOF performed better at high frame rates on CASME II and SMIC. For the majority of the resolutions, the F-Measure decreased or stayed level as the frame rate dropped. In contrast, when comparing the performance of 3DHOG at different frame rates, the F-Measure increased at the lower frame rate (50 fps for both CASME II and SMIC), as illustrated in Fig. 7.

Fig. 5 Comparison of F-Measure using LBP-TOP with varying resolutions and frame rates. The graph shows that the best result uses a high frame rate and high resolution when evaluated with LOSO validation on CASME II (100%, 75% and 50%) and SMIC (HS100%, HS75% and HS50%)

Fig. 6 Comparison of F-Measure using HOOF with varying resolutions and frame rates. The graph shows that the best result uses a high frame rate and lower resolution when evaluated with LOSO validation on CASME II (100%, 75% and 50%) and SMIC (HS100%, HS75% and HS50%)

Fig. 7 Comparison of F-Measure using 3DHOG with varying resolutions and frame rates. The graph shows that the overall best result is achieved using a low frame rate when evaluated with LOSO validation on CASME II (100%, 75% and 50%) and SMIC (HS100%, HS75% and HS50%)

4.3 Spatial analysis

Regarding the effect of varying resolutions on micro-expression recognition, there is also an inconsistency in the optimal resolution across the different features. We found that LBP-TOP achieved better results when the full resolution of CASME II and SMIC was used. For HOOF, a low resolution (50% of the original) achieved the best results on CASME II, while the mid resolution (75% of the original) achieved the best results on SMIC. This might be due to the lower facial resolution of SMIC compared with CASME II. For 3DHOG, the result of the spatial analysis is inconclusive: as illustrated in Fig. 7, the results vary, but the majority of the results were better at a low frame rate.

4.4 Features analysis

Taking into account the effect of resolution and frame rate on micro-expression recognition when using different feature descriptors, we observed that LBP-TOP, as an example of a texture-based feature, performed better at high resolution and high frame rate, as illustrated in Fig. 5. Texture-based features depend on pixels to extract micro-expression information, so more pixels (higher resolution) means more information, and a high frame rate increases the number of pixels in the temporal plane; this is why LBP-TOP performs better at high settings. Gradient-based features such as 3DHOG did not need a high specification to achieve good results in these experiments, as shown in Fig. 7, with the best result achieved at 50 fps across the different resolutions. On the other hand, HOOF features, which are optical flow-based, performed better in the high frame rate scenario, as shown in Fig. 6. Whilst there was no considerable need for high resolution, we found that the best result was recorded at 50% of the original resolution. In contrast to texture-based features, optical flow extracts information by calculating the motion between frames, so it depends on the temporal dimension more than the spatial one; this is why it benefits from a high frame rate more than from a high resolution.

4.5 Summary

The LOSO method of evaluation shows a decrease in accuracy compared with 10-fold cross-validation. While both have been used in previous research, the difference shows the challenging nature of finding the correct way to determine a method's success. Further, the lower performance seen overall with LOSO can be attributed to the fairer nature of testing on a subject completely omitted from the training stage.

The frame rate is certainly important, with drops in performance seen as it is decreased. However, obtaining equipment and data storage for 200 fps recording can be difficult. A good trade-off could be to reduce the frame rate but keep the best performing resolution and feature.

LBP-TOP and 3DHOG are relatively simple feature types, with HOOF being somewhat more informative as it is based on temporal data. As micro-expression movements can look unique even when the same muscles are used, simple features tend to pick out only the obvious changes and struggle to model how a real micro-expression differs from noise.

5 Conclusion

Research on automated facial micro-expression recognition using machine learning has made good progress in recent years. A number of promising methods based on texture features, gradient features and optical flow features have been proposed. Many datasets have been created, but the lack of standardisation remains a great challenge. This paper has therefore focused on comparing and discussing the effects of different frame rates and resolutions.

Three of the most popular feature descriptors were used to represent micro-expressions. These features differ in their nature, which makes them very suitable for testing the effect of frame rate and resolution. LBP-TOP was used as an example of texture-based features, 3DHOG to represent gradient-based features and HOOF to represent optical flow-based features. We found that LBP-TOP performed better at high resolution and high frame rate, unlike 3DHOG, which performed better at low specifications; a high frame rate improved the recognition rate when HOOF features were used, but increasing the resolution did not give a significant improvement. Based on the results obtained, we found that micro-expression recognition with different dataset settings is influenced by the features used, as the different features behave differently across the various settings. An analysis of the 10 folds of the 10-fold cross-validation was also performed: the classes are well distributed across the folds with a low standard deviation, which implies that sampling variations are minimal, as shown in the Appendix. Based on this analysis, the variations in the results were caused by the spatial and temporal changes.

The accuracy of many state-of-the-art methods is still too low for them to be deployed effectively in a real-world environment. We provide important insights for researchers in this field to consider these settings when conducting new experiments in the future. Progress in micro-expression recognition research can aid the paradigm shift in affective computing for real-world applications in psychology, health studies and security control. Future work will investigate the SAMM dataset, for which the authors [7] recently introduced emotional classes and objective classes. This dataset has been used in the micro-expressions grand challenge [28].