2.1.2 Approaches.
A popular representation is a
Vector Space model (
VSM) [Salton et al.
1975], which projects data into a vector
\(v \in \mathbb {R}^d\) that can be manipulated with distance measures and linear algebra constructs [Basseville
1989; Cha
2007] to process/compare instances. The
Bag of Words/Features (
BoW/BoF) [Salton and Buckley
1988; Sebastiani
2002] is a popular VSM that produces count-based weights for points/regions of interest in the input. BoF is a popular baseline for text, where semantically salient terms are easily identifiable and delineated by syntax and grammar. Common weighting schemes include boolean, term, and document frequency (BF, TF, DF), denoting presence, instance, and collection-level counts of single or n-tuples of terms (n-grams) in the text. The work in Badawi and Altınçay [
2014] utilizes BoW features, along with an investigation on term weighting and selection in the binary classification of articles and biomedical documents. “Termset features” are introduced, i.e., tuples of document terms (e.g., words) where the feature activates if either or both terms are detected. In Zhang and Wu [
2015], the authors reduce sparsity by building an extension library, using bigram conditional probabilities in the text sequence and word-to-category similarity, based on word counts. Given a text, additional features are inserted with respect to their similarity score to the original feature set and introduced threshold heuristics. The extended feature set is then used to build the BoF representation.
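As a minimal illustration of count-based BoW weighting, the sketch below maps a token sequence to a term-frequency vector over a fixed vocabulary (the vocabulary and sample text are hypothetical):

```python
from collections import Counter

def bow_vector(tokens, vocabulary):
    # Term-frequency (TF) weights over a fixed vocabulary;
    # boolean (BF) weights would use min(count, 1) instead.
    counts = Counter(tokens)
    return [counts.get(term, 0) for term in vocabulary]

vocab = ["cat", "dog", "fish"]
vec = bow_vector("the cat saw the dog and the cat".split(), vocab)
```

Out-of-vocabulary tokens are simply dropped; n-gram variants would count tuples of adjacent terms instead of single tokens.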
The Term frequency-inverse document frequency (
TFIDF) weighting scheme [Salton and Buckley
1988] scales term counts by an inverse DF weight, reducing the importance of tokens that occur too often in the document collection and behave like stopwords. In Trstenjak et al. [
2014], the authors build a term-document frequency matrix to map articles to word TFIDF vectors, applying log-scaling and renormalization to improve robustness to varying document lengths. The authors in Sowmya et al. [
2016] apply TFIDF to Wikipedia articles; document weights are pooled to category-level counts in order to build “centroid” vectors for each class, subsequently normalized by intra-class DF scores. In Thirumoorthy and Muneeswaran [
2021], the authors propose feature selection schemes with respect to term and document frequency scores within and across classes. They use an evolutionary method, rich/poor population-based methods [Moosavi and Bardsiri
2019], and a classification-based fitness objective to optimize the final feature subset.
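A bare-bones TFIDF computation can be sketched as follows (documents are illustrative token lists; production systems typically also log-scale the TF term and smooth the IDF denominator):

```python
import math

def tfidf_vectors(docs):
    """TF-IDF per document: raw term count scaled by log(N / DF),
    so terms appearing in every document receive zero weight."""
    n = len(docs)
    vocab = {t for d in docs for t in d}
    df = {t: sum(t in d for d in docs) for t in vocab}
    return [{t: d.count(t) * math.log(n / df[t]) for t in set(d)} for d in docs]

docs = [["apple", "banana"], ["apple", "cherry"]]
weights = tfidf_vectors(docs)
```

Here "apple" occurs in both documents (DF = N), so its weight collapses to zero, while document-specific terms keep a positive score.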
Contrary to text, BoF application in the visual domain is not straightforward: images lack clear semantic boundaries (“visual terms” are hard to delineate) and pixel-level approaches are intractable for most real-world tasks. Thus, visual LLTMs employ methods that apply (a)
detection, i.e. locate regions of interest in the image [Tuytelaars and Mikolajczyk
2008], and (b)
description, which applies low-level templates for building a representation for each such detected region.
A popular method is
Scale Invariant Feature Transform (
SIFT) [Lowe
2004; Younes et al.
2012]. It describes each keypoint by normalized histograms of gradient orientations computed over the pixel intensities of a surrounding patch; these histograms are largely invariant to shifts in illumination, viewpoint, rotation, and scale. SIFT is adopted for a variety of recognition tasks, e.g., detection/description in [Amato et al.
2015] for a local feature-based landmark classification along with other descriptors [Bay et al.
2008], while a similar approach extracts SIFT from a dense regular grid with patch overlaps [Bian et al.
2017].
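The core building block of a SIFT descriptor cell is a magnitude-weighted, normalized histogram of gradient orientations; a minimal single-cell sketch is shown below (the full descriptor additionally pools a 4x4 grid of such cells and normalizes for keypoint scale and dominant orientation):

```python
import numpy as np

def orientation_histogram(patch, bins=8):
    """Magnitude-weighted histogram of gradient orientations, L2-normalized."""
    gy, gx = np.gradient(patch.astype(float))      # intensity gradients
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)    # orientation in [0, 2*pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# A horizontal intensity ramp has all gradients pointing along +x (angle 0).
ramp = np.tile(np.arange(8.0), (8, 1))
h = orientation_histogram(ramp)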
Other methods extract fine-grained information suitable for texture, such as Gabor features [Mehrotra et al.
1992; Manjunath and Ma
1996], that involve the application of Gabor filterbanks. Gabor filters are applied on separate color channels in Risojević et al. [
2011], using mean and stdev values over multiple scales and orientations, along with spatial envelope GIST features [Oliva and Torralba
2001]. Another approach is
Local Binary Patterns (
LBP) [Ojala et al.
2002], that extracts fine-grained, rotation-invariant binary-value histograms, via simple pixel-level comparisons between a center and its radial neighbors. LBP has been evaluated with different tasks, global/local contexts, resolutions, and focal configurations [Prasad and Mary
2019; Bian et al.
2017]. The authors in Ningtyas et al. [
2022] utilize LBP along with
Gray Level Co-occurrence Matrix features (
GLCM) to capture texture information for leaf classification.
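The basic LBP operation reduces to thresholding each pixel's 3x3 neighborhood against its center and packing the comparisons into an 8-bit code; a minimal sketch is given below (the rotation-invariant and uniform-pattern mappings of Ojala et al. are omitted):

```python
import numpy as np

def lbp_image(img):
    """Basic 3x3 LBP: the 8 neighbors of each interior pixel are thresholded
    against the center and packed into an 8-bit code."""
    img = img.astype(float)
    c = img[1:-1, 1:-1]                                 # interior (center) pixels
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]        # clockwise neighbors
    code = np.zeros(c.shape, dtype=int)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code |= (neigh >= c).astype(int) << bit
    return code

codes = lbp_image(np.ones((5, 5)))   # flat region: every neighbor >= center
```

The final representation is typically a histogram of these codes over the image or a local window, which is what makes the feature a texture descriptor.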
ORB [Rublee et al.
2011] improves upon BRIEF [Calonder et al.
2010] features, adding “steering” mechanisms for rotation invariance and noise resistance, and is used in tasks like monument classification [Amato et al.
2015]. Further methods include
Histograms of gradients (
HoG) [Dalal and Triggs
2005], which computes distributions of pixel intensity gradient orientations, BRISK [Leutenegger et al.
2011], which produces pairwise intensity comparisons as binary features, and KAZE [Alcantarilla et al.
2012], which applies nonlinear diffusion filtering for smoothing and multiscale operation; these are used for feature extraction/detection in various tasks [Prasad and Mary
2019; Amato et al.
2015].
Audio data exhibits semantic ambiguity similar to that of the visual domain; however, given its one-dimensional, temporal structure, LLTM methods often apply simple features within time-segmented frames or over the signal as a whole.
A popular method is to capture statistical signal properties (e.g., mean, variance, extrema) and utilize the responses as feature vectors; for example, in Maršík et al. [
2014] the authors consider RMS amplitudes as audio volume estimates, along with a self-similarity computation that counts the similar segments in a music piece. Further, in Zahid et al. [
2015], signal sign change (zero crossing rate, ZCR) averages, short-time signal energy and periodicity analysis features are concatenated across audio window frames and used toward capturing repeated acoustic patterns.
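Such frame-level statistics can be sketched as follows, computing ZCR and short-time energy over overlapping windows (the sampling rate, frame length, and hop size are illustrative):

```python
import numpy as np

def frame_features(signal, frame_len=256, hop=128):
    """Zero-crossing rate and short-time energy per overlapping frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)  # fraction of sign flips
        energy = np.mean(frame ** 2)                        # mean squared amplitude
        feats.append((zcr, energy))
    return np.array(feats)

t = np.arange(2048) / 8000.0                   # 2048 samples at an assumed 8 kHz
feats = frame_features(np.sin(2 * np.pi * 100 * t))
```

Concatenating such per-frame tuples across the clip yields a fixed-rate feature sequence; a pure tone gives a low, stable ZCR, whereas noisy segments push it toward its maximum.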
Another approach switches to the frequency domain to mine low-level features from audio spectra. This is often achieved via
Fast Fourier Transform (
FFT) [Bracewell and Bracewell
1986]; FFT maps time-domain signals to frequency spectrograms via Fourier Analysis on small overlapping time windows. This is used in works like [Laurier et al.
2009], where the authors extract spectral statistic estimates like kurtosis, skewness, flatness, and flux for music emotion classification.
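The windowed-FFT step can be sketched as below: the signal is cut into overlapping Hann-windowed frames and each frame is mapped to a magnitude spectrum (frame and hop sizes are illustrative):

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: Hann-windowed FFT over overlapping frames."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))   # (n_frames, frame_len//2 + 1)

# A tone completing 8 cycles per 256-sample frame peaks at frequency bin 8.
tone = np.sin(2 * np.pi * 8 * np.arange(1024) / 256)
spec = spectrogram(tone)
```

Spectral statistics such as kurtosis, skewness, flatness, and flux are then computed per frame (or as differences between consecutive frames, in the case of flux) over these spectra.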
Mel-frequency cepstral coefficients (
MFCC) [Slaney
1998] involve multiple transformation, scaling, and normalization steps to describe the short-term power spectrum of an audio signal. They are used in studies like [Maršík et al.
2014], using mean and covariance statistics and [Zahid et al.
2015], along with spectral flux scores. Furthermore,
Gammatone Cepstral Coefficients (
GTCC) is a biologically-inspired modification to MFCC based on Gammatone filter functions with equivalent rectangular bandwidth bands; it is proposed and utilized in Valero and Alias [
2012].
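The MFCC transformation chain for a single frame can be sketched as power spectrum, triangular mel filterbank, logarithm, and DCT-II; the version below is a simplified illustration (pre-emphasis, liftering, and delta features used by real toolkits are omitted, and the sampling rate and filter counts are illustrative):

```python
import numpy as np

def mfcc(frame, sr=16000, n_mels=26, n_coef=13):
    """Simplified single-frame MFCC: power spectrum -> mel filterbank -> log -> DCT-II."""
    n = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hanning(n))) ** 2   # power spectrum
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    # Filter edges evenly spaced on the mel scale, mapped back to FFT bins.
    edges = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_mels, len(spec)))
    for i in range(n_mels):                                  # triangular filters
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    logmel = np.log(fbank @ spec + 1e-10)
    k = np.arange(n_coef)[:, None]
    dct = np.cos(np.pi * k * (np.arange(n_mels) + 0.5) / n_mels)  # DCT-II basis
    return dct @ logmel

rng = np.random.default_rng(0)
coefs = mfcc(rng.standard_normal(512))
```

GTCC follows the same pipeline but replaces the triangular mel filters with Gammatone filters on equivalent rectangular bandwidth bands.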
Some features target musicality/rhythm-based information by using frequency binning and temporal progression statistics, or model psycho-acoustic phenomena by considering the operation of the human hearing system. For instance, timbral, perceptual, and tonal information via dissonance and loudness measures is extracted in [Laurier et al.
2009], along with “danceability”,
beats per minute (
BPM), ZCR, and chord change features. The authors in Maršík et al. [
2014] use musical LLTM features such as BPM, probabilistic estimates of chord root transitions, and musical keys. Furthermore, the work in Meister et al. [
2022] uses an engineered feature bank composed of temporal, spectral, cepstral and tonal responses, which are ranked and evaluated for COVID patient classification.
In summary, this section showcased approaches that utilize LLTM features for classification; we now move on to methods that transform, manipulate and aggregate this information towards improving performance and tractability.