
Content-based and Knowledge-enriched Representations for Classification Across Modalities: A Survey

Published: 17 July 2023

Abstract

This survey documents representation approaches for classification across different modalities, from purely content-based methods to techniques utilizing external sources of structured knowledge. We present studies related to three paradigms used for representation, namely (a) low-level template-matching methods, (b) aggregation-based approaches, and (c) deep representation learning systems. We then describe existing resources of structured knowledge and elaborate on the need for enriching representations with such information. Approaches that utilize knowledge resources are presented next, organized with respect to how external information is exploited, i.e., (a) input enrichment and modification, (b) knowledge-based refinement, and (c) end-to-end knowledge-aware systems. We subsequently provide a high-level discussion to summarize and compare strengths/weaknesses of the proposed representation/enrichment paradigms, and conclude the survey with an overview of relevant research findings and possible directions for future work.

1 Introduction

A vast amount of diverse data is prevalent in our digital media ecosystem. Large quantities of text, images, audio, and video are created and circulated on the Internet, from journalism websites, blogs, and academia-related content, to fiction literature portals and social media. Efficient management, browsing, and consumption of this content depend on accurate discovery and search operations, applied to massive data collections. To this end, developing robust classification methods for tagging and categorization has been crucial in the era of big data. Classification systems [Aggarwal 2015] automatically assign labels to new instances, facilitating efficient organization and categorization of large volumes of data with little to no human involvement. They are applied to a wide array of commercial, industrial, and artistic applications, from phishing and spam detection [Whittaker et al. 2010; Subramaniam et al. 2010], medical imaging, style transfer, and optical character recognition [Razzak et al. 2018; Jing et al. 2019; Singh 2013], to speaker diarization and genre classification [Anguera et al. 2012; Sturm 2012]. The ubiquitous adoption of such models enables the automation of these tasks, avoiding the prohibitive cost of manual human effort at such a large scale.
Alongside the rapid growth of available content, the systematic collection, structured formatting, and storage of knowledge has accompanied the growth of artificial intelligence research and the development of commercial AI-powered solutions. As a result, a wealth of curated, high-quality information is available and applicable to classification systems and machine learning (ML) tasks, ranging from fine-grained linguistic and audiovisual information to high-level conceptual ontologies [Ganitkevitch et al. 2013; Miller 1995; Deng et al. 2009; Gemmeke et al. 2017]. However, the utilization of such resources for broad-range solutions has been lacking.
An important classification component is building the data representation, i.e., mapping real-world objects fed to the system (e.g., e-mails, photographs, recordings) into a feature collection that can be processed by the ML system [Storcheus et al. 2015]. Since this mapping is often the sole information source for subsequent components, producing high-quality representations is a crucial part of efficient classification. This often requires the construction of representations that go beyond low-level local pattern matching (e.g., token/ngram frequencies in text, local template matching in audiovisual content) and encapsulate complex, conceptual information [Bengio 2011; 2009]. A lot of research effort has pursued this via handcrafted feature engineering/transformations, as well as automated methods for content-based representation learning [Zheng and Casari 2018; Bengio et al. 2013]. However, engineered approaches tend to rely on empirical expert knowledge specific to the modality, data, domain, and/or task at hand, and are often based on rigid heuristics (e.g., text token bag hyperparameters, visual descriptor templates, or audio signal/temporal/segmentation-based measures). On the other hand, representation learning methods operate on vast amounts of data, require considerable computational resources and energy [Strubell et al. 2019], and rely heavily on the distributional hypothesis [Harris 1954] to arrive at sets of features from the bottom up that hopefully encapsulate useful semantics.
In light of these limitations, this article investigates the utilization of high-level information (e.g., conceptual, semantic, relational) encoded in knowledge resources in the classification pipeline, with a focus on enriching data representations fed to the predictive algorithm. We provide a brief survey of representation methods used for classification for different modalities, along with representation enrichment approaches from external information sources.

1.1 The Structure of a Classification Problem

Classification entails assigning a meaningful category (i.e., a semantic label/class) to a piece of data. In ML, a typical classification problem consists of the following components [Sebastiani 2002].
(1)
Inputs: A labeled dataset \(D = \{(d_i, L_i)\}\), \(i = 1, 2, \dots, N\), \(L_i \subseteq L\), where \(d_i\) is an input instance, \(L\) is the set of all available dataset labels, and \(L_i\) are the labels associated with the \(i\)-th instance. A label \(l \in L\) is a semantic tag related to the content of \(d_i\), either directly (e.g., topic or sentiment classification) or indirectly (e.g., related to content generation, such as in authorship attribution and style classification tasks). Classification problems can be characterized by the number of candidate instance labels \(\left|L\right|\) – e.g., binary (two available labels that may map to a “yes” or “no” answer, e.g., hate speech detection, medical image disease prediction, speech pathology detection) and multiclass (more than two candidate labels, e.g., text topic classification, visual object recognition, and music genre classification). The maximum number of label annotations per instance \(M = \max_{i=1\dots N} \left| L_{i} \right|\) renders the task single-label (\(M=1\), i.e., one label allowed per instance – e.g., sentiment analysis, handwritten digit recognition, and speaker diarization) or multi-label (\(M > 1\), i.e., multiple possible labels per instance – e.g., document topic classification, visual object recognition, and music genre classification).
(2)
Preprocessing: preliminary data operations [García et al. 2015; Camastra and Vinciarelli 2015], e.g.:
Data Augmentation: when dataset-related limitations impact performance (e.g., sample scarcity, label imbalance), operations that modify the input collection may be employed, such as data augmentation and under/oversampling [Van Dyk and Meng 2001; He and Ma 2013; Shorten and Khoshgoftaar 2019].
Data cleaning [Chu et al. 2016] – e.g., handling undesirable input patterns (non-alphanumerics/whitespace in text, image/audio denoising, frequency filtering, and so on) [Bhattacharyya 2011; Burges et al. 2002].
Filtering – e.g., stopword handling, stemming, lemmatization, and POS extraction in text [Silva and Ribeiro 2003; Jivani 2011; Ratnaparkhi 1996], normalization/equalization in images/audio [Pei and Lin 1995; Bhattacharyya 2011; Bai and Chen 2007], as well as modality-shifting operations [Knight et al. 2020].
Segmentation – e.g., word/sentence splitting and tokenization in text [Vijayarani et al. 2015], region/color segmentation in images [Bhattacharyya 2011], and temporal/spectral segmentation and source separation in audio [Zaitoun and Aqel 2015; Chang et al. 2021; Mesgarani et al. 2006; Theodorou et al. 2014].
(3)
Representation: A representation mapping [Martinčić-Ipšić et al. 2019] converts preprocessed inputs into a suitable format, with respect to the computational costs of subsequent processing and the efficiency of learning [Bengio 2009]. This is often realized through mappings to a vector space [Salton et al. 1975], in which objects are represented by ordered sets of attributes deemed useful for the task by algorithms or human engineers. Although different approaches have been explored [Sonawane and Kulkarni 2014; Jolion and Kropatsch 2012; Giannakopoulos et al. 2012], vector-based representations of data are both straightforward and allow exploiting the rich tradition, advances, and tools of vector algebra and calculus, as other disciplines have done before (e.g., mathematics, physics, and multiple engineering fields). As a result, vector formats are either required by or compatible with the vast majority of data analysis and ML approaches today.
(4)
Classification: finally, a learning algorithm is trained to solve the classification task: given a representation \(x_{i}\) of the input instance \(d_i \in D\), produce an optimal approximation of \(L_i\) with respect to a performance measure, ensuring good generalization ability on unseen data [Mohri et al. 2018; Sokolova and Lapalme 2009].
With this framework in mind, we will explore knowledge utilization methods for improving classification tasks.
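To make these four components concrete, the following is a minimal sketch of a binary, single-label text pipeline, assuming scikit-learn; the documents, labels, and step names are illustrative stand-ins, not a reference implementation.

```python
# Minimal sketch of the classification pipeline of Section 1.1, assuming
# scikit-learn; the toy spam-detection data and labels are illustrative.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# (1) Inputs: instances d_i paired with labels L_i (binary, single-label)
docs = ["cheap pills buy now", "meeting agenda attached",
        "win a free prize today", "quarterly report draft"]
labels = ["spam", "ham", "spam", "ham"]

pipeline = Pipeline([
    # (2)+(3) Preprocessing and representation: lowercasing, tokenization,
    # stopword filtering, and the TFIDF vector space mapping
    ("representation", TfidfVectorizer(lowercase=True, stop_words="english")),
    # (4) Classification: a linear SVM trained on the resulting vectors
    ("classifier", LinearSVC()),
])
pipeline.fit(docs, labels)
print(pipeline.predict(["free pills now"]))  # expected: ['spam']
```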

1.2 Injecting Knowledge into Classification

Multiple sources of structured human knowledge exist that can be utilized in classification applications. This article explores avenues for introducing such information into representations for the classification task, toward improving categorization performance, as these have been discussed in the existing literature. Our specific contributions include:
A novel, complexity-focused perspective on representation construction paradigms, paired with an extensive review of indicative related approaches using text, image, and audio data, presented in Section 2.
A novel categorization of representation enrichment methods from the perspective of knowledge utilization mechanisms, along with representative literature over three modalities, provided in Section 4.
A detailed review of knowledge resources used for enrichment methods discussed, presented in Section 3.
A comparative, high-level view of proposed content-based and enrichment paradigms with respect to multiple qualitative axes of representation desiderata, presented in Section 5.
A critical overview of the totality of covered work, along with the body of findings and suggested future directions in representation enrichment regardless of the underlying modality, provided in Section 6.
There have been multiple surveys focusing on modality-oriented classification [Sebastiani 2002; Minaee et al. 2021; Lu and Weng 2007; Fu et al. 2010], as well as reviews on knowledge integration for ML tasks. The study in Altınel and Ganiz [2018] focuses on quantitative comparisons of experimental results over broad approaches for text classification. The work of Turney and Pantel [2010] investigates semantic variants of vector space models [Salton et al. 1975] and their utilization in text ML tasks. In Camacho-Collados and Pilehvar [2018], the authors focus on sense representations of text in both supervised and unsupervised settings, along with application and evaluation approaches. The survey in Ferrone and Zanzotto [2020] covers symbolic, distributed, and distributional features for NLP tasks, while Borghesi et al. [2020] provide a brief overview of augmenting deep learning with external knowledge expressed as constraints.
This study complements existing surveys by approaching knowledge enrichment with emphasis on:
The classification task; the representations we cover are examined in the context of their utilization for classification.
A holistic review of representation extraction; our primary focus is representation construction methods, from simple low-level feature extraction to deep representation learning [Bengio et al. 2013].
Different approaches to knowledge infusion into representations; we examine a broad range of representation enrichment methods, from low-level semantic features to end-to-end representation learning modifications.
Different knowledge resources; we cover multiple kinds of information and knowledge repositories (e.g., graphs, lexicons, collections of name-value information, linked data).
Different modalities; we examine the aforementioned axes on studies dealing with text, image, and audio data.

1.3 Article Structure

The article is structured as follows: we begin with content-based (knowledge-agnostic) representations for classification over different modalities in Section 2. In Section 3, we expand on the motivation for knowledge utilization and present a collection of usable resources. Section 4 covers representation enrichment via knowledge exploitation, for classification tasks on different modalities. A qualitative comparison over the proposed paradigms is provided in Section 5. We discuss findings and outcomes in Section 6 and close by presenting conclusions and future work avenues in Section 7.

2 Knowledge-agnostic Representations for Classification

In this section, we discuss existing data representation approaches for classification tasks that do not explicitly consider external information sources. We structure the discussion by organizing studies into three categories/paradigms:
Low-level and template-matching (LLTM) methods: We begin with approaches that rely on matching predefined templates on the input data. The collection of template-matching responses constitutes the representation output and a corresponding vector space embedding. Such approaches are covered in Section 2.1, and generally correspond to simple, engineered, and isolated components that perform intuitive measurements on real-world objects, mapping them to a computationally usable numeric representation to be fed down the pipeline. Examples include bags of words and features, audio and visual descriptor templates, statistical/signal measures of data streams, and so on [Sebastiani 2002; Jörgensen et al. 2001; Boyer et al. 1999].
Aggregation-based (AGB) methods: here we include approaches that group, transform and/or combine low-level representations from the previous category into higher level features, simultaneously facilitating improvements in terms of computational efficiency, redundancy filtering, and dimensionality reduction. Works in this category showcase initial representation and ML research efforts towards improving representation quality via content-based means, using engineered and pre-defined aggregation and/or post-processing operations. Approaches may include clustering, topic modeling, and decomposition methods [Uys et al. 2008; Rokach and Maimon 2005; Miettinen 2009]. They are covered in Section 2.2.
Deep representation learning (DRL) methods: the final group covers approaches that rely heavily on a hierarchy of modular, nonlinear components for representation learning. We focus on neural network (NN) models and deep learning, which can produce rich features that correlate with high-level abstract, conceptual, and context-aware information. This category reflects recent deep learning trends that discourage the explicit definition and over-engineering of the feature extraction procedure. Instead, arriving at richer and more efficient representations is entrusted to neural systems, which are set up to automatically discover useful features by learning from big data. These methods include convolutional, recurrent, and transformer NNs [Medsker and Jain 2001; Gu et al. 2018; Tay et al. 2023] and are presented in Section 2.3.
The proposed grouping above enables the identification of the general feature generation approach adopted by each work. Upcoming sections focus on each category, presenting related work that adopts such approaches for classification of different modalities. Notably, the saliency and holistic nature of the proposed content-based paradigms render them relevant to a very large number of different classification tasks and domains; to this end, the works covered for each category in the next sections should not be viewed as an exhaustive enumeration of all relevant approaches, but rather as indicative, characteristic, influential, and/or recent examples that illuminate the overall approach and methods typically pursued in the paradigm. Classification pipeline details (e.g., features, classifiers, metrics) for each study in the content-based literature covered are presented in Table 1.
Table 1.
citation | mod. | category | representation | labeling | classifiers | metrics
[Badawi and Altınçay 2014] | TXT | LLTM | BoW, termsets | BIN | k-NN | F1
[Trstenjak et al. 2014] | TXT | LLTM | TFIDF | MC-SL | k-NN | ACC
[Zhang and Wu 2015] | TXT | LLTM | BoW, extension | MC-SL | NB | P, R, F1
[Sowmya et al. 2016] | TXT | LLTM | TFIDF | MC-ML | k-NN, Rocchio | mAP
[Thirumoorthy and Muneeswaran 2021] | TXT | LLTM | BoW | MC-SL | SVM, NB | P, R, F1, ACC
[Zhang et al. 2011] | TXT | AGB | TFIDF, LSA | MC-SL | SVM | P, R
[Giannakopoulos et al. 2012] | TXT | AGB | NGG | MC-SL | graph similarity | P
[Zareapoor and Seeja 2015] | TXT | AGB | BoW, PCA, LSA | MC-SL | RF | AUC, ACC
[Ye et al. 2017] | TXT | AGB | TFIDF, LDA | MC-SL | SVM | P, R, F1
[Škrlj et al. 2021] | TXT | AGB | TFIDF, Word2Vec, Evolution | MC-SL | SVM, NEURAL, LR | ACC
[Liu et al. 2015] | TXT | DRL | SKIPGRAM, LDA | MC-SL | LINEAR | P, R, F1, ACC
[Yang et al. 2018] | TXT | DRL | SKIPGRAM | BIN | NEURAL | ACC
[Sun et al. 2019a] | TXT | DRL | TRANSFORMER | BIN/MC-SL | NEURAL | ACC
[Chen et al. 2020] | TXT | DRL | TFIDF, CNN | MC-SL | NEURAL | ACC
[Pan et al. 2022] | TXT | DRL | TRANSFORMER | MC-SL | NEURAL | ACC
[Risojević et al. 2011] | IMG | LLTM | GIST, Gabor | MC-SL | SVM | ACC
[Amato et al. 2015] | IMG | LLTM | SIFT, SURF, ORB, BRISK | MC-SL | k-NN, similarity | ACC, F1
[Bian et al. 2017] | IMG | LLTM | SIFT, LBP | MC-SL | KELM | ACC
[Prasad and Mary 2019] | IMG | LLTM | HoG, BRISK, LBP, KAZE | MC-SL | SVM | ACC
[Ningtyas et al. 2022] | IMG | LLTM | LBP, GLCM | MC-SL | k-NN | ACC
[Zou et al. 2016] | IMG | AGB | SIFT, LBP, SPM, LLC, KMeans | MC-SL | KCR | ACC
[Ilea et al. 2016] | IMG | AGB | RCovD, GMM, FV | MC-SL | kSVM | ACC
[Srivastava et al. 2019] | IMG | AGB | LBP, CFC | MC-SL | SVM | P, R, S, F1, ACC
[Zhu et al. 2019] | IMG | AGB | SURF, BRISK, KMeans | MC-SL | SVM | ACC
[Bodine and Hochbaum 2022] | IMG | AGB | pixel, PCA | BIN, MC-SL | DT | ACC
[Huang et al. 2017] | IMG | DRL | CNN, RESIDUAL | MC-SL | NEURAL | ACC
[Dosovitskiy et al. 2021] | IMG | DRL | TRANSFORMER, CNN | MC-SL | NEURAL | ACC
[Touvron et al. 2022] | IMG | DRL | MLP, RESIDUAL | MC-SL | MLP | ACC
[Liu et al. 2021] | IMG | DRL | TRANSFORMER | MC-SL | NEURAL | ACC
[Chen et al. 2021] | IMG | DRL | TRANSFORMER | MC-SL | NEURAL | ACC
[Laurier et al. 2009] | AU | LLTM | signal, musical, psych. | MC-SL | kSVM | ACC
[Valero and Alias 2012] | AU | LLTM | spectral, MFCC, MPEG7 | MC-SL | DT, SVM, MLP, k-NN | ACC
[Maršík et al. 2014] | AU | LLTM | signal, spectral, musical | MC-SL | MLP | P, R
[Zahid et al. 2015] | AU | LLTM | signal, spectral, temporal, MFCC | MC-SL | SVM, MLP, rule-based | ACC
[Meister et al. 2022] | AU | LLTM | signal, spectral, temporal, MFCC | MC-SL | SVM, RF, LR, k-NN, ADAboost | ACC, AUC
[Lee and Ellis 2010] | AU | AGB | MFCC, GMM, PLSA | MC-SL | kSVM | AP
[Kim et al. 2012] | AU | AGB | MFCC, VQ, LDA | MC-SL | kSVM | F1
[Grosse et al. 2007] | AU | AGB | MFCC, SISC, spectral | MC-SL | GDA, SVM | ACC
[Baniya et al. 2014] | AU | AGB | MFCC, signal, spectral, musical, PCA, MRMR | MC-SL | SVM | ACC
[Zeghidour et al. 2021] | AU | AGB | Gabor, GMM | MC-SL, ML | MLP | ACC, AUC
[Choi et al. 2016] | AU | DRL | CNN | MC-SL | NEURAL | ACC
[Hershey et al. 2017] | AU | DRL | CNN, INCEPTION, RESIDUAL | ML | NEURAL | mAP, AUC
[Nanni et al. 2021] | AU | DRL | spectral, CNN | MC-SL | NEURAL | mAP
[Gong et al. 2021] | AU | DRL | spectral, CNN | MC-SL | NEURAL | ACC
[Wang and Oord 2021] | AU | DRL | spectral, CNN, RESIDUAL | MC-SL | NEURAL | ACC, mAP
Table 1. Indicative Studies using Knowledge-agnostic Representations
LLTM, AGB, and DRL refer to the categories outlined in Section 2. In the labeling column, BIN, MC, SL, and ML refer to binary, multiclass, single-label, and multi-label configurations, respectively. Evaluation measures ACC, P, R, F1, AUC, and (m)AP refer to accuracy, precision, recall, F1-measure, area under curve, and (mean) average precision, respectively. Entries in the representation/classifier columns refer to acronyms described in the text, or intuitive categories of algorithms/approaches. NEURAL refers to using an appropriate layer for classification in the network output (e.g., dense followed by softmax).

2.1 Template Matching and Low-level Approaches

2.1.1 Overview.

In this section, we examine LLTM: representations that rely on matching templates on the input data [Bengio 2009], producing low-level responses. Given a representation template, we can discriminate between “local” (the template is applied on an input subsection) and “global” (the template spans the entire input instance) applications of the extraction process [Zhang et al. 2007; Mikolajczyk et al. 2005]. Local templates may regard the input as key-value pairs: keys locate distinct attributes (e.g., an individual word/ngram in text, a region/point of interest in images and audio), while values correspond to a magnitude of match or weight of the template at that location. Terms may be easily delineable in the source data (e.g., individual words in text, detected keypoints in images, a specific temporal slice/peak in audio) or lack high-level semantics and an intuitive explanation – the latter can affect the interpretability of the representation and, down the line, of the entire classification pipeline [Došilović et al. 2018; Danilevsky et al. 2020]. Global features usually provide coarse information on a narrow view of the input – e.g., color distribution information in images, sentiment/grammaticality scores in text, SNR values for audio, and so on [Oliva and Torralba 2006].

2.1.2 Approaches.

A popular representation is the Vector Space Model (VSM) [Salton et al. 1975], which projects data into a vector \(v \in \mathbb {R}^d\) that can be manipulated with distance measures and linear algebra constructs [Basseville 1989; Cha 2007] to process/compare instances. The Bag of Words/Features (BoW/BoF) [Salton and Buckley 1988; Sebastiani 2002] is a popular VSM that produces count-based weights for points/regions of interest in the input. BoF is a popular baseline for text, where semantically salient terms are easily identifiable and delineated by syntax and grammar. Common weighting schemes include boolean, term, and document frequency (BF, TF, DF), denoting presence, instance-level, and collection-level counts of single or n-tuples of terms (n-grams) in the text. The work in Badawi and Altınçay [2014] utilizes BoW features, along with an investigation of term weighting and selection, in the binary classification of articles and biomedical documents. “Termset features” are introduced, i.e., tuples of document terms (e.g., words) where the feature activates if either one or both terms are detected. In Zhang and Wu [2015], the authors reduce sparsity by building an extension library, using bigram conditional probabilities over the text sequence and word-to-category similarities based on word counts. Given a text, additional features are inserted with respect to their similarity score to the original feature set and introduced threshold heuristics. The extended feature set is then used to build the BoF representation.
The term frequency-inverse document frequency (TFIDF) weighting scheme [Salton and Buckley 1988] normalizes term counts by the DF weight, reducing the importance of tokens that occur too often in the document collection and behave like stopwords. In Trstenjak et al. [2014], the authors build a term-document frequency matrix to map articles to word TFIDF vectors, applying log-scaling and renormalization to improve robustness to varying document lengths. The authors in Sowmya et al. [2016] apply TFIDF to Wikipedia articles; document weights are pooled to category-level counts in order to build “centroid” vectors for each class, subsequently normalized by intra-class DF scores. In Thirumoorthy and Muneeswaran [2021], the authors propose feature selection schemes with respect to term and document frequency scores within and across classes. They use an evolutionary method, rich/poor population-based methods [Moosavi and Bardsiri 2019], and a classification-based fitness objective to optimize the final feature subset.
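For reference, one common formulation of the TFIDF weight of a term \(t\) in document \(d\), over a collection of \(N\) documents, is \(w_{t,d} = \mathrm{tf}_{t,d} \cdot \log (N / \mathrm{df}_t)\), where \(\mathrm{tf}_{t,d}\) is the count of \(t\) in \(d\) and \(\mathrm{df}_t\) the number of documents containing \(t\); variants add log-scaling of the term frequency and length normalization, as in Trstenjak et al. [2014].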
Contrary to text, BoF application in the visual domain is not straightforward: images lack clear semantic boundaries (“visual terms” are hard to delineate) and pixel-level approaches are intractable for most real-world tasks. Thus, visual LLTMs employ methods that apply (a) detection, i.e., locating regions of interest in the image [Tuytelaars and Mikolajczyk 2008], and (b) description, which applies low-level templates to build a representation for each such detected region.
A popular method is the Scale Invariant Feature Transform (SIFT) [Lowe 2004; Younes et al. 2012]. It describes keypoints by normalized histograms of gradient orientations in the pixel intensities of a co-centric patch, which are largely invariant to shifts in illumination, viewpoint, rotation, and scale. SIFT is adopted for a variety of recognition tasks, e.g., for detection/description in Amato et al. [2015] for local feature-based landmark classification, along with other descriptors [Bay et al. 2008], while a similar approach extracts SIFT from a dense regular grid with patch overlaps [Bian et al. 2017].
Other methods extract fine-grained information suitable for texture, such as Gabor features [Mehrotra et al. 1992; Manjunath and Ma 1996], which involve the application of Gabor filterbanks. Gabor filters are applied on separate color channels in Risojević et al. [2011], using mean and standard deviation values over multiple scales and orientations, along with spatial envelope GIST features [Oliva and Torralba 2001]. Another approach is Local Binary Patterns (LBP) [Ojala et al. 2002], which extracts fine-grained, rotation-invariant binary-value histograms via simple pixel-level comparisons between a center and its radial neighbors. LBP has been evaluated with different tasks, global/local contexts, resolutions, and focal configurations [Prasad and Mary 2019; Bian et al. 2017]. The authors in Ningtyas et al. [2022] utilize LBP along with Gray Level Co-occurrence Matrix (GLCM) features to capture texture information for leaf classification.
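As an illustration of the pixel-comparison mechanism behind LBP, the following is a minimal sketch assuming scikit-image; the random image and parameter values are stand-ins.

```python
# Minimal LBP texture-descriptor sketch using scikit-image; the 'uniform'
# variant yields the rotation-invariant histograms described above.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_image, n_neighbors=8, radius=1):
    # Compare each center pixel against its radial neighbors and encode
    # the binary comparison pattern as a per-pixel code
    codes = local_binary_pattern(gray_image, n_neighbors, radius, method="uniform")
    # Pool the per-pixel codes into a global, normalized histogram feature;
    # 'uniform' LBP produces n_neighbors + 2 distinct code values
    hist, _ = np.histogram(codes, bins=np.arange(n_neighbors + 3), density=True)
    return hist

example = np.random.randint(0, 256, (64, 64)).astype(np.uint8)
print(lbp_histogram(example).shape)  # (10,)
```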
ORB [Rublee et al. 2011] improves upon BRIEF [Calonder et al. 2010] features, adding “steering” mechanisms for rotation invariance and noise resistance, and is used in tasks like monument classification [Amato et al. 2015]. Further methods include Histograms of Gradients (HoG) [Dalal and Triggs 2005], which computes distributions of pixel intensity gradient orientations, BRISK [Leutenegger et al. 2011], which produces pairwise intensity comparisons as binary features, and KAZE [Alcantarilla et al. 2012], which applies nonlinear diffusion filtering for smoothing and multiscale operation; these are used for feature extraction/detection in various tasks [Prasad and Mary 2019; Amato et al. 2015].
In audio data, semantic ambiguity similar to the visual domain exists; however, given audio's one-dimensional and temporal structure, LLTM methods often apply simple features within time-segmented frames or in a global context.
A popular method is to capture statistical signal properties (e.g., mean, variance, extrema) and utilize the responses as feature vectors; for example, in Maršík et al. [2014] the authors consider RMS amplitudes as audio volume estimates, along with a self-similarity computation that counts the similar segments in a music piece. Further, in Zahid et al. [2015], signal sign-change averages (zero-crossing rate, ZCR), short-time signal energy, and periodicity analysis features are concatenated across audio window frames and used toward capturing repeated acoustic patterns.
Another approach switches to the frequency domain to mine low-level features from audio spectra. This is often achieved via the Fast Fourier Transform (FFT) [Bracewell and Bracewell 1986], which maps time-domain signals to frequency spectrograms via Fourier analysis on small overlapping time windows. This is used in works like Laurier et al. [2009], where the authors extract spectral statistic estimates like kurtosis, skewness, flatness, and flux for music emotion classification. Mel-frequency cepstral coefficients (MFCC) [Slaney 1998] involve multiple transformation, scaling, and normalization steps for short-term power spectrum description of an audio signal. They are used in studies like Maršík et al. [2014], via mean and covariance statistics, and Zahid et al. [2015], along with spectral flux scores. Furthermore, Gammatone Cepstral Coefficients (GTCC) are a biologically inspired modification of MFCC based on Gammatone filter functions with equivalent rectangular bandwidth bands; they are proposed and utilized in Valero and Alias [2012].
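A minimal sketch of such frame-level feature extraction follows, assuming the librosa library and a hypothetical input file; frame-wise responses are pooled with mean/standard deviation statistics into a clip-level vector, a common practice rather than a method from a specific covered work.

```python
# Minimal sketch of frame-level LLTM audio features (ZCR, RMS energy, MFCC)
# pooled into a clip-level vector; "clip.wav" is a hypothetical input file.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=22050)

zcr = librosa.feature.zero_crossing_rate(y)          # sign-change rate per frame
rms = librosa.feature.rms(y=y)                       # short-time energy per frame
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # cepstral coefficients per frame

# Pool the frame-wise responses with mean/std statistics into one vector
features = np.concatenate([m.mean(axis=1) for m in (zcr, rms, mfcc)] +
                          [m.std(axis=1) for m in (zcr, rms, mfcc)])
print(features.shape)  # (30,) = 2 * (1 + 1 + 13)
```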
Some features target musicality/rhythm-based information by using frequency binning and temporal progression statistics, or model psycho-acoustic phenomena by considering the operation of the human hearing system. For instance, timbral, perceptual, and tonal information via dissonance and loudness measures are extracted in Laurier et al. [2009], along with “danceability”, beats per minute (BPM), ZCR, and chord change features. The authors in Maršík et al. [2014] use musical LLTM features like BPM, probabilistic estimates of chord root transitions, and musical keys. Furthermore, the work in Meister et al. [2022] uses an engineered feature bank composed of temporal, spectral, cepstral, and tonal responses, which are ranked and evaluated for COVID patient classification.
In summary, this section showcased approaches that utilize LLTM features for classification; we now move on to methods that transform, manipulate and aggregate this information towards improving performance and tractability.

2.2 Aggregation-based Methods

2.2.1 Overview.

In this section, we explore approaches that rely on aggregating, combining, and/or transforming lower-level representations to arrive at higher-level features, with respect to the abstractness and richness of the information encapsulated. Aggregation methods produce mid-level representations from low-level inputs, using engineered and/or learned functions rather than directly exploiting token-based statistics. In the scope of our analysis, these techniques correspond to the first attempts at improving the semantic content and richness of low-level representations by applying aggregation as a post-processing component. Contrary to low-level features, AGB methods usually build distributed representations [Hinton et al. 1984; Rumelhart 1986]: i.e., the resulting semantics are spread or “distributed” over multiple dimensions in the embedding space [Bengio 2009], arriving at compact, robust features but sacrificing explainability. Dimensionality reduction is often facilitated by these approaches, mitigating the curse of dimensionality [Bellman 2013] while maintaining or improving the expressive power of the output feature set.

2.2.2 Approaches.

Aggregation methods have presented different avenues for fusing LLTM features. The modality-specific delineation of semantically meaningful terms plays an important role in the usefulness of different approaches.
For text data, local LLTM methods deal with distinct words, characters, and ngrams; as a result, the generated vocabulary can quickly scale to very large sizes, reducing system performance and accuracy [Bengio et al. 2005]. Thus, aggregation methods for text often aim at shifting the representation from the vocabulary space to one that is more dense and compact, while simultaneously preserving the majority of the information content of its original counterpart.
A popular class of AGB methods is matrix decomposition techniques [Golub 1969], where an \(M \times d\) matrix of \(M\) \(d\)-dimensional features is converted to an \(M \times k\) matrix, \(k \le d\). For instance, a popular method is Latent Semantic Analysis (LSA) [Deerwester et al. 1990], which produces document mappings to a set of k latent concepts. This is achieved by truncating the Singular Value Decomposition factors [Trefethen and Bau III 1997] of a term-document matrix, keeping the vectors with the k largest singular values. LSA is used in Zhang et al. [2011] to post-process TFIDF features and a multi-word term vector approach [Justeson and Katz 1995] for news article categorization with linear SVMs. Additionally, Principal Component Analysis (PCA) [Jolliffe 2011] applies a change of basis to the principal components of the original collection, which correspond to the eigenvectors of the covariance matrix of the original instances. Keeping a subset of k components retains an optimal tradeoff between variance retention and dimensionality reduction. PCA has been used in early works, for example in Li and Jain [1998], where term count features are post-processed with PCA and hierarchical clustering for news categorization with multiple classifiers. Additionally, in Zareapoor and Seeja [2015], both PCA and LSA are investigated for compression and reduction of complexity and processing time of BoW features. Further, Latent Dirichlet Allocation (LDA) [Blei et al. 2003] generates topic models over Dirichlet priors, assuming distributions from instances to topics and from topics to words. In Ye et al. [2017], the authors use TFIDF to model outlier words that are overlooked by LDA and include them for sentiment classification.
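As a brief illustration of the LSA pipeline described above, the following sketch (assuming scikit-learn and a toy corpus) compresses TFIDF vectors into \(k=2\) latent concepts via truncated SVD:

```python
# Minimal LSA sketch: TFIDF features compressed to k latent concepts via
# truncated SVD; the four-document corpus is an illustrative stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

docs = ["stocks fell sharply", "the market rallied",
        "the team won the match", "a late goal sealed the game"]

lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
latent = lsa.fit_transform(docs)   # one k-dimensional concept vector per document
print(latent.shape)                # (4, 2)
```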
Other works adopt evolutionary methods to modify the representation; in Škrlj et al. [2021], weights for diverse sets of input space features undergo mutation and crossover, while fitness is estimated by the aggregate classification performance of multiple learners trained with SGD. Further, the n-gram graph aggregation methods in Giannakopoulos et al. [2012] map documents to graphs, subsequently merged into class-level constructs. These are then compared with document-level graphs for categorization of the latter, using different graph similarity measures.
In images, LLTM features generally produce non-scalar information, yielding large volumes of highly local responses. Many approaches use vector quantization [Gersho and Gray 1992] to fuse descriptor vectors and reduce the workload of subsequent classification components, as well as redundancy and noise. Clustering methods are popular for this task; for example, KMeans [Jain and Dubes 1988] merges populations of N local features into a pre-determined number of \(k \ll N\) visual clusters. These serve as an artificial vocabulary for a visual BoW analog, with distance-based assignments of local features to “visual words”. The bag vector is then used as a global image feature. KMeans vocabularies are used, e.g., in Zhu et al. [2019], over different combinations of SURF and BRISK for detection and description.
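The following is a minimal visual bag-of-words sketch, assuming scikit-learn; random arrays stand in for SIFT/SURF-style descriptors:

```python
# Minimal visual bag-of-words sketch: local descriptors from all training
# images are clustered with KMeans, and each image is represented by a
# histogram of its descriptors' cluster ("visual word") assignments.
import numpy as np
from sklearn.cluster import KMeans

k = 32
# Stand-ins for per-image local descriptors, e.g., (n_i, 128) SIFT outputs
descriptors_per_image = [np.random.rand(200, 128) for _ in range(10)]

codebook = KMeans(n_clusters=k, n_init=10).fit(np.vstack(descriptors_per_image))

def bow_vector(descriptors):
    words = codebook.predict(descriptors)   # nearest visual word per descriptor
    hist, _ = np.histogram(words, bins=np.arange(k + 1), density=True)
    return hist                             # global image feature

print(bow_vector(descriptors_per_image[0]).shape)  # (32,)
```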
Modifications to the visual BoW include Spatial Pyramid Matching (SPM) [Lazebnik et al. 2006], which partitions and separately groups keypoints into preset image subdivisions. Further, locality-constrained linear coding (LLC) [Wang et al. 2010] enforces locality constraints on KMeans via regularization and modification of the membership procedure, so that instances are supported by multiple codebook bases for reduced reconstruction error and sparsity. LLC is evaluated in Zou et al. [2016], aggregating dense SIFT descriptors in combination with global LBP features. Transformation-oriented methods like PCA are applied in the visual domain in Bodine and Hochbaum [2022], where a modification of Decision Trees [Lewis and Ringuette 1994] is proposed that uses a one-dimensional, single-feature Maximum Cut split criterion, in conjunction with two localized PCA-based methods for transforming and mapping decision features. The vectors of locally-aggregated descriptors (VLAD) approach [Delhumeau et al. 2013] replaces binary cluster assignments with concatenations of accumulated subtraction residuals. Another modification is Clustering with Fixed Centers (CFC) [Srivastava et al. 2019], over LBP features and SURF keypoints. Descriptors are grouped into bags assuming a fixed cluster center with respect to response maxima per LBP histogram bin, with combined category-level bags used as the global feature. Further, Fisher vector encoding [Perronnin and Dance 2007] uses visual vocabularies built with Gaussian Mixture Models (GMM) [Reynolds 2009], using the gradients of the GMM log-likelihood under the data as encoded codebook features. A generalization to Riemannian spaces, generated by Riemannian Gaussian Mixture (RGM) models, is proposed in Ilea et al. [2016]. The method uses region covariance descriptors (RCovD) as input features, built from covariance information of sliding image patches.
In the audio domain, AGB approaches may exploit sequential signal interdependencies to pool together feature responses temporally/spectrally close to each other, along with the aforementioned methods of segmentation into “acoustic words”.
Aggregation via generative/probabilistic models has been used for classification, like GMMs in Lee and Ellis [2010]. There, MFCC inputs are combined with single- and multiple-Gaussian GMMs, and pLSA is then applied on the built acoustic vocabulary. In the study of Kim et al. [2012], “acoustic words” are formed from MFCC features via the LBG-VQ quantizer [Gersho and Gray 1992]. LDA is subsequently employed to generate audio-related topics over the audio vocabulary, with the resulting weights serving as latent vectors for classification.
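A minimal sketch in the spirit of these GMM-based acoustic vocabularies follows, assuming scikit-learn and random stand-ins for MFCC frames (not a reconstruction of any specific covered method); averaged component posteriors serve as the clip-level encoding.

```python
# Minimal "acoustic vocabulary" sketch: MFCC frames are soft-assigned to
# Gaussian components, and the averaged posteriors form a clip-level feature.
import numpy as np
from sklearn.mixture import GaussianMixture

# Random stand-ins for (n_frames, 13) MFCC frames pooled from training clips
train_frames = np.random.randn(5000, 13)
gmm = GaussianMixture(n_components=16).fit(train_frames)

def clip_encoding(mfcc_frames):
    posteriors = gmm.predict_proba(mfcc_frames)  # (n_frames, 16) soft assignments
    return posteriors.mean(axis=0)               # pooled clip-level vector

print(clip_encoding(np.random.randn(300, 13)).shape)  # (16,)
```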
As in other modalities, matrix decomposition has also been applied to audio. In Baniya et al. [2014], the authors use PCA to reduce a feature set of spectral, dynamic, harmonic, and rhythm characteristics, along with higher-order moments. Sparse coding (SC) [Lee et al. 2007] introduces a sparsity bias to the encoding objective, in the form of a weighted L1 regularization term on the encoding vectors. A shift-invariant modification [Olshausen and Field 1996] proposed in Grosse et al. [2007] reconstructs the input signal with basis functions in all possible shifts to build temporally-invariant encodings. The authors apply computational efficiency improvements and compare the encodings with MFCC and raw spectrogram features. The approach in Zeghidour et al. [2021] applies a pipeline of filtering, pooling, and compression/normalization composed of learnable steps, including normalized Gabor convolutions, Gaussian low-pass filters, and per-channel energy normalization [Wang et al. 2017], fed to MLPs for different categorization scenarios.
Having examined aggregation engineering of low-level features, the next section concludes the examination of knowledge-agnostic approaches by considering automatic DRL techniques.

2.3 Deep Representation Learning

2.3.1 Overview.

Approaches covered so far used preconfigured templates applied locally on training data points, as well as manipulations of their responses in fixed, preconfigured steps. In this section, we focus on “deep” feature extractors, which aim to automatically learn multiple useful feature hierarchies from training data [Bengio 2009].
Typical representatives of this paradigm are graph-based computational models, such as the biologically inspired artificial neural networks (NNs) [Hubel and Wiesel 1962]. While “deep” generally refers to models that learn semantically abstract, meaningful features, in the context of NNs it can also refer to the size of the graph that implements the computation. Deep NNs are hierarchical models composed of multiple steps of nonlinear, learnable feature transformations. Contrary to previous paradigms, the line between features and classifiers is blurred: feature transformation, representation learning, and classification are simultaneously optimized in a holistic manner, learning the conversion of input data to prediction scores in an automatic, end-to-end fashion. This enables the design of efficient representation learners with fewer hard-coded, performance-critical parameters. As a result, deep approaches reflect further research efforts for improving representation semantics, diverging from rigorous feature and/or aggregation engineering and instead relying on a data-driven, automatic discovery of expressive features from scratch.

2.3.2 Approaches.

DRL methods generally exploit unsupervised pretraining [Bengio et al. 2007], fine-tuning, and transfer learning [Zhuang et al. 2020], with fitted model weights used as initializations for supervised training/fine-tuning for classification, or used directly as standalone features [Bengio et al. 2013; Pittaras et al. 2017].
Neural language models (NNLMs) [Arisoy et al. 2012] are popular approaches in text, applying distributional learning over big data to learn language structure and feature hierarchies via statistical learning of token distributions. NNLM-based pretraining and transfer learning are an effective approach for deep feature extraction, as comparisons between NNLM-based methods and LLTM and AGB techniques have indicated [Baroni et al. 2014].
Word embeddings are a popular example; for instance, Word2Vec [Mikolov et al. 2013] is pretrained via predicting the center word given its context (CBOW), or the context given the center word (Skipgram). Multiple studies exploit the effectiveness of transfer learning with Word2Vec features. In Yang et al. [2018], Word2Vec is pretrained over social media data, with concatenated word vectors feeding a Convolutional Neural Network (CNN) with pooling components. Skipgram embeddings initialize the system in Liu et al. [2015], where LDA is used to build word and topic embeddings, exploring transfer learning variants to embed topics, word-topic pairs, or to concatenate topic and word vectors.
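A minimal Skipgram pretraining sketch follows, assuming the gensim library; the two-sentence corpus and hyperparameters are illustrative stand-ins.

```python
# Minimal Word2Vec (Skipgram) pretraining sketch with gensim; the corpus
# and hyperparameter values are illustrative.
from gensim.models import Word2Vec

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["dogs", "and", "cats", "are", "pets"]]

model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, sg=1)   # sg=1 selects Skipgram over CBOW
vec = model.wv["cat"]                 # a 50-dimensional word embedding
print(vec.shape)                      # (50,)
```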
Some approaches utilize count-based inputs/seeds for faster operation and improved compatibility. In Chen et al. [2020], input TFIDF features are transformed into 2-dimensional matrices, fed to a CNN architecture equipped with pooling and fully-connected components. Other methods use transformers [Vaswani et al. 2017], relying on attention mechanisms [Bahdanau et al. 2015] to facilitate sequence modeling and learning instead of recurrence or convolution. These methods have been shown to produce powerful, transferable representations, explored in studies like Sun et al. [2019a], where the authors propose ways of fine-tuning BERT [Devlin et al. 2019] for efficient classification. Enhancements include additional pre-training, single/multi-task fine-tuning, and varying learning rates per transformer layer. In Pan et al. [2022], the authors introduce data augmentation by generating adversarial examples with FGSM [Goodfellow et al. 2015], perturbing the word embedding matrix of a transformer and using the result for contrastive learning of noise-invariant representations, with improved performance on multiple NLU tasks including sentiment analysis.
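A minimal fine-tuning sketch in the spirit of such transformer-based approaches, assuming the HuggingFace transformers library (the dataset, optimizer loop, and enhancements of the covered works are omitted):

```python
# Minimal BERT fine-tuning sketch for classification with HuggingFace
# transformers; texts and labels are illustrative stand-ins.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

batch = tok(["great movie!", "terrible plot."], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

out = model(**batch, labels=labels)   # loss from the classification head
out.loss.backward()                   # one gradient step (optimizer omitted)
```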
In images, the large semantic gap between pixel-level and conceptual information has encouraged the design of DRL approaches. CNN architectures have been very popular in this domain, with early successful approaches contributing to the rise in popularity of deep learning [Krizhevsky et al. 2012]. CNNs typically repeat layers of convolution, pooling, and normalization, followed by fully-connected (dense) layers with dropout [Srivastava et al. 2014] and softmax normalization, and have been popular choices for large-scale image classification tasks [Russakovsky et al. 2015].
Different CNN topologies have been proposed to allow efficient training of larger models, e.g., residual connections, which preserve signals via additive identity operations, and inception modules, which introduce blocks of dimensionality reduction, pooling, filtering, and concatenation to improve scalability [Szegedy et al. 2015; He et al. 2016]. Additional approaches include the “densely connected” architecture, where layer outputs are linked to the inputs of all subsequent layers and vice versa, with residual connections replaced by feature concatenation. This topology is evaluated using the DenseNet model [Huang et al. 2017] on a wide range of image classification tasks. The model requires fewer parameters than ResNets, achieved by utilizing compression, bottlenecking, and growth control techniques in the architecture. Further, in Touvron et al. [2022], image patches are processed via linear sublayers operating across patches and channels, equipped with residual connections and forming image-level predictions with average pooling.
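A minimal sketch of the residual connection idea, assuming PyTorch; the identity shortcut adds the block input to the output of the learned transform.

```python
# Minimal residual block sketch (PyTorch): the additive identity shortcut
# preserves the input signal alongside the learned convolutional transform.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm2d(channels), nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        h = self.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return self.relu(h + x)   # identity shortcut: output = F(x) + x

print(ResidualBlock(16)(torch.randn(1, 16, 32, 32)).shape)  # (1, 16, 32, 32)
```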
Following their success in NLP tasks, transformers have been utilized in the visual domain with similar results. In Dosovitskiy et al. [2021], a “Vision Transformer” (ViT) is used directly on raw patch sequences as well as CNN embeddings, using a classification head for Imagenet data categorization. The work in Liu et al. [2021] hierarchically merges image patches (like approaches in Lazebnik et al. [2006]) into larger visual tokens, computing self-attention over shifting windows rather than the entire token set, which achieves linear complexity and comparable performance. A similar approach is pursued in Chen et al. [2021], where image patches at different scales are encoded separately and subsequently fused with cross-attention strategies of different granularity between the encoded tokens of each scale.
Regarding deep approaches for audio, an established technique has been to first convert audio content into images and then apply a visual DRL pipeline. This conversion usually involves spectrograms, i.e., 2D time-frequency representations, produced by applying the short-time Fourier transform [Sejdić et al. 2009] on segmented audio clips. Applying CNN-based architectures has been a broadly popular approach, in a manner similar to vision DRL models. For example, in Hershey et al. [2017], the authors perform a large-scale evaluation of popular CNN architectures designed for the visual domain [Krizhevsky et al. 2012; Simonyan and Zisserman 2015; Szegedy et al. 2015; He et al. 2016] on audio spectrograms, varying the amount of data and labelset size for each experiment. All visual models considerably outperform a dense NN baseline applied on raw spectrogram features for audio classification. In Nanni et al. [2021], ensembles of popular CNN models with different data augmentations (signal, spectral, visual, and so on) are applied on spectrograms and evaluated on multiple benchmarks. Furthermore, the authors of Wang and Oord [2021] adopt a contrastive learning approach, using waveform and spectrogram-level representations in a siamese configuration. They employ CNN and residual architectures on MFCC/spectrogram and raw audio inputs for representation learning, applying the fixed features to sound event classification tasks.
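A minimal sketch of this audio-to-image conversion, assuming librosa and a hypothetical input file; the resulting log-mel spectrogram can be fed to a 2D CNN as a one-channel image.

```python
# Minimal audio-to-spectrogram conversion sketch with librosa, producing
# the 2D "image" input used by the CNN-based audio works discussed above.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=16000)            # hypothetical input file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)                    # (64, n_frames) log-mel "image"

# Add channel/batch dimensions to feed a 2D CNN as a one-channel image
cnn_input = log_mel[np.newaxis, np.newaxis, :, :]
print(cnn_input.shape)
```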
Approaches like deconvolution [Zeiler and Fergus 2014] reverse CNN computation to project filterbank activations back to the pixel space. This is exploited in Choi et al. [2016] toward reproducing audio corresponding to the weights of trained CNN networks, aiming at improving the explainability of learned features. The authors utilize an architecture of convolution, pooling, and fully-connected layers on audio spectrograms for music genre classification. Transformer models are also utilized, e.g., in Gong et al. [2021], where a transformer encoder pretrained on Imagenet is applied to input audio spectrograms, classifying the output [CLS] embeddings into auditory classes and outperforming convolutional architectures.

3 Structured Knowledge

Having covered representation approaches in the literature, we now move on to examine structured knowledge resources available to the research community that can be utilized to enrich ML systems in the context of classification tasks.

3.1 The Need for Enrichment

A classification system has to overcome multiple difficulties in order to facilitate efficient and robust classification. One of the key challenges is clarifying ambiguity and deducing missing information in the input, the lack of which may hinder its decision-making abilities. Additional obstacles, limitations, and challenges may include:
Critical contextual knowledge, disambiguating factors, and crucial information that may be missing from the training data: e.g., language ambiguity in text, unclear scale/orientation and pareidolia in audiovisual media.
Incomplete domain-specific knowledge in the training data, such as named-entity information, or audiovisual logos and/or artifacts of special importance and/or contextual meaning.
Factors inherent in the data generation setting, for example, subjective human attributes like education, writing style, cultural elements (prosody, dialects), and a shifting zeitgeist expressed in language/media (memes, slang, and so on).
Inherent data ambiguity (e.g., language polysemy, optical/auditory illusions), leading to diverse interpretations.
Explainability/transparency standards for classifier outputs or internal operation: these may be required, e.g., in use cases involving both experts (e.g., in research, medicine, governance) and laymen (e.g., commercial products).
Technological limitations, spanning a large semantic gap between engineered and real concepts [Sikos 2017].
In this survey, we explore promising methods that inject structured, curated knowledge into classifiers [Silvestri et al. 2021], mined from knowledge repositories, databases, ontologies, and lexicons. Early research efforts relied on rule-based systems [Anaya 2011] that maintained expert knowledge preconfigured into decision rules relevant to the task, serving as a decision-making oracle, or built knowledge databases, with relationships and conceptual information inferred by human experts [Cadoli and Donini 1997; Deng 1990]. This effort, along with the drive to digitize human knowledge, has led to a wealth of information available in machine-readable format. As a result, a large number of resources can be readily exploited toward improving tasks for which a relevant and applicable resource exists [Aye et al. 2008].

3.2 Knowledge Resources

We move on to discuss an indicative but diverse set of resources usable for enriching representations for classification. Table 2 provides details such as information unit types, relational structure, compilation means, and other information.
Table 2.
name | unit type | relation | compilation | endpoint / format | url | language
Wordnet [Miller 1995] | concept | hierarchical, semantic | manual | multiple / multiple | web, nltk | multiple
SentiWordnet [Baccianella et al. 2010] | concept | polarity, hierarchical, semantic | automatic | python, web / text | nltk, code | English
Framenet [Fillmore et al. 2003] | semantic frame | frame-semantic | manual | python / - | web | multiple
Babelnet [Navigli and Ponzetto 2010] | concept / entity | hierarchical, semantic, similarity | automatic | multiple / multiple | web | multilingual
DBpedia [Lehmann et al. 2015] | property-value | linked-data | mixed | web, SPARQL, REST / RDF, JSON | web, code | multiple
Wikidata [Vrandečić and Krötzsch 2014] | property-value | linked-data | manual | multiple / multiple | web | English
ParaphraseDB [Ganitkevitch et al. 2013] | phrase | paraphrasal | automatic | web / text | web | multiple
Freebase [Bollacker et al. 2008] | property-value | linked-data | manual | web / text | web | English
Probase [Wu et al. 2012] | concept | hierarchical, semantic | automatic | - | web | English
ASER [Zhang et al. 2020] | phrase | semantic, causal | automatic | REST, python | web, github | English
YAGO [Suchanek et al. 2007] | entity | hierarchical, semantic | manual | REST, web / JSON | web | multiple
ConceptNet [Liu and Singh 2004] | concept | hierarchical, semantic, similarity | mixed | python, REST, web / JSON | web | multiple
CyC [Lenat et al. 1990] | concept / entity | hierarchical, semantic | manual | multiple / multiple | web, code | English
Imagenet [Deng et al. 2009] | concept | hierarchical | manual | web / XML | web | English\(^{*}\)
Visual Genome [Krishna et al. 2017] | concept | semantic | manual | multiple | web | English\(^{*}\)
Audioset [Gemmeke et al. 2017] | concept | hierarchical | manual | web / JSON | web | English\(^{*}\)
Music Ontology [Raimond et al. 2007] | concept | semantic | manual | multiple / multiple | web | English\(^{*}\)
Audio Feature Ontology [Allik et al. 2016] | concept | semantic, hierarchical | manual | RDF | web | English\(^{*}\)
COMUS [Rho et al. 2009] | concept | semantic, hierarchical | manual | RDF | web | English\(^{*}\)
E-ANEW [Warriner et al. 2013] | lexicon | affective | manual | web / csv | web | English
General Inquirer [Russell 1980] | lexicon | affective | manual | java, python / - | java, python | English
Labelsets | concept | hierarchical, semantic | - | - | - | -
Table 2. Knowledge Resources for Enrichment of Classification Tasks
For programmatic resource endpoints and labelsets, the format is irrelevant and is omitted. In the language column, entries with an asterisk superscript\(^*\) refer only to the language in which the resource elements are described – i.e., the resource content itself is not linguistic and does not lend itself to a specific language.
A popular resource is Wordnet [Miller 1995], a manually compiled directed acyclic graph (DAG) whose nodes contain sets of synonymous semantic concepts (“synsets”) along with example lexicalizations. Nodes connect with relations such as hypernymy/hyponymy (e.g., dog “is-a” animal) and meronymy (e.g., wheel “is-part-of” car), and carry additional metadata (e.g., examples and definitions). Sentiwordnet [Baccianella et al. 2010] is a related resource with automatic synset annotations of positive/negative/neutral polarities, scored in \([0.0, 1.0]\) and summing to unity. Babelnet [Navigli and Ponzetto 2010] is a network disambiguating and linking Wikipedia lemmas, encyclopedic content, and synsets, while machine translation and cross-lingual links from Wikipedia provide multilingual information. Framenet [Baker et al. 1998] is a hierarchical lexical database annotated with descriptions of semantic roles pertaining to, e.g., events, relations, situations, and related entities, based on frame semantics [Fillmore 2006].
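A minimal sketch of querying Wordnet programmatically through the NLTK endpoint listed in Table 2 (assuming the wordnet corpus has been downloaded):

```python
# Minimal Wordnet query sketch via NLTK: synsets, definitions, and
# hypernyms for a lemma; requires a prior nltk.download("wordnet").
from nltk.corpus import wordnet as wn

for synset in wn.synsets("dog")[:2]:
    print(synset.name(), "-", synset.definition())
    # The "is-a" (hypernymy) relation described above, e.g., dog -> canine
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])
```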
Further, CyC [Lenat et al. 1990] consists of a knowledge base that catalogues numerous formal assertions representing existing human knowledge. Facts are encoded into a handcrafted knowledge representation language (CycL) using a logical framework, forming an ontology and inference engine developed by experts. The ConceptNet [Liu and Singh 2004] graph database captures knowledge via semi-structured natural language fragments, using nodes that represent both simple concepts and compound entities and events, built by combining primitive building blocks like verbs and noun and prepositional phrases (e.g., eat lunch, in the evening). Edges map relations like synonymy, meronymy, causality, and affect, as well as probabilistic associations (e.g., “it often holds that”), corresponding to “informal everyday knowledge”. It was built by rule-based extraction on crowdsourced “fill-in-the-blank” commonsense tasks. Probase [Wu et al. 2012] is a large hierarchical taxonomy of 2.7 million concepts, automatically built by parsing large volumes of webpage content. It includes a probabilistic model to encapsulate ambiguity, degrees of certainty, and learned inconsistencies from web source content. ASER (activities, states, events, and relations) [Zhang et al. 2020] is an eventuality knowledge graph parsed from text data. Eventualities are verb phrases, linked with temporal, contingency, comparison, expansion, and co-occurrence relations, with the resulting graph conveying rich semantics.
DBpedia [Lehmann et al. 2015] utilizes Wikipedia article infoboxes, manual and automatic rule-based extraction, named-entity recognition, and statistical methods to build a semantic web structure, using linked data technologies that facilitate cross-lingual cataloguing of Wikipedia literals and property-based relations. Wikidata [Vrandečić and Krötzsch 2014] is an open collaborative platform aiming to provide the data available on Wikipedia in a structured and easy-to-use format. It provides property-value pairs (e.g., author - George Orwell) as well as more complex contextual relationships via the use of “qualifiers” (e.g., conveying temporal information). The YAGO project [Suchanek et al. 2007] combines Wordnet and Wikipedia information via automatic heuristics, finalized by quality control-oriented post-processing. An ontology links entities with each other (with semantic, linguistic, hierarchical, and real-world associations like “hasWonAward”) and with literals (e.g., string lexicalizations and numbers). Freebase [Bollacker et al. 2008] is a large collaborative database consisting of entities, properties, and assertions, organized as tuples in a graph data store. It emphasizes scalability, collaborative maintenance, and ease of use, toward research and open data-oriented community applications. Furthermore, ParaphraseDB [Ganitkevitch et al. 2013] is a database containing paraphrase pairs, i.e., pairs of semantically equivalent but syntactically and/or lexically different phrases. Pairs are built by analyzing parallel corpora with bilingual pivoting: paraphrases are source pairs that translate to the same string in a foreign language. Subsequent refinements with distributional re-ranking take contextual information into account.
Multiple resources are organized in lexicon/name-value formats. Expanding on previous work [Bradley and Lang 1999], E-ANEW [Warriner et al. 2013] contains affective norms like valence, arousal, and dominance (referring to degrees of pleasantness, emotion, and power/control) for over 14 thousand English words. Additionally, the General Inquirer [Russell 1980] psycholinguistic model maps affect in spatial coordinates for over 8 thousand English words, assigned to 182 affective categories that emphasize coverage of a broad range of psychological states.
There are some knowledge resources that deal with multimedia. Imagenet [Deng et al. 2009] links 80 thousand Wordnet synsets to hundreds of high-resolution representative images (covering animals, objects, scenes, and so on), along with bounding box annotations, for a total of 3.5 million images. The data was compiled through automated retrieval methods and refined via crowdsourced quality control. Visual Genome [Krishna et al. 2017] provides 100 thousand images with dense crowdsourced annotations such as object regions, attributes, relationships, and scene graphs. Additionally, region descriptions, question-answer pairs, and Wordnet linkage are included, towards “grounding visual concepts to language”. Further, the Audioset ontology [Gemmeke et al. 2017] organizes audio categories into a hierarchical structure with abstract classes such as “human sounds”, “music”, and “natural sounds”, and refined categories like “whistling”, “musical instrument”, and “wind”. Concepts come with a short textual description and representative multimedia links. The Music Ontology [Raimond et al. 2007] involves musical concepts at three levels of granularity and expressiveness, from editorial (e.g., track/artist/album information), through performance-related concepts (e.g., performance, recording, and audio stream-level attributes), to decomposition into fine-grain musical elements (e.g., key, musical instrument, and temporal localization of an audio piece). An extension of the above work is the context-based music recommendation (COMUS) ontology [Rho et al. 2009]. It consists of hierarchical, music-related relationships, definitions, and attributes, associating them with people, genre, mood states, locations, situations, and events. The Audio Feature Ontology [Allik et al. 2016] provides multiple levels of abstraction for describing the audio feature extraction process, containing a catalogue of \(\approx\)400 features. Entries are organized with respect to data density and temporal characteristics, ranging from abstract conceptualizations (e.g., “chromagram”) to specific extraction algorithms.
Alternatively, exploitable knowledge can be mined from the ground truth/labelset of the data, in a modality-agnostic way. For instance, a low confidence for a superclass in a label hierarchy automatically provides a bias against predicting its children. This can be exploited by, e.g., hierarchical classification [Silla and Freitas 2011], but also serves as knowledge that can be integrated in representations of the data in question. In addition, ground truth/metadata can be utilized in the representation construction phase, such as user tags and textual descriptions of multimedia instances.
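As a minimal illustration of this idea (a hypothetical two-level labelset; not drawn from any specific covered study), the sketch below downweights child-class predictions by the confidence assigned to their superclass:

```python
import numpy as np

# Hypothetical flat labelset: indices 0-1 are superclasses, 2-5 their children.
PARENT_OF = {2: 0, 3: 0, 4: 1, 5: 1}

def hierarchy_adjusted(probs: np.ndarray) -> np.ndarray:
    """Bias each child class by its superclass confidence, then renormalize."""
    adjusted = probs.copy()
    for child, parent in PARENT_OF.items():
        adjusted[child] *= probs[parent]  # low superclass confidence penalizes children
    return adjusted / adjusted.sum()

flat_predictions = np.array([0.05, 0.40, 0.30, 0.05, 0.15, 0.05])
print(hierarchy_adjusted(flat_predictions))
```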
Having presented a set of knowledge resources capable of enhancing classification, we move on to related work that realizes this potential, implementing various enrichment approaches for different data modalities.

4 Representation Enrichment Approaches for Classification

Here we combine the material covered in previous sections (content-based approaches from Section 2 and knowledge resources/representation enrichment motivation from Section 3) into a presentation of methods in the literature that inject knowledge into data representations in order to enhance classification tasks for different modalities.
We separate and organize the outlined work in three broad categories, depending on how the enrichment is applied in the representation construction process. At the same time, we draw parallels to the content-based categories outlined in Section 2, to which this grouping bears meaningful analogies. Given this context, the enrichment paradigms include:
Input enrichment/modification (IEM), covering approaches that insert external knowledge at the input feature level, arriving at a configuration where knowledge-based information is fed as an input of the classification pipeline. The simplicity of this paradigm bears similarities to low-level/template-matching approaches (LLTM), since knowledge is treated as an auxiliary modality/channel to data content.
Knowledge-based refinement (KBR), which includes methods that transform/modify/ process existing low-level representations in a preconfigured manner that is directed, guided and/or quantified by external information. This approach builds upon the aggregation-based (AGB) paradigm for content-based representations, with the critical component of knowledge influencing the aggregation mechanism.
Knowledge-aware end-to-end (KAE) systems, where deep hierarchical architectures are built from both content-based and knowledge-based information. The paradigm expands upon deep representation learning (DRL) content-based models, by also including knowledge for building enriched feature hierarchies.
The following sections elaborate on each enrichment category: Section 4.1 deals with IEM methods, followed by KBR and KAE systems in Sections 4.2 and 4.3, respectively. Covered studies are presented in Table 3, conveying similar information to the corresponding table in Section 2; here we additionally include the enrichment strategy (IEM/KBR/KAE) and information about the knowledge resource and/or type. While non-exhaustive, the presented body of work constitutes a descriptive set of characteristic approaches for each proposed enrichment paradigm.
citation | mod. | enrichment - resource | category | representation | labelling | classifiers | metrics
[Elberrichi et al. 2008] | TXT | IEM - Wordnet | LLTM | TFIDF | MC-SL/ML | similarity | F1
[Kumar and Minz 2013] | TXT | IEM - SentiWordNet | AGB | TFIDF, PCA, LSA, GI, GR | MC-SL | SVM, NB, k-NN | ACC
[Nezreg et al. 2014] | TXT | IEM - Wordnet | LLTM | TFIDF | MC-SL/ML | SVM, DT, kNN | P
[Pittaras et al. 2020] | TXT | IEM - Wordnet | DRL | CBOW, BoW, TFIDF | MC-SL/ML | MLP | F1
[Škrlj et al. 2020] | TXT | IEM - Wordnet | LLTM | BoW, TFIDF | MC-SL | SVM, MLP, LSTM | F1
[Škrlj et al. 2021] | TXT | IEM - ConceptNet | LLTM | TFIDF, Word2Vec | MC-SL | SVM, NEURAL, LR | ACC
[Li et al. 2017] | TXT | KBR - Sentiment lexicon, synonyms | LLTM | GloVe | MC-SL | SVM, Adaboost, NB | F1
[Yu et al. 2017] | TXT | KBR - E-ANEW | DRL | skipgram, GloVe | MC-SL/ML | CNN, DAN, LSTM | ACC
[Glavaš and Vulić 2018] | TXT | KBR - Wordnet, synonyms/antonyms | DRL | skipgram, GloVe, fasttext | MC-SL | similarity | ACC
[Shi et al. 2019] | TXT | KBR - Paraphrases | DRL | ELMO | BIN | MLP | ACC
[Chen et al. 2018] | TXT | KAE - Wordnet | DRL | BILSTM, GloVe | MC-SL | NEURAL | ACC
[Sun et al. 2019b] | TXT | KAE - Entities (Dataset) | DRL | TRANSFORMER | BIN | NEURAL | ACC
[Zhang et al. 2019b] | TXT | KAE - Entities (TAGME, Wikidata) | DRL | TRANSFORMER | MC-SL | NEURAL | ACC
[Peters et al. 2019] | TXT | KAE - Entities (Wikipedia, YAGO, Wordnet) | DRL | TRANSFORMER | BIN/MC-SL | NEURAL | ACC
[Ke et al. 2020] | TXT | KAE - Sentiwordnet | DRL | TRANSFORMER | MC-SL | NEURAL | ACC
[Li et al. 2022] | TXT | KAE - Metadata | DRL | TRANSFORMER | MC-SL | NEURAL | ACC
[Liu et al. 2022] | TXT | KAE - Probase | DRL | CNN, TRANSFORMER | MC-SL | NEURAL | ACC
[Benitez and Chang 2003] | IMG | IEM - Wordnet, tags | AGB | TFIDF, color, KMeans, k-NN, SoM | MC-SL | NB, SVM, BN | ACC
[Vogel and Schiele 2007] | IMG | IEM - Labelset | LLTM | color, direction, signal | MC-SL | SVM | ACC
[Marszalek and Schmid 2007] | IMG | IEM - Wordnet, tags | AGB | SIFT, KMeans | MC-SL | SVM | ROC
[Kliegr et al. 2008] | IMG | IEM - Wordnet, tags, Wikipedia | LLTM | MPEG-7 | MC-SL | evol. SVM | P, R
[Binder et al. 2009] | IMG | IEM - Labelset, VOC2006 | AGB | SIFT, KMeans, SPM | MC-SL | SVM | ACC
[Li and Sun 2006] | IMG | KBR - Wordnet | AGB | color, texture, shape, KMeans | BIN, MC-SL | LM, SVM | P, R
[Wu et al. 2010] | IMG | KBR - region | AGB | SIFT, SPC | MC-SL | SVM | AUC
[Deselaers and Ferrari 2011] | IMG | KBR - Imagenet | LLTM | GIST | MC-SL | similarity, SVM | ROC
[Li et al. 2014] | IMG | KBR - Metadata | DRL | Decaf, TF | MC-SL | SVM | mAP
[Menglong et al. 2019] | IMG | KBR - Labelset | DRL | CNN | MC-SL | NEURAL | ACC
[He et al. 2021] | IMG | KBR - Labelset | DRL | CNN | MC-SL/ML | NEURAL | ACC
[Marino et al. 2017] | IMG | KAE - Wordnet, Visual Genome | DRL | CNN | MC-SL | NEURAL | mAP
[Zhang et al. 2019a] | IMG | KAE - Wordnet, Imagenet, Labelset | DRL | CNN | MC-SL | NEURAL | ACC
[Li et al. 2019] | IMG | KAE - Labelset | DRL | TRANSFORMER, CNN, DNN | MC-SL | NEURAL | ACC
[Noh et al. 2019] | IMG | KAE - Wordnet, Visual Genome | DRL | TRANSFORMER | MC-SL | NEURAL | VQA
[Jayathilaka et al. 2021] | IMG | KAE - Labelset | DRL | CNN, MLP | MC-SL | NEURAL | ACC
[Yang et al. 2022] | IMG | KAE - Wordnet | DRL | GNN, PCA, KMeans | MC-SL | NEURAL | ACC
[Cano et al. 2004] | AU | IEM - Labelset | LLTM | MFCC, spectral, psych. | MC-SL | k-NN | ACC
[Hu and Downie 2010] | AU | IEM - Wordnet, E-ANEW, GI | LLTM | TFIDF, signal, spectral, MFCC | MC-SL | SVM | ACC
[Jamdar et al. 2015] | AU | IEM - E-ANEW, Wordnet | LLTM | psych., musical | MC-SL | k-NN | ACC
[Cheng et al. 2008] | AU | KBR - Labelset | AGB | PCP, LCSS, MFCC, ngrams | MC-SL | k-NN | ACC
[Pachet and Roy 2009] | AU | KBR - signal proc. | LLTM | signal, spectral | MC-SL | kSVM | ACC
[Favory et al. 2020] | AU | KBR - Metadata | DRL | CNN, DNN, autoencoders | MC-SL | MLP | ACC
[Zharmagambetov et al. 2022] | AU | KBR - Labelset | DRL | CNN, LSTM | MC-SL/ML | NEURAL | F1
[Bertero and Fung 2016] | AU | KAE - Wordnet, SentiWordnet, Metadata | LLTM, DRL | MFCC, spectral, signal, psych., word2vec | MC-SL | k-NN | ACC
[Jiménez et al. 2018] | AU | KAE - Labelset | DRL | CNN, SIAMESE | MC-SL/ML | NEURAL | ACC
[Sun and Ghaffarzadegan 2020] | AU | KAE - Labelset | DRL | CNN, LSTM | MC-SL/ML | NEURAL | F1
[Zhang et al. 2021] | AU | KAE - ASER | DRL | GCN | MC-SL/ML | NEURAL | mAP
Table 3. Indicative Studies using Representation Enrichment from External Information Sources for Classification
Notation follows Table 1 – additionally, the enrichment-resource column contains entries in the form ENR-RES, where ENR refers to enrichment approaches described in Section 4 and RES to the resource(s) utilized for knowledge injection.

4.1 Input Enrichment and Modification

4.1.1 Overview.

A straightforward avenue towards knowledge-rich approaches for classification is injecting knowledge alongside the content-based feature set C extracted from the input data. This involves modifying the collection of features fed to the learning model with high-level semantic, statistical, and conceptual information S. Such information is obtained by mining structured knowledge resources for knowledge relevant to the input instance, resulting in an enriched feature set \(E=f(C, S)\). The modification function f performs the fusion, applying operations such as rule-based selection, concatenation, or replacement.

4.1.2 Approaches.

A major component in IEM is selecting the resource that will provide the knowledge-based features, as these will be merged with content-based features and directly used by the downstream classifier.
Regarding text, resources such as Wordnet have been widely used for directly extracting conceptual/sense-level information from lexicalizations, as well as for exploiting hierarchical structures in its graph. Such schemes yield common features for different texts with similar semantics, improving generalization potential for the downstream learning machine. For example, the authors in Elberrichi et al. [2008] combine BoW with Wordnet sense statistics weighted via TFIDF. They explore different ways of mapping lexicalizations to senses (e.g., expanding matches to hypernyms), concatenating the lexical and semantic channels and applying \(\chi ^2\)-based feature selection. A similar approach in Nezreg et al. [2014] improves upon Wordnet concept lookups by utilizing multi-word querying and POS information from Treetagger [Schmid 1994]. Further, conceptual mappings are expanded by considering hierarchical relations (e.g., hypernymy, antonymy), and different concept weighting schemes are investigated for the final concatenation of enriched TFIDF features. Furthermore, the work in Pittaras et al. [2020] combines Wordnet sense statistics and CBOW embeddings [Mikolov et al. 2013] over different lexico-semantic fusion, disambiguation, and concept weighting methods. Concept mapping is expanded and regularized with hypernymy-based spreading activation [Collins and Loftus 1975], diffusing semantics in a controlled manner per expansion step. Moreover, document-level taxonomies are built by utilizing hypernymy in Škrlj et al. [2020], leveraging different lexical BoW/TFIDF vector configurations for different terms. Semantic vectors are built via double-normalized TFIDF [Manning et al. 2008] after sense disambiguation, followed by statistical, graph-theoretic, and ranking-based feature selection to arrive at a fixed dimensionality.
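As a simplified sketch of such Wordnet-based IEM for text (not a reproduction of any specific system above; it assumes NLTK with the Wordnet corpus and scikit-learn are available, and uses naive first-sense disambiguation), lexical TFIDF features can be concatenated with a bag of synsets expanded one step to hypernyms:

```python
from collections import Counter

import numpy as np
from nltk.corpus import wordnet as wn  # assumes the Wordnet corpus is downloaded
from sklearn.feature_extraction.text import TfidfVectorizer

def synset_bag(tokens):
    """Count first-sense synsets and their direct hypernyms for one document."""
    bag = Counter()
    for tok in tokens:
        senses = wn.synsets(tok)
        if not senses:
            continue
        sense = senses[0]  # naive disambiguation: most frequent sense
        bag[sense.name()] += 1
        for hyper in sense.hypernyms():  # one-step hypernym expansion
            bag[hyper.name()] += 1
    return bag

docs = ["the dog chased the cat", "stocks fell on the exchange"]
lexical = TfidfVectorizer().fit_transform(docs).toarray()  # content channel C

bags = [synset_bag(doc.split()) for doc in docs]
vocab = sorted(set().union(*bags))
semantic = np.array([[bag[s] for s in vocab] for bag in bags], dtype=float)  # S

enriched = np.hstack([lexical, semantic])  # E = f(C, S) via concatenation
```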
Additional resources exploited include SentiWordNet for sentiment mining in Kumar and Minz [2013]. The authors extract sentiment scores from mined senses, using TFIDF and various content-based aggregation/selection means. The work in Škrlj et al. [2021] mines ConceptNet to generate semantic relational features, by grounding triplets in the ontology during document traversal and collecting TF-IDF bags of relations for each document. These are subsequently concatenated with different token-level and distributed features for multiple classification tasks.
For images, many works employ relevant ground truth like class hierarchies, label semantics, and localized annotations to create high-level representations with IEM, given a lack of high-level visually-oriented knowledge resources.
For instance, visual annotations to image regions and segments have been exploited as prior semantic knowledge. Such metadata can be utilized for high-level semantic feature generation in classification tasks in a variety of ways, and are usually available as supplementary ground truth information in the Labelset. Early applications [Luo and Savakis 2001] use regions marked as “sky” or “grass” to build intermediate classifiers as semantic feature generators, combined with color and texture features. An array of pre-selected concept annotations to fixed-size regions is used in Vogel and Schiele [2007], fed to classifiers that generate conceptual model vectors as high-level features that are concatenated with multiple low-level features such as color, direction and intensity histograms.
Another useful source of knowledge in image datasets are tags, conveying high-level semantics exploitable for improving performance. Utilization avenues include adopting representation extraction techniques from text, using the responses as higher-level semantic features. Given this textual modality, additional conceptual information can be obtained by probing NLP-oriented knowledge resources such as Wordnet. To this end, the work in Benitez and Chang [2003] uses aggregated color features along with image tags mapped to TF and TFIDF vectors. Multimodal vectors are clustered into a knowledge graph of mid-level concepts with various techniques, used as high-level semantics. Wordnet is used for sense disambiguation and for extracting semantic similarities between graph concepts. Additionally, the authors in Marszalek and Schmid [2007] use textual tags, annotations, and labels to extract training examples from Wordnet, exploiting its relational structure (e.g., hypernymy, meronymy, and holonymy) with pruning applied to inhibit propagation towards overly generic graph nodes. Sense activations are then combined with SIFT features and a visual BoW approach. Segment predictions are enhanced with tag-based semantics in Kliegr et al. [2008]: first, a named entity recognition (NER) procedure probes Wordnet to discover concepts most similar to an image tagset, via a targeted hypernym discovery process and special measures to improve coverage (e.g., by retrieving relevant Wikipedia articles in terms of textual similarity and article linkage). This procedure results in a collection of senses matching image tags, which are fused with MPEG-7 features on regions segmented via self-organizing maps [Kohonen 1990] and particle swarm optimization [Kennedy and Eberhart 1995]. Other approaches include building taxonomy membership features, expressing instances with respect to class memberships in a taxonomic resource. For example, the work in Binder et al. [2009] utilizes the VOC2006 taxonomy [Everingham et al. 2005], exploiting hypernymy-based node membership of the image class in the taxonomy graph to build taxonomic features and affect the training loss by modifying the prediction target, in conjunction with visual features based on SIFT and visual BoW.
Regarding audio data, a setting similar to the visual domain persists: auxiliary metadata such as ground truth information and textual annotations facilitate enrichment with an IEM strategy.
One such avenue can be found in the music classification domain, where text metadata in the form of song lyrics are exploited for enrichment, by taking advantage of IEM methods for text. Such an approach is adopted by Hu and Downie [2010], utilizing audio and text features: Wordnet, Wordnet-Affect [Strapparava et al. 2004], General Inquirer, and E-ANEW provide semantic and psycholinguistic responses, generating statistical, POS, and heuristic-based sense and text features in TF and TFIDF ngram bags. These are used to enrich MFCC, spectral, and statistical audio features in various combination configurations. Additionally, in Jamdar et al. [2015] E-ANEW is used to mine valence and arousal scores from song lyrics, with Wordnet powering mapping expansion and semantic disambiguation via sense synonyms and POS information, respectively. Knowledge-based features are adjusted to song-level values and combined with multiple musical, rhythmic, and psychoacoustic features, followed by normalization and scaling.
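For illustration, a minimal sketch of such lexicon-based enrichment follows (toy lexicon entries; actual E-ANEW norms differ), averaging per-word valence/arousal scores to the song level and appending them to an audio feature block:

```python
import numpy as np

# Toy stand-ins for E-ANEW affective norms: word -> (valence, arousal).
EANEW = {"love": (8.0, 5.4), "alone": (2.7, 3.6), "dance": (7.8, 6.8)}

def affect_features(lyrics: str) -> np.ndarray:
    """Average per-word (valence, arousal) norms over the matched lyric tokens."""
    scores = [EANEW[w] for w in lyrics.lower().split() if w in EANEW]
    return np.mean(scores, axis=0) if scores else np.zeros(2)

audio_features = np.random.rand(20)  # placeholder MFCC/spectral feature block
enriched = np.concatenate([audio_features, affect_features("dance alone tonight")])
```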
Other approaches exploit relations found in the ground truth and/or labelset of the available audio data. Namely, an early approach uses coarse-grained classification as intermediate high-level features in Zhang and Kuo [1998], with audio-oriented concept scores being used as model vectors, subsequently classified into finer target classes (e.g., rain, bird sounds). Audio content is modeled via signal and spectrum statistics, psychoacoustic/musical features and rule-based clustering. The study in Cano et al. [2004] uses audio samples with Wordnet annotations expanded with additional audio-related concepts. This ground truth is utilized for training a classification system to produce concept-level confidence scores, such as producing assignments to musical instruments. MFCC, statistical, spectral and psychoacoustic features are utilized for the representation, fed to k-NN classifiers for instrument sound categorization.

4.2 Knowledge-based Refinement

4.2.1 Overview.

Here we present a knowledge-guided analog to AGB, modifying features in an informed manner. In KBR, knowledge-based relations can be used as aggregation criteria for similar operations on corresponding content-based features, acting as, e.g., filtering and membership indicators for informed data fusion. Moreover, representation learning methods can inject constraints, regularization, and scaling into the loss/similarity function or fitness objective. For example, taxonomic information linking the concepts “dog” and “wolf” (i.e., subspecies of “Canis”) may be exploited by an embedding generator to bring such data points closer in the embedding space. Likewise, a database of synonyms could make a BoW approach merge the weights of features matched to the concept “dog” and its synonym “Canis familiaris”. Such modifications are informed/quantified by utilizing external knowledge of suitable encompassed relationships.
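The BoW example above can be sketched as follows (assuming a synonym lexicon given as groups of interchangeable terms): columns of the bag matrix whose terms fall in the same group are summed into a single feature coordinate:

```python
import numpy as np

def merge_synonyms(X: np.ndarray, vocab: list, groups: list):
    """Sum BoW columns within each synonym group; keep the rest unchanged."""
    index = {term: i for i, term in enumerate(vocab)}
    cols, names, used = [], [], set()
    for group in groups:  # e.g., [{"dog", "canis familiaris"}]
        members = [index[t] for t in group if t in index]
        if members:
            cols.append(X[:, members].sum(axis=1))
            names.append("|".join(sorted(group)))
            used.update(members)
    for term, i in index.items():
        if i not in used:
            cols.append(X[:, i])
            names.append(term)
    return np.column_stack(cols), names
```

Analogous knowledge-guided merging applies to visual or audio codebooks, with the lexicon replaced by an appropriate relational resource.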

4.2.2 Approaches.

KBR methods for text exploit relations in NLP-related resources to influence representations. One specific approach utilizes word groups, based on categorical assignments, numerical scores or pairwise relations. Intra-group member vectors are then biased to lie close in the embedding space during training or post-processing phases.
A number of works follow this procedure with sentiment-oriented knowledge. For instance, the approach in Li et al. [2017] expands the GloVe model of Pennington et al. [2014] to apply sentiment-guided bias at the word level, via positive/negative/neutral tags obtained from lexicons with synonym expansion, and globally, via document-level sentiment annotations. Moreover, the work in Yu et al. [2017] introduces refinement of word vectors in a two-stage re-ranking scheme: first, the 10 nearest neighbors of a word are retrieved, followed by re-ranking with respect to the degree of positive and negative sentiment expressed by each. The cosine similarity between word vectors is used to obtain semantic similarity, while the sentiment score is provided by the “valence” norm in the E-ANEW [Warriner et al. 2013] lexicon. The refined GloVe and Skipgram vectors are then used as the final representation.
Other semantic relationships exploited are paraphrase pairs, i.e., text tuples with different content but synonymous semantics. Namely, the work in Shi et al. [2019] retrofits contextualized ELMO embeddings [Peters et al. 2018] with paraphrasal information to learn an orthogonal transformation that maps paraphrasal pairs to collocated projection targets. Further, semantic word relationships of synonymy and antonymy are utilized in works such as Glavaš and Vulić [2018]. There, the authors seek to optimize with respect to antonymy and synonymy constraints while trying to maintain distances between remaining instance pairs. The investigation explores different optimization objectives, knowledge resources (i.e., Wordnet and Roget’s Thesaurus [Jarmasz and Szpakowicz 2004]), and source embeddings (Skipgram, GloVe, and FastText [Joulin et al. 2017]) to train a mapping with a dense neural network.
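A hedged sketch of such a constraint-based objective is given below (PyTorch, toy margin value; the distance-preservation term that the cited work uses to retain the original vector-space topology is omitted for brevity):

```python
import torch

def constraint_loss(emb, syn_pairs, ant_pairs, margin=0.5):
    """emb: (V, d) trainable embeddings; *_pairs: (N, 2) LongTensors of indices."""
    d_syn = 1 - torch.cosine_similarity(emb[syn_pairs[:, 0]], emb[syn_pairs[:, 1]])
    d_ant = 1 - torch.cosine_similarity(emb[ant_pairs[:, 0]], emb[ant_pairs[:, 1]])
    pull = d_syn.mean()                       # synonyms: shrink cosine distance
    push = torch.relu(margin - d_ant).mean()  # antonyms: enforce a minimum distance
    return pull + push

emb = torch.randn(1000, 300, requires_grad=True)  # vectors being refined
syn = torch.randint(0, 1000, (64, 2))
ant = torch.randint(0, 1000, (64, 2))
constraint_loss(emb, syn, ant).backward()
```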
Regarding images, one KBR technique exploits metadata with text-oriented enrichment methods, where graph-based resources like Wordnet are useful, since they provide multiple exploitable relations via node linkage. For example, early work in Srikanth et al. [2005] uses region descriptions, color, position, texture, shape, and blob-to-word translation associations. They build an “ontology-induced visual vocabulary” by reweighting KMeans clusters with word contributions per cluster, while generative modeling of regions to Wordnet senses is used to facilitate classification. A similar approach in Li and Sun [2006] builds a KMeans codebook from keyword-annotated regions, a process augmented with constraints from Wordnet relations that modify the clustering objective to consider the semantic similarity between keywords.
Additional knowledge resources have been exploited to modify representations in the visual domain, such as Imagenet, which provides a visual component to the hierarchy of the Wordnet semantic graph. It is exploited in Deselaers and Ferrari [2011] for computing semantic similarity by extracting the k visually nearest neighbors in Imagenet for each image in an input pair, using a similarity measure based on GIST [Oliva and Torralba 2001], with a final score computed either from pairwise neighbor similarities or the overlap of their category distributions.
Additionally, region-level ground truth and high-level semantic model vectors have been leveraged in KBR methods. First, image region annotations specifying whole individual objects can instruct representation systems to pool together local descriptors extracted from that region. This is investigated in Wu et al. [2010], where object and foreground annotations bias features extracted from similar regions to be grouped into the same visual words, in a “semantics-preserving” codebook (SPC) process. SIFT with SPC and different distance metric learning techniques, such as neighbourhood component analysis (NCA) [Goldberger et al. 2004], are used to build the final representation. Second, a model vector approach may leverage predictions of state-of-the-art classifiers to produce high-level semantics and structures usable for defining meaningful relationships for subsequent representation refinement. Such an approach is pursued by Menglong et al. [2019], using popular CNNs to build a knowledge graph and marking the top k classification results as related, forming pairwise category relationships. Content-based results are refined with a graph-based category similarity measure, improving the accuracy of different deep convolutional nets.
Another refinement avenue includes exploiting knowledge-enabled expansion of the available data pool, building the resulting representations with the support of additional data. For instance, in Li et al. [2014], the authors adopt a multi-instance learning approach with “privileged information” [Vapnik and Vashist 2009], i.e., additional extraneous information available during training. This is realized by using retrieval methods to obtain additional related training samples from the web. For training, metadata in the form of image text descriptions are exploited and mapped to TF vectors, while convolutional Decaf features are used for visual content description [Donahue et al. 2014].
Regarding refinement approaches for audio data, existing methods utilize different subdomains of audio-related knowledge, in a variety of forms, to produce enrichment via feature refining procedures. For instance, knowledge related to audio signal processing presents one such avenue: this approach is followed in Pachet and Roy [2009], where the “Analytical Features” framework encapsulates knowledge about audio feature engineering and design paradigms (e.g., different practices, heuristics, and patterns). These are encoded into operators that process, transform, and compose different signal/spectral-based information. The framework is evaluated over fine-grained categorization of artificial and natural sounds (e.g., dog barks, percussion) with a polynomial SVM.
As in other modalities, ground truth presents a rich source of knowledge in audio data. For instance, fine-grained musical annotations can be utilized as conceptual features and high-level representations, which is explored in Cheng et al. [2008] in the form of manually annotated chord transcriptions. These are used to train a chord-ngram HMM model for tasks such as music emotion classification, which captures sequential information of detected chord progressions, alongside content-based features such as PCP vectors, Longest Common Chord Subsequence similarity, histogram-based measures, and MFCC. The work in Zharmagambetov et al. [2022] uses a CNN-LSTM architecture [Hochreiter and Schmidhuber 1997] jointly with a tree-based ontology built from the training labelset. The ontology is used in a decision tree structure to explicitly model the hierarchical probability distributions during learning, which additionally utilizes unlabeled data via consistency training. Further, in He et al. [2021], hierarchical ontologies are manually built by considering dataset- and labelset-derived semantics, defining coarse and fine label separations. Ontology levels then determine batch sampling strategies for triplet generation, used to feed a CNN equipped with a triplet loss.
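The ontology-guided sampling strategy can be illustrated with the following minimal sketch (hypothetical names; it assumes every fine class has at least two samples and every coarse class contains several fine classes): positives share the anchor’s fine label, while hard negatives share only its coarse ontology level:

```python
import random

def sample_triplet(items, coarse_of):
    """items: list of (sample, fine_label); coarse_of: fine label -> coarse label."""
    anchor, anchor_y = random.choice(items)
    positives = [x for x, y in items if y == anchor_y and x is not anchor]
    negatives = [x for x, y in items  # same coarse level, different fine label
                 if y != anchor_y and coarse_of[y] == coarse_of[anchor_y]]
    return anchor, random.choice(positives), random.choice(negatives)
```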
Other types of ground truth include higher-level information, such as musical genre, which can be used as conceptual features for generic audio classification. This is investigated in Favory et al. [2020], where an autoencoder scheme is used for aligning content and knowledge in a dual-channel architecture. The first channel involves a content-based convolutional autoencoder that ingests spectrogram representations of the audio signal and is trained with a reconstruction loss. The second channel supplies one-hot encodings of semantic audio tags to a feed-forward NN, fitted via cross-entropy. The two learned representations are biased to align via a contrastive loss, producing semantically enriched features, fed to an MLP for sound event recognition, instrument, and genre classification tasks.
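A hedged sketch of such dual-channel alignment follows (PyTorch, with an InfoNCE-style loss as a stand-in for the exact contrastive formulation of the cited work; encoder outputs are replaced by random toy embeddings):

```python
import torch
import torch.nn.functional as F

def alignment_loss(z_content, z_tags, temperature=0.1):
    """Matching (audio, tag) encodings share a batch index; others are negatives."""
    z_c = F.normalize(z_content, dim=-1)
    z_t = F.normalize(z_tags, dim=-1)
    logits = z_c @ z_t.t() / temperature      # (B, B) pairwise similarities
    targets = torch.arange(z_c.size(0))       # diagonal entries are positives
    return F.cross_entropy(logits, targets)

loss = alignment_loss(torch.randn(16, 128), torch.randn(16, 128))
```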

4.3 Knowledge-aware End-to-end Systems

4.3.1 Overview.

In line with the popularity and performance of large end-to-end learning models, KAE approaches utilize knowledge in conjunction with deep learning and neural networks. Here, the injection of external information may occur as input and/or refinement operations, as in previous enrichment strategies—the distinguishing factor is that content-based and external information is jointly exploited in an end-to-end fashion, to automatically learn knowledge-enriched feature hierarchies. As in DRL, KAE approaches often jointly learn the representation and discrimination components that facilitate classification, and routinely arrive at highly transferable representations.

4.3.2 Approaches.

A rich collection of knowledge resources has been utilized in KAE methods, leveraging unsupervised training on enriched data inputs for high-performance transfer learning. Entity information is one such knowledge domain, assigning high-level semantics to sequences of words and linking together different entity lexicalizations.
Entity-aware language models for text are investigated in knowBERT [Peters et al. 2019], which provides approaches for integrating pretrained models with generic knowledge bases structured as triplets or graphs. Self-attentive mention-span embeddings are computed for each candidate entity, followed by neural entity linking [Kolitsas et al. 2018; Lee et al. 2017]. Lexical embeddings are then re-contextualized with the entity-span vectors, using a multihead attention transformer layer. Wikipedia, YAGO, and Wordnet are examined for entity identification, with embeddings built via skipgram vectors. Another approach integrating entities along with phrase semantics is ERNIE [Sun et al. 2019b]. ERNIE expands the BERT model [Devlin et al. 2019] to consider knowledge base information, introduced by knowledge-level masking: phrase-level and conceptual/named-entity-level masking is adopted, extending the byte-pair encoding (BPE) used in BERT. This modeling approach is utilized in pretraining, which is performed on multiple domains and heterogeneous data. Another model focusing on named entities [Zhang et al. 2019b] uses distinct textual and knowledge encoders to handle lexical/syntactic and fine-grained entity-related information, respectively: in contrast with previous studies, content and knowledge data are fed to the model via different input channels. TransE [Bordes et al. 2013] and Wikidata are used to encode entities into embedding vectors, which are combined with lexical token embeddings via multi-headed attention, while TAGME [Ferragina and Scaiella 2010] is used for entity extraction. Entity prediction is added as a pretraining task, utilizing special tokens for entities and masked language modeling.
Furthermore, sentiment and semantic word relationships have been used in KAE; such information can highlight relations between word/sequence pairs, which can be exploited and learned in neural architectures. For instance, in SentiLARE [Ke et al. 2020], POS information is used to extract word-level sentiment polarity from SentiWordnet [Baccianella et al. 2010] to train a multilayer transformer. Masked language modeling is used for pretraining, using context-aware sentiment attention to weigh polarities from individual words into the sentence-level sentiment score. Pretraining subtasks include predicting both sentence- and word-level labels. In Chen et al. [2018], lexical relations including synonymy, antonymy, and hypernymy are mined from Wordnet in order to enhance premise/hypothesis classification on sentence pairs. The relations are represented as binary vectors, mapped via graph embedding methods (i.e., TransE). The model uses two biLSTM layers for sequence encoding and decoding word relationships, respectively. Semantic lexical information is weighted by the alignment score between word pairs. Moreover, the work in Liu et al. [2022] extracts conceptual information of input words from the Probase taxonomy [Wu et al. 2012], applied in a short text classification setting. Temporal CNN and transformer architectures are used for embedding and representation construction, respectively, with the content-based and conceptual channels being merged via cross-attention into enriched features.
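As an illustrative approximation of such cross-attention merging (PyTorch, toy dimensions; not the exact architecture of the cited work), content-based token embeddings can attend over a channel of concept vectors:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

tokens = torch.randn(8, 20, 64)    # content channel: batch of 20-token sequences
concepts = torch.randn(8, 5, 64)   # knowledge channel: 5 concept vectors per text

# Each token queries the concept channel, yielding knowledge-aware summaries
# that are concatenated back onto the content representation.
enriched, _ = attn(query=tokens, key=concepts, value=concepts)
fused = torch.cat([tokens, enriched], dim=-1)  # (8, 20, 128) enriched features
```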
Finally, knowledge in the form of dataset metadata is exploited in Li et al. [2022] with the DASK system; it identifies domain-independent words from dataset source information and builds knowledge graphs that encapsulate their relationship to domain-related content. They use a BERT variant on knowledge-injected data to classify user reviews.
Ground truth and labelset information is widely used in KAE methods for images. For instance, the work in Yang et al. [2022] utilizes content-based information with neighborhood embeddings and knowledge from Wordnet via vectorizing textual node descriptions, building a unified knowledge graph. The structure is sampled by graph attention modules for few-shot classification, using separate subgraphs for different tasks for debiasing. Visual Genome has been exploited to construct fitting knowledge structures for task-specific enrichment. A popular such format is visual knowledge graphs, utilized for enriched image classification. For instance, Visual Genome has been used to build a knowledge graph of candidate image labels in Marino et al. [2017], using encapsulated object-object/object-attribute relationships and fusing scene graphs with Wordnet semantics. Classification uses a Graph Search Neural Network [Li et al. 2016], which scores each node in the knowledge graph and provides a global aggregate prediction. An object detector/classifier is used to identify an initial visual node in the image, from which controlled propagation supplies additional neighbor nodes of visual objects. A visual knowledge graph is also built in Zhang et al. [2019a], containing semantic associations between content in the image and objects and scenes relevant to it. Scene labels are produced either by classifiers trained on Imagenet, or by semantic associations of the image lexical label, extracted from Wordnet. A similarity score based on co-occurrence statistics of objects/scenes detected between image pairs is used to augment learning, fed to different CNNs to predict object/scene labels. Ground truth in the format of scene graphs has also been used in Li et al. [2019] for visual question answering (VQA), which can be viewed as answer classification over visual and textual inputs. Neural perceptual modules (convolutional, attention-based, and dense) are used as specialized operators, each adhering to specific semantic subtasks (e.g., boolean operations, localization of salient/relevant regions, inference) matching the question type structure. Scene graph annotations consisting of object region coordinates, attributes, and relations are used to guide the layout generation and optimization, propagating through each module. Further, ground truth from Visual Genome is exploited in Noh et al. [2019] for VQA tasks. First, structured knowledge in the form of visual descriptions and answers/labels from Visual Genome is used to generate blanked image descriptions that characterize visual recognition tasks. To disambiguate candidate tasks, Wordnet sense-based modeling is used: sampling a task specification during training involves retrieving senses with large lexicalization overlaps with the input question. Sampled task-conditional visual classifiers are subsequently used to score candidate answers, using pretrained neural models and attention-based modeling. In Jayathilaka et al. [2021], explicit and inferred pairwise hierarchical label relations (e.g., subsumption, disjointness) are mapped to n-ball conceptual embeddings. These are joined with content-based DCNN features and projected to the conceptual space with an MLP, with the learning process mapping visual features into the conceptual space defined by the ontology embeddings.
Regarding enrichment of audio classification with KAE systems, a variety of knowledge exploitation methods have been tried. As in the visual modality and other enrichment avenues, text-based ground truth has been a popular choice and important knowledge contributor. For instance, in Bertero and Fung [2016], audio data and text transcriptions are used in a multimodal approach, employing MFCC, spectral, signal-based and psychoacoustic features for audio representation. For text, content is mapped to Word2Vec embeddings, bags of ngrams and features related to syntax, sentiment, antonyms, and speaker turn. Wordnet and SentiWordnet are used for extraction of semantics and sentiment polarity, with neural (CNN, RNN) and CRF [Lafferty et al. 2001] components for classification.
Furthermore, labelset-related ground truth has been explored for audio knowledge injection in NNs. For instance, ontology-aware approaches have utilized labelset class relationships—this is investigated in Jiménez et al. [2018] in multiple ways: first, spectrograms are used with feed-forward NNs to directly model the ontology classes sequentially, i.e., reserving fully-connected layers to produce predictions for each ontology level. Each prediction is subsequently fed to the next level as input features. A second approach uses a siamese NN [Chicco 2021] trained on instance triplets with the Euclidean distance, with intra-class samples being encouraged to map to closely situated vectors in the embedding space. The network covers all potential cases in a two-layer ontology (i.e., matching subclass, only matching superclass, different superclass) and uses the same final classification method as the first architecture. Additional ontology-oriented studies include Sun and Ghaffarzadegan [2020], where the authors consider labelset relations with a model consisting of a base CNN, followed by an LSTM with feed-forward and graph convolutional networks (GCN) [Zhang et al. 2019] modeling intra- and inter-dependencies between levels of the ontology hierarchy, respectively. Their system is evaluated on single- and multi-label classification of urban sounds. Additionally, the work in Zhang et al. [2021] utilizes the ASER eventuality knowledge graph [Zhang et al. 2020] to link acoustic event metadata descriptions with rich relations (e.g., conveying causality, temporal, and contingency relations). The generated associations are subsequently exploited via a relation-aware GCN variant for audio event categorization.
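The sequential ontology-level modeling described above can be sketched as follows (illustrative sizes, simplified relative to the cited systems): superclass confidences are fed, together with the shared features, into the subclass head:

```python
import torch
import torch.nn as nn

class TwoLevelHead(nn.Module):
    """Predict a superclass, then condition the subclass head on its confidences."""
    def __init__(self, feat_dim=128, n_super=4, n_sub=20):
        super().__init__()
        self.super_head = nn.Linear(feat_dim, n_super)
        self.sub_head = nn.Linear(feat_dim + n_super, n_sub)

    def forward(self, feats):
        p_super = self.super_head(feats).softmax(dim=-1)
        # Superclass confidences act as extra input features for the next level.
        sub_logits = self.sub_head(torch.cat([feats, p_super], dim=-1))
        return p_super, sub_logits

p_super, sub_logits = TwoLevelHead()(torch.randn(8, 128))
```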

5 Comparative Analysis

Having investigated characteristic methods of content-based and enrichment paradigms in the literature, we now move on to provide a critical comparison between them. We discuss common representation desiderata, summarized in a qualitative analysis in Table 4. Additionally, we provide indicative performance estimates in terms of classification accuracy;1 these showcase a general trend of average performance and stability improvement as the complexity of a content-based/enrichment paradigm increases. At the same time, while enrichment efforts show promise, they are currently outperformed by large content-based classifiers engineered to score very large datasets. As the field of knowledge-augmented ML matures, we expect enrichment approaches to reach and/or surpass their content-based counterparts, along with providing additional benefits that come with structured knowledge (e.g., explainability).
| | Content-based Paradigm | | | Enrichment Paradigm | | |
| Desired Attributes | LLTM | AGB | DRL | IEM | KBR | KAE |
| High-level semantics | X | ? | ✓ | ✓ | ✓ | ✓ |
| Explainable | ✓ | ? | X | ✓ | ? | X |
| Data-driven/learned | X | ? | ✓ | X | ? | ✓ |
| Low-dimensional/space-efficient | ? | ✓ | ✓ | X | ✓ | ✓ |
| Data efficient/lean | ✓ | ✓ | X | ✓ | ✓ | X |
| Computationally efficient | ? | X | X | ✓ | ? | X |
| Reusable/transferable | X | ? | ✓ | ✓ | ✓ | ✓ |
| Avg. accuracy % (indicative) | \(86.2 \pm 8.03\) | \(88.47 \pm 7.98\) | \(90.78 \pm 5.17\) | \(83.54 \pm 13.34\) | \(84.78 \pm 8.06\) | \(86.38 \pm 7.27\) |
Table 4. Comparison between Content-based and Enrichment Paradigms
Checkmarks (✓), X’s (X), and question marks (?) indicate desired attributes that are true, false, or not determined/inconclusive for the paradigm in the corresponding column. Paradigm average accuracy scores are approximated with the procedure detailed in Appendix A.1.

5.1 Knowledge-agnostic Representation Paradigms

Section 2 covered broad approaches and indicative related work on content-based representations, organized in terms of the richness of semantics encapsulated in the output features. We now move on to a finer comparison, considering desired qualities for representations [Bengio et al. 2013] and general trends observed in the proposed paradigms.2
LLTM methods generally have no data needs outside the dataset of interest and create explainable representations: a feature coordinate has a clear, non-ambiguous meaning, easily understood by referring to the generation algorithm. However, LLTM heavily relies on handcrafted feature engineering, demanding human expert intervention, knowledge, and familiarity with the application domain. The resulting features comparatively lack rich encapsulated semantics: very large vector spaces may be needed to arrive at adequate expressive power for efficient classification, while outputs are generally not reusable, leading to high-dimensional, specialized, and often computationally demanding representations. As a result, cases with a severe lack of available data, or with high transparency/explainability requirements (e.g., medical/governance domains), could benefit from utilizing LLTM approaches.
AGB methods can generally build space-efficient representations, often configurable to a desired size for the needs of specific tasks. AGB workflows use LLTM responses as inputs: as a result, they may require more data to reach meaningful results, along with increased computing power to fuel the additional processing steps. AGB methods generally produce distributed features, severely harming explainability. However, this improves the expressive power of the final representation and allows some degree of reusability—though with no easy method for fine-tuning feature sets. Finally, such approaches may employ some degree of feature learning, but do so using fixed, preconfigured rules and analytic solutions. Thus, AGB approaches could be favored if explainability requirements are low and not enough resources are available for paradigms of increased complexity and efficiency.
Finally, DRL methods are fully data-driven, accumulating improvements in an incremental, partially stochastic manner. This generally produces semantically rich, distributed and compact features that are transferable and can be fine-tuned, but at the cost of creating black box-like feature extractors with low explainability regarding internal workings and output semantics—an issue which is an open and active area of research [Arras et al. 2017; Becker et al. 2018; Zhang and Zhu 2018]. Further, the reliance of these methods on distributional, data-driven operation renders them highly demanding with respect to the required amount of data and computational resources. As a result, deep representations could be the approach of choice if adequate compute and data resources can support their use, for tasks with very low explainability/transparency constraints.
Given the above, we observe that each approach has clear pros and cons, with no definite, one-size-fits-all solution: the no free lunch theorem [Adam et al. 2019] appears to hold for representation approaches in classification. However, we propose that the historical progress of method evolution from low-level, to aggregation-based, to deep representations reflects research efforts to increase representation richness in semantics and conceptual information.

5.2 Representation Enrichment Approaches

We now move on to an analysis of representation enrichment paradigms and materials covered in Section 4. Overall, we can expect enrichment approaches to produce semantically rich features, provided an applicable knowledge resource of the desired domain and/or of sufficiently high-level information is utilized. Additionally, these paradigms generally present a high degree of compatibility with existing content-based features, i.e., can be utilized for transfer learning and feature reuse. Further, we provide knowledge-specific considerations for each enrichment paradigm below.
Input enrichment and modification (IEM) approaches require instance-level knowledge association: the resource must support mapping single, isolated instances to adequate (semantically and quantitatively meaningful) knowledge units. Thus, encompassed knowledge should meaningfully refer to singletons (e.g., E-ANEW word scores) and not be restricted to n-tuples (e.g., ParaphraseDB paraphrasal pairs). In the latter case, working solutions may be reachable by, e.g., partial/inverse mapping operations. Like the LLTM paradigm, IEM generally outputs enriched features without loss of explainability: content-based features are generally preserved, introducing knowledge-based information that is identifiable in the enriched representation and can be reviewed post-hoc. Since knowledge-based features constitute high-level conceptual information that can aid interpretation and provide intuition, this enrichment strategy could be favored by classification applications that prioritize explainability. The input space-oriented approach inherently supports reuse of content-based information—on the other hand, IEM’s direct usage of knowledge features may make it sensitive to local outliers and redundancies (e.g., many-to-one mappings) introduced by the knowledge resource, giving rise to the need for filtering and/or normalization post-processing steps. Finally, IEM methods do not have large data and compute requirements, but they directly inflate the dimensionality of the input features.
Knowledge-based refinement (KBR) exploits knowledge to guide enhancements and aggregations in groups of content-based representations. Contrary to IEM, this approach may require the knowledge resource to meaningfully refer to n-tuples of data (e.g., paraphrase pairs, semantic triples), linking them together via conceptual high-level relationships. This may restrict the number of resources usable with this enrichment scheme; however, desirable modifications may be viable with alignment operations [Amrouch and Mostefai 2012], data-driven methods, or by engaging with the internal organization to produce usable interfacing solutions, at the cost of suitable preprocessing steps. KBR aims at reusing content-based features by guided refinement, supporting transfer learning without data-driven fine-tuning, allowing pre-existing features to be repurposed for different tasks/domains, exploiting different knowledge resources as each use case demands. However, careful configuration of the knowledge extraction process may be required to appropriately tune and regularize the contribution of different relevant components in the architecture, while the enriched representation may not be explainable (despite arising from explainable associations), as refining modifications may be applied in a distributed manner. Lastly, KBR enriches pre-existing features or generates knowledge-guided representations from scratch; as a result, the overall severity of its computational overhead is not clear.
Finally, knowledge-aware end-to-end (KAE) systems utilize knowledge in a holistic manner alongside content-based information, via deep representation learning. Knowledge utilization in KAE systems is versatile; instance/sub-instance knowledge unit mappings (e.g., word/subword annotations in text) or n-tuple mappings (e.g., ground truth tag/semantic relations) may be used from a knowledge resource. Additionally, KAE shares attributes with its content-based counterpart (DRL), e.g., in terms of strong utilization of representation learning and representation dimensionality, but also with respect to the resulting enriched feature sets lacking explainability—i.e., relying on intuitions based on visualization/post-inspection techniques of model outputs and internal representations [Yosinski et al. 2015; Mikolov et al. 2013]. Finally, training comes with requirements for large amounts of computational resources and data, which are only exacerbated by the multiple resources and/or knowledge extraction and preprocessing steps that may be required.
In light of the above, in the next section, we discuss a high-level view of the material covered, along with findings, insights, and potential future directions of the representation enrichment field.

6 Discussion and Research Findings

In this section, we provide a high-level view of the totality of the covered material, organized in a set of research questions and findings. From a study of the related literature, we can recognize the following:
(a) Knowledge-based enrichment offers a route for enhancing explainability. In machine learning there exists a duality between learning and forming a representation: finding good representations for a given task implies facilitating learning. This has been further accentuated through deep learning approaches [Adadi and Berrada 2018; Burkart and Huber 2021] and emphasized in this work. Given this relation, one would expect that explainability requirements in the learning process could also directly address the representation itself. As argued in this study, enriching representations provides content-agnostic avenues for improving representation explainability, via enriched features that are grounded in—or directly contain—explainable knowledge. This potential seems especially promising for deep neural features, where the performance/explainability tradeoff is most severe.
(b) Research trends towards Representation Learning. The proposed content-based categories provide a view that reflects the general historical evolution of main research efforts in data representations and the resulting increase in representation complexity and richness of encapsulated semantics: simpler LLTM-based features are becoming increasingly inadequate, while the static aggregation/transformation employed by AGB is lacking. Instead, research has shifted to favor strategies with fewer inductive biases, like DRL. Notably, this shift appears to carry over to representation enrichment; observed trends show an increase in complexity, stochasticity, and computational cost of knowledge utilization. Namely, IEM performs simple operations on the feature set of the content-based baseline (e.g., expansion), while KBR applies well-defined engineered modifications at predefined points in the computation, when/where knowledge application is deemed relevant. On the other hand, the enrichment mechanism in KAE systems is entirely delegated to data-driven learning that jointly leverages content and knowledge information, an approach favored in the enrichment domain, as reflected in our literature review and indicated by publication trends (e.g., appendix Figure A.1).
(c) Knowledge-based enrichment presents a promising alternative for rich representations. As stressed in Section 2, content-based paradigms have pros and cons with no one-size-fits-all solution. In light of this, knowledge-based enrichment remains a promising alternative for arriving at rich features, without paying the cost of AGB/DRL (e.g., over-engineering, large complexity jumps and data/compute requirements), while offering control of what knowledge is infused (e.g., pertaining to domains of interest), on which content-based features, and how this enrichment is applied.
(d) Further research is required for optimal knowledge-based representation enrichment. Current enrichment approaches are influenced by content-based paradigms, carrying over working solutions but perpetuating disadvantages. This makes selecting an overall optimal approach difficult, hence the fine-grained comparisons/use-case-specific suggestions provided in Section 5.2. Paradigm-specific constraints imply that careful selection of the resource/enrichment method may be necessary, on a per-application basis. Further research is required to arrive at robust methods, highly compatible with different types of instances. This is necessary, given that knowledge appears to still play an auxiliary/complementary role to content [Pittaras et al. 2020], a remark compounded by modality-centric issues highlighted in subsequent findings. However, novel techniques are continuously being invented, as exemplified by the successful integration of enrichment with deep learning methods in KAE, enabling the exploitation of knowledge in representation learning in a holistic manner, via leveraging both content-based and high-quality structured information in state-of-the-art classification pipelines. We believe that KAE shows promise in the search of optimally fusing distributional and knowledge-oriented information, exploiting the best of both worlds.
(e) Semantic delineation affects representations, knowledge compatibility, and enrichment. In this survey we covered works handling text, image, and audio data modalities, associated with qualitatively different semantic gaps: for text, high-quality semantic segmentation is readily available by virtue of existing rules in language, syntax, and grammar. In contrast, such rules for multimedia are hidden away in our visual/auditory systems; representations for such data are limited to operations on signal values with little to no direct linkage to high-level information [Sethi et al. 2001]. As observed in Section 2, this shapes the description, localization, and extraction of content-based representations. Moreover, these differences may render compiling complex knowledge for non-textual modalities difficult, leading enrichment techniques to largely rely on knowledge of a linguistic nature, which has the capacity to encapsulate higher-level semantics. Indeed, many multimedia-related knowledge resources resort to metadata annotations on large datasets, following a more data-driven approach. This has limited the enrichment of image/audio to indirect knowledge utilization—i.e., via exploiting tags and labelset structure through their textual descriptions and lexicalizations. To this end, producing knowledge resources suitable for direct exploitation in image and audio representation enrichment would be a step towards additional improvements in the performance and interpretation of image/audio classification models.

7 Conclusion

In this survey, we investigated representation enrichment approaches from external knowledge resources, covering different modalities (text, images, and audio) and feature generation techniques, in the context of classification. Related literature was organized into distinctive categories with respect to the representation paradigm employed and summarized in tables to facilitate comparison and lookup. We began by cataloguing knowledge-agnostic representations, organized in three broad categories, namely (a) low-level/template-matching methods, which extract low-level information and handcrafted features, (b) aggregation-based approaches, which combine, transform, or post-process low-level results, and (c) deep representation systems, which build hierarchical feature sets via deep representation learning. Qualitative comparisons and use-case suggestions are provided for each paradigm.
We moved on to expand on the motivation for utilizing knowledge in classification, listing available exploitable resources, along with details regarding their information content, structure, and retrieval. Enriched representations are covered next, i.e., studies that take advantage of such knowledge resources, organizing related work into groups of similar enrichment methodology, namely (a) input enrichment and modification, covering works that inject knowledge-based information into the input feature space, (b) knowledge-based refinement, which aggregates and/or transforms representations via knowledge-determined operations, and (c) knowledge-aware end-to-end systems, consisting of pipelines that jointly learn nonlinear feature hierarchies from content and knowledge inputs. Finally, we compared enrichment categories in representation- and knowledge-oriented contexts and discussed research findings.
There are multiple ways to complement/extend the work in this survey. Our primary focus was the representation component of the classification pipeline—one avenue for future work would be the investigation of enrichment approaches that focus on the learning algorithm, i.e., independent of the representation approach. Additionally, an exploration of knowledge utilization between classification and other ML tasks (e.g., in a comparative study) or a task-agnostic approach would be beneficial towards a better understanding of the wider impact of enrichment. Further, meta-analysis on selected works with comparable experimental setups could be conducted to quantitatively assess the impact of different representation and enrichment paradigms. Finally, the representation building and enrichment paradigms proposed in this survey could be investigated in the context of the classification of multimodal instances.

Footnotes

1
Reported performance stems from aggregating different experiments and datasets; see Appendix A.1 for a discussion of details, limitations, and further results.
2
While the observed trends generally hold and are indicative of each paradigm's behavior, edge cases, grey areas, and exceptions are unavoidable.

A Appendix

A.1 Performance Comparison

Here we describe the approach for computing the aggregated performance scores for the proposed paradigms, as referenced in the table and the relevant discussion in Section 5. Given that this study covers works tackling different classification tasks and modalities, there is great variability in the benchmark datasets, dataset versions, slices and subsets, specific classification subtasks, and evaluation metrics used in each related work item examined. This makes an exact comparison of approaches very difficult, since it has not been possible to determine a shared setting (e.g., a common dataset and metric) for each modality in question. As a result, we are forced to group studies that use different evaluation protocols; we would thus like to stress that the outcomes of this comparison, while useful, should not be considered concrete, definitive evidence, but instead serve as general hints and noisy indications of performance trends for each representation paradigm.
We adopt accuracy as the evaluation metric, being the most prevalent performance measure in the related work discussed in this survey. For studies that evaluate on multiple datasets, we average the top-3 results in terms of test set performance of the proposed approach. We collect accuracy scores, dataset/labelset sizes, and the number of datasets utilized in the evaluation. We average this information for each content-based and enrichment paradigm, using the two most recent studies from the related work discussed in Sections 2 and 4 that report accuracy-based classification results. Results are thus computed by aggregating 6 articles per paradigm, with a pool of 36 articles used to produce the entire body of reported results. This information is summarized with mean and standard deviation values in Table A.1.
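To make the aggregation procedure concrete, the following minimal Python sketch illustrates the computation on hypothetical study records; all study entries and values below are illustrative assumptions, not figures taken from the surveyed articles.

```python
import statistics

# Hypothetical per-study records: (paradigm, per-dataset test accuracies %).
# These entries are illustrative placeholders, not values from surveyed works.
studies = [
    ("LLTM", [85.1, 79.4]),
    ("LLTM", [91.0, 88.2, 87.5, 80.0]),
    ("AGB",  [89.3, 86.7, 85.0]),
    ("DRL",  [93.2, 91.8, 90.1, 88.4]),
]

def top3_mean(accuracies):
    """Average the top-3 test-set accuracies reported by a single study."""
    return statistics.mean(sorted(accuracies, reverse=True)[:3])

# Group per-study scores by paradigm, then report mean and standard deviation,
# mirroring the "Accuracy %" rows of Table A.1.
by_paradigm = {}
for paradigm, accuracies in studies:
    by_paradigm.setdefault(paradigm, []).append(top3_mean(accuracies))

for paradigm, scores in sorted(by_paradigm.items()):
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    print(f"{paradigm}: {mean:.2f} ± {std:.2f}")
```

The same grouping logic extends directly to the dataset count, instance size, and labelset size statistics reported in the table.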
Table A.1. Classification Performance (Accuracy %) and Dataset-related Statistics (Count, Sample/Labelset Size) of Articles Discussed in This Study, Averaged by the Content-based and Knowledge-based Enrichment Paradigms Proposed in the Survey

Content-based Paradigms
Statistic | LLTM | AGB | DRL
Accuracy % | \(86.2 \pm 8.03\) | \(88.47 \pm 7.98\) | \(90.78 \pm 5.17\)
Dataset size | \(626.08 \pm 427.76\) | \(55,665.85 \pm 116,282.2\) | \(2,279,126.92 \pm 2,862,928.64\)
Number of datasets | \(1.33 \pm 0.52\) | \(5.5 \pm 5.43\) | \(2.83 \pm 3.54\)
Labelset size | \(4.67 \pm 3.01\) | \(43.82 \pm 87.9\) | \(2,104.18 \pm 4,619.3\)

Enrichment Paradigms
Statistic | IEM | KBR | KAE
Accuracy % | \(83.54 \pm 13.34\) | \(84.78 \pm 8.06\) | \(86.38 \pm 7.27\)
Dataset size | \(3,108.02 \pm 3,433.92\) | \(242,867.72 \pm 533,991.79\) | \(80,585.86 \pm 80,412.9\)
Number of datasets | \(3.17 \pm 5.31\) | \(2.17 \pm 1.33\) | \(2.67 \pm 1.75\)
Labelset size | \(5.63 \pm 2.87\) | \(420.56 \pm 415.23\) | \(231.58 \pm 266.11\)
See Appendix A.1 for a discussion of the details and limitations of the approach by which these values were computed.
Figure A.1. Mean publication year for \(\approx\)50 knowledge-enrichment works considered, grouped by knowledge enrichment paradigm and modality. We can observe an upward trend towards deep representation learning for knowledge utilization, reflecting the historical evolution of enrichment approaches.
Along with the trend of average accuracy increasing as paradigms become more complex (discussed at the beginning of Section 5), we provide per-category scores for the number of datasets utilized for evaluation and their instance/labelset sizes. These also showcase upward trends in accordance with approach complexity for content-based methods (i.e., from LLTM to AGB to DRL); on the other hand, these statistics do not follow such a monotonic trend for enrichment approaches. This could be explained by the fact that enrichment workflows include external knowledge resources in their processing and learning pipelines, the utilization of which often results in a significant expansion of the input information, and thus of the effective input dataset. This, along with the additional computational cost of knowledge integration in the learning method, could act as a considerable limiting factor in scaling dataset counts and sizes when adopting representation strategies of greater complexity; this could in turn explain the more conservative experiment scaling in the IEM, KBR, and KAE paradigms, compared to their content-based counterparts.
To conclude, we offer this brief comparative snapshot, along with the larger qualitative discussion in Section 5, to provide readers with an approximate view of how content-based and enrichment methods compare with respect to classification performance and the scale of empirical analysis. A rigorous quantitative investigation (e.g., applying selected systems to common, fixed benchmarks and evaluation protocols) is outside the scope of this survey, and we reserve such efforts for future work.

References

[1]
A. Adadi and M. Berrada. 2018. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access 6 (2018), 52138–52160.
[2]
Stavros P. Adam, Stamatios-Aggelos N. Alexandropoulos, Panos M. Pardalos, and Michael N. Vrahatis. 2019. No free lunch theorem: A review. Approximation and Optimization 145 (2019), 57–82.
[3]
C. C. Aggarwal. 2015. Data classification. In Data Mining: The Textbook. Springer, 285–344.
[4]
P. F. Alcantarilla, A. Bartoli, and A. J. Davison. 2012. KAZE features. In Proceedings of the European Conference on Computer Vision. Springer, 214–227.
[5]
Alo Allik, György Fazekas, and Mark B. Sandler. 2016. An ontology for audio features. In Proceedings of the ISMIR. 73–79.
[6]
B. Altınel and M. C. Ganiz. 2018. Semantic text classification: A survey of past and recent advances. Information Processing and Management 54, 6 (2018), 1129–1153.
[7]
Giuseppe Amato, Fabrizio Falchi, and Claudio Gennaro. 2015. Fast image classification for monument recognition. Journal on Computing and Cultural Heritage 8, 4 (2015), 1–25.
[8]
Siham Amrouch and Sihem Mostefai. 2012. Survey on the literature of ontology mapping, alignment and merging. In Proceedings of the 2012 International Conference on Information Technology and e-Services. IEEE, 1–5.
[9]
L. H. Anaya. 2011. Comparing Latent Dirichlet Allocation and Latent Semantic Analysis as Classifiers. ERIC.
[10]
X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals. 2012. Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, and Language Processing 20, 2 (2012), 356–370.
[11]
E. Arisoy, T. N. Sainath, B. Kingsbury, and B. Ramabhadran. 2012. Deep neural network language models. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT. 20–28.
[12]
L. Arras, F. Horn, G. Montavon, K. Müller, and W. Samek. 2017. “What is relevant in a text document?”: An interpretable machine learning approach. PloS One 12, 8 (2017), e0181142.
[13]
N. Aye, F. Hattori, and K. Kuwabara. 2008. Use of ontologies for bridging semantic gaps in distant communication. International Conference on Innovations in Information Technology (2008), 371–375.
[14]
S. Baccianella, A. Esuli, and F. Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of LREC, Vol. 10. 2200–2204.
[15]
Dima Badawi and Hakan Altınçay. 2014. A novel framework for termset selection and weighting in binary text classification. Engineering Applications of Artificial Intelligence 35 (2014), 38–53.
[16]
Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR’15).
[17]
M. R. Bai and M. Chen. 2007. Intelligent preprocessing and classification of audio signals. Journal of the Audio Engineering Society 55, 5 (2007), 372–384.
[18]
C. F. Baker, C. J. Fillmore, and J. B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 1. Association for Computational Linguistics, 86–90.
[19]
B. K. Baniya, J. Lee, and Z. Li. 2014. Audio feature reduction and analysis for automatic music genre classification. In Proceedings of the 2014 IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 457–462.
[20]
M. Baroni, G. Dinu, and G. Kruszewski. 2014. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. 238–247.
[21]
M. Basseville. 1989. Distance measures for signal processing and pattern recognition. Signal Processing 18, 4 (1989), 349–369.
[22]
H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. 2008. Speeded-up robust features (SURF). Computer Vision and Image Understanding 110, 3 (2008), 346–359.
[23]
Sören Becker, Marcel Ackermann, Sebastian Lapuschkin, Klaus-Robert Müller, and Wojciech Samek. 2018. Interpreting and Explaining Deep Neural Networks for Classification of audio signals. CoRR abs/1807.03418.
[24]
R. Bellman. 2013. Dynamic Programming. Courier Corporation.
[25]
Y. Bengio. 2009. Learning deep architectures for AI. Foundations and Trends® in Machine Learning 2, 1 (2009), 1–127.
[26]
Y. Bengio. 2011. Deep learning of representations for unsupervised and transfer learning. In JMLR Workshop and Conference Proceedings. JMLR.org, 1–20.
[27]
Y. Bengio, A. Courville, and P. Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 8 (2013), 1798–1828.
[28]
Y. Bengio, O. Delalleau, and N. Le Roux. 2005. The curse of dimensionality for local kernel machines. Technical Report 1258 (2005).
[29]
Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. 2007. Greedy layer-wise training of deep networks. In Proceedings of the Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA., 153–160.
[30]
A. B. Benitez and S. Chang. 2003. Image classification using multimedia knowledge networks. In Proceedings of the 2003 International Conference on Image Processing. IEEE, III–613.
[31]
D. Bertero and P. Fung. 2016. Deep learning of audio and language features for humor prediction. In Proceedings of the Tenth International Conference on Language Resources and Evaluation. 496–501.
[32]
S. Bhattacharyya. 2011. A brief survey of color image preprocessing and segmentation techniques. Journal of Pattern Recognition Research 6, 1 (2011), 120–129.
[33]
X. Bian, C. Chen, L. Tian, and Q. Du. 2017. Fusing local and global features for high-resolution scene classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 10, 6 (2017), 2889–2901.
[34]
A. Binder, M. Kawanabe, and U. Brefeld. 2009. Efficient classification of images with taxonomies. In Proceedings of the Asian Conference on Computer Vision. Springer, 351–362.
[35]
D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.
[36]
Jonathan Bodine and Dorit S. Hochbaum. 2022. A better decision tree: The max-cut decision tree with modified PCA improves accuracy and running time. SN Computer Science 3, 4 (2022), 1–18.
[37]
K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 1247–1250.
[38]
A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Proceedings of the Advances in Neural Information Processing Systems. 2787–2795.
[39]
A. Borghesi, F. Baldo, and M. Milano. 2020. Improving deep learning models via constraint-based domain knowledge: A brief survey. arXiv:2005.10691. Retrieved from https://arxiv.org/abs/2005.10691.
[40]
H. Boyer, X. Serra, and G. Peeters. 1999. Audio descriptors and descriptor schemes in the context of MPEG-7. In Proceedings of the 1999 International Computer Music Conference.
[41]
R. N. Bracewell and R. N. Bracewell. 1986. The Fourier Transform and its Applications. Vol. 31999. McGraw-Hill New York.
[42]
M. M. Bradley and P. J. Lang. 1999. Affective Norms for English Words (ANEW): Instruction Manual and Affective Ratings. Technical Report C-1, The Center for Research in Psychophysiology, University of Florida.
[43]
C. J. Burges, J. C. Platt, and S. Jana. 2002. Extracting noise-robust features from audio data. In Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, I–1021.
[44]
N. Burkart and M. F. Huber. 2021. A survey on the explainability of supervised machine learning. Journal of Artificial Intelligence Research 70 (2021), 245–317.
[45]
M. Cadoli and F. M. Donini. 1997. A survey on knowledge compilation. AI Communications 10, 3–4 (1997), 137–150. Retrieved from http://content.iospress.com/articles/ai-communications/aic133.
[46]
M. Calonder, V. Lepetit, C. Strecha, and P. Fua. 2010. Brief: Binary robust independent elementary features. In Proceedings of the European Conference on Computer Vision. Springer, 778–792.
[47]
J. Camacho-Collados and M. T. Pilehvar. 2018. From word to sense embeddings: A survey on vector representations of meaning. Journal of Artificial Intelligence Research 63 (2018), 743–788.
[48]
F. Camastra and A. Vinciarelli. 2015. Machine Learning for Audio, Image and Video Analysis: Theory and Applications. Springer.
[49]
P. Cano, M. Koppenberger, P. Herrera, S. Le Groux, J. Ricard, and N. Wack. 2004. Nearest-neighbor generic sound classification with a WordNet-based taxonomy. In Proceedings of the Audio Engineering Society Convention 116. Audio Engineering Society.
[50]
S. Cha. 2007. Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences 1, 4 (2007), 300–307.
[51]
Simyung Chang, Hyoungwoo Park, Janghoon Cho, Hyunsin Park, Sungrack Yun, and Kyuwoong Hwang. 2021. Subspectral normalization for neural audio data processing. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 850–854.
[52]
Chun-Fu Richard Chen, Quanfu Fan, and Rameswar Panda. 2021. Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 357–366.
[53]
Junyi Chen, Shankai Yan, and Ka-Chun Wong. 2020. Verbal aggression detection on Twitter comments: Convolutional neural network for short-text sentiment analysis. Neural Computing and Applications 32, 15 (2020), 10809–10818.
[54]
Q. Chen, X. Zhu, Z. Ling, D. Inkpen, and S. Wei. 2018. Neural natural language inference models enhanced with external knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2406–2417.
[55]
Heng-Tze Cheng, Yi-Hsuan Yang, Yu-Ching Lin, I-Bin Liao, and Homer H. Chen. 2008. Automatic chord recognition for music classification and retrieval. In Proceedings of the 2008 IEEE International Conference on Multimedia and Expo. IEEE, 1505–1508.
[56]
D. Chicco. 2021. Siamese neural networks: An overview. Artificial Neural Networks 2190 (2021), 73–94.
[57]
K. Choi, G. Fazekas, and M. Sandler. 2016. Explaining deep convolutional neural networks on music classification. CoRR abs/1607.02444.
[58]
X. Chu, I. F. Ilyas, S. Krishnan, and J. Wang. 2016. Data cleaning: Overview and emerging challenges. In Proceedings of the 2016 International Conference on Management of Data. 2201–2206.
[59]
A. M. Collins and E. F. Loftus. 1975. A spreading-activation theory of semantic processing. Psychological Review 82, 6 (1975), 407.
[60]
N. Dalal and B. Triggs. 2005. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 886–893.
[61]
Marina Danilevsky, Kun Qian, Ranit Aharonov, Yannis Katsis, Ban Kawas, and Prithviraj Sen. 2020. A Survey of the State of Explainable AI for Natural Language Processing. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. 447–459.
[62]
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391–407.
[63]
J. Delhumeau, P. Gosselin, H. Jégou, and P. Pérez. 2013. Revisiting the VLAD image representation. In Proceedings of the 21st ACM International Conference on Multimedia. ACM, 653–656.
[64]
J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Proceedings of the Computer Vision and Pattern Recognition. IEEE, 248–255.
[65]
P. Deng. 1990. Inducing decision-making knowledge from data bases: An approach to automating knowledge acquisition. In Proceedings of the 1990 ACM SIGBDP Conference on Trends and Directions in Expert Systems. Elias M. Awad (Ed.), ACM, 189–211.
[66]
T. Deselaers and V. Ferrari. 2011. Visual and semantic similarity in ImageNet. In Proceedings of the CVPR. IEEE Computer Society, 1777–1784. Retrieved from http://dblp.uni-trier.de/db/conf/cvpr/cvpr2011.html#DeselaersF11.
[67]
J. Devlin, M. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
[68]
J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. 2014. Decaf: A deep convolutional activation feature for generic visual recognition. In Proceedings of the International Conference on Machine Learning. PMLR, 647–655.
[69]
F. K. Došilović, M. Brčić, and N. Hlupić. 2018. Explainable artificial intelligence: A survey. In Proceedings of the 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics. IEEE, 0210–0215.
[70]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
[71]
Z. Elberrichi, A. Rahmoun, and M. A. Bentaalah. 2008. Using WordNet for text categorization. International Arab Journal of Information Technology 5, 1 (2008).
[72]
M. Everingham, A. Zisserman, C. K. Williams, L. Van Gool, M. Allan, C. M. Bishop, O. Chapelle, N. Dalal, T. Deselaers, and G. Dorkó. 2005. The 2005 pascal visual object classes challenge. In Proceedings of the Machine Learning Challenges Workshop. Springer, 117–176.
[73]
Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, and Xavier Serra. 2020. COALA: Co-aligned autoencoders for learning semantically enriched audio representations. In Proceedings of the International Conference on Machine Learning.
[74]
P. Ferragina and U. Scaiella. 2010. Tagme: On-the-fly annotation of short text fragments (by wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. 1625–1628.
[75]
L. Ferrone and F. M. Zanzotto. 2020. Symbolic, distributed, and distributional representations for natural language processing in the era of deep learning: A survey. Frontiers in Robotics and AI 6 (2020), 153.
[76]
C. J. Fillmore. 2006. Frame semantics. Cognitive Linguistics: Basic Readings 34 (2006), 373–400.
[77]
C. J. Fillmore, C. R. Johnson, and M. R. L. Petruck. 2003. Background to FrameNet. International Journal of Lexicography 16, 3 (2003), 235–250.
[78]
Z. Fu, G. Lu, K. M. Ting, and D. Zhang. 2010. A survey of audio-based music classification and annotation. IEEE Transactions on Multimedia 13, 2 (2010), 303–319.
[79]
J. Ganitkevitch, B. Van Durme, and C. Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 758–764.
[80]
S. García, J. Luengo, and F. Herrera. 2015. Data Preprocessing in Data Mining. Vol. 72. Springer.
[81]
J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. 2017. Audio set: An ontology and human-labeled dataset for audio events. In Proceedings of the IEEE ICASSP 2017. New Orleans, LA.
[82]
A. Gersho and R. M. Gray. 1992. Vector Quantization and Signal Compression. Kluwer Academic Publishers.
[83]
G. Giannakopoulos, P. Mavridi, G. Paliouras, G. Papadakis, and K. Tserpes. 2012. Representation models for text classification: A comparative analysis over three web document types. In Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics. ACM, 13.
[84]
G. Glavaš and I. Vulić. 2018. Explicit retrofitting of distributional word vectors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 34–45.
[85]
J. Goldberger, G. E. Hinton, S. Roweis, and R. R. Salakhutdinov. 2004. Neighbourhood components analysis. Advances in Neural Information Processing Systems 17 (2004), 513–520.
[86]
Gene H. Golub. 1969. Matrix decompositions and statistical calculations. In Proceedings of the Statistical Computation. Elsevier, 365–397.
[87]
Yuan Gong, Yu-An Chung, and James R. Glass. 2021. AST: Audio spectrogram transformer. In Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association (Brno, Czechia, 30 August - 3 September 2021). ISCA, 571–575.
[88]
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations (ICLR’15, San Diego, CA, USA, May 7-9, 2015), Conference Track Proceedings.
[89]
Roger B. Grosse, Rajat Raina, Helen Kwong, and Andrew Y. Ng. 2007. Shift-Invariance Sparse Coding for Audio Classification. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI’07, Vancouver, BC, Canada, July 19-22, 2007), AUAI Press, 149–158.
[90]
J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, and J. Cai. 2018. Recent advances in convolutional neural networks. Pattern Recognition 77 (2018), 354–377.
[91]
Z. S. Harris. 1954. Distributional structure. Word 10, 2–3 (1954), 146–162.
[92]
Guiqing He, Feng Li, Qiyao Wang, Zongwen Bai, and Yuelei Xu. 2021. A hierarchical sampling based triplet network for fine-grained image classification. Pattern Recognition 115 (2021), 107889.
[93]
H. He and Y. Ma. 2013. Imbalanced learning: Foundations, algorithms, and applications. Wiley-IEEE Press.
[94]
K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[95]
S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, and B. Seybold. 2017. CNN architectures for large-scale audio classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 131–135.
[96]
G. E. Hinton, J. L. McClelland, and D. E. Rumelhart. 1984. Distributed Representations. Carnegie-Mellon University Pittsburgh, PA.
[97]
S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[98]
X. Hu and J. S. Downie. 2010. Improving mood classification in music digital libraries by combining lyrics and audio. In Proceedings of the 10th Annual Joint Conference on Digital Libraries. 159–168.
[99]
G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700–4708.
[100]
D. H. Hubel and T. N. Wiesel. 1962. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of Physiology 160, 1 (1962), 106–154.
[101]
Ioana Ilea, Lionel Bombrun, Christian Germain, Romulus Terebes, Monica Borda, and Yannick Berthoumieu. 2016. Texture image classification with Riemannian Fisher vectors. In Proceedings of the 2016 IEEE International Conference on Image Processing. IEEE, 3543–3547.
[102]
A. K. Jain and R. C. Dubes. 1988. Algorithms for Clustering Data. Prentice-Hall.
[103]
Adit Jamdar, Jessica Abraham, Karishma Khanna, and Rahul Dubey. 2015. Emotion analysis of songs based on lyrical and audio features. CoRR abs/1506.05012 (2015).
[104]
M. Jarmasz and S. Szpakowicz. 2004. Roget’s thesaurus and semantic similarity. Recent Advances in Natural Language Processing III: Selected Papers from RANLP 2003 (2004), 111.
[105]
Mirantha Jayathilaka, Tingting Mu, and Uli Sattler. 2021. Ontology-based n-ball Concept Embeddings Informing Few-shot Image Classification. In Machine Learning with Symbolic Methods and Knowledge Graphs co-located with European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2021), Virtual, September 17, 2021 (CEUR Workshop Proceedings), Vol. 2997. CEUR-WS.org.
[106]
A. Jiménez, B. Elizalde, and B. Raj. 2018. Sound event classification using ontology-based neural networks. In Proceedings of the Annual Conference on Neural Information Processing Systems.
[107]
Y. Jing, Y. Yang, Z. Feng, J. Ye, Y. Yu, and M. Song. 2019. Neural style transfer: A review. IEEE Transactions on Visualization and Computer Graphics 26, 11 (2019), 3365–3385.
[108]
A. G. Jivani. 2011. A comparative study of stemming algorithms. Int. J. Comp. Tech. Appl. 2, 6 (2011), 1930–1938.
[109]
J. Jolion and W. Kropatsch. 2012. Graph based Representations in Pattern Recognition. Vol. 12. Springer Science & Business Media.
[110]
I. Jolliffe. 2011. Principal component analysis. In International Encyclopedia of Statistical Science. Springer, 1094–1096.
[111]
C. Jörgensen, A. Jaimes, A. B. Benitez, and S. Chang. 2001. A conceptual framework and empirical research for classifying visual descriptors. Journal of the American Society for Information Science and Technology 52, 11 (2001), 938–947.
[112]
A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, (EACL’17, Valencia, Spain, April 3-7, 2017) Volume 2: Short Papers. Association for Computational Linguistics, 427–431.
[113]
J. S. Justeson and S. M. Katz. 1995. Technical terminology: Some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1, 1 (1995), 9–27.
[114]
P. Ke, H. Ji, S. Liu, X. Zhu, and M. Huang. 2020. SentiLARE: Sentiment-aware language representation learning with linguistic knowledge. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online, 6975–6988.
[115]
J. Kennedy and R. Eberhart. 1995. Particle swarm optimization. In Proceedings of the ICNN’95-International Conference on Neural Networks, Vol. 4. IEEE, 1942–1948.
[116]
S. Kim, P. Georgiou, and S. Narayanan. 2012. Latent acoustic topic models for unstructured audio classification. APSIPA Transactions on Signal and Information Processing 1 (2012), e6.
[117]
T. Kliegr, K. Chandramouli, J. Nemrava, V. Svatek, and E. Izquierdo. 2008. Combining image captions and visual analysis for image concept classification. In Proceedings of the 9th International Workshop on Multimedia Data Mining: Held in conjunction with the ACM SIGKDD 2008. 8–17.
[118]
E. C. Knight, S. Poo Hernandez, E. M. Bayne, V. Bulitko, and B. V. Tucker. 2020. Pre-processing spectrogram parameters improve the accuracy of bioacoustic classification using convolutional neural networks. Bioacoustics 29, 3 (2020), 337–355.
[119]
T. Kohonen. 1990. The self-organizing map. Proc. IEEE 78, 9 (1990), 1464–1480.
[120]
Nikolaos Kolitsas, Octavian-Eugen Ganea, and Thomas Hofmann. 2018. End-to-End neural entity linking. In Proceedings of the 22nd Conference on Computational Natural Language Learning, (CoNLL’18, Brussels, Belgium, October 31 - November 1, 2018), Association for Computational Linguistics, 519–529.
[121]
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vis. 123, 1 (2017), 32–73.
[122]
A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 1097–1105. Retrieved from http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
[123]
V. Kumar and S. Minz. 2013. Mood classification of lyrics using SentiWordNet. In Proceedings of the 2013 International Conference on Computer Communication and Informatics. IEEE, 1–5.
[124]
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML’01, Williams College, Williamstown, MA, USA, June 28 - July 1, 2001), Morgan Kaufmann, 282–289.
[125]
C. Laurier, O. Lartillot, T. Eerola, and P. Toiviainen. 2009. Exploring relationships between audio features and emotion in music. In Proceedings of the ESCOM 2009: 7th Triennial Conference of European Society for the Cognitive Sciences of Music.
[126]
S. Lazebnik, C. Schmid, and J. Ponce. 2006. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2. IEEE, 2169–2178.
[127]
Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. 2007. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems. 801–808.
[128]
K. Lee and D. P. Ellis. 2010. Audio-based semantic concept classification for consumer video. IEEE Transactions on Audio, Speech, and Language Processing 18, 6 (2010), 1406–1416.
[129]
Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP’17, Copenhagen, Denmark, September 9-11, 2017). Association for Computational Linguistics, 188–197.
[130]
J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. Van Kleef, and S. Auer. 2015. DBpedia–a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6, 2 (2015), 167–195.
[131]
D. B. Lenat, R. V. Guha, K. Pittman, D. Pratt, and M. Shepherd. 1990. Cyc: Toward programs with common sense. Communications of the ACM 33, 8 (1990), 30–49.
[132]
Stefan Leutenegger, Margarita Chli, and Roland Y. Siegwart. 2011. BRISK: Binary robust invariant scalable keypoints. In Proceedings of the 2011 International Conference on Computer Vision. IEEE, 2548–2555.
[133]
D. D. Lewis and M. Ringuette. 1994. A comparison of two learning algorithms for text categorization. In Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval, Vol. 33. 81–93.
[134]
Guohao Li, Xin Wang, and Wenwu Zhu. 2019. Perceptual visual reasoning with knowledge propagation. In Proceedings of the 27th ACM International Conference on Multimedia. 530–538.
[135]
Tian Li, Xiang Chen, Zhen Dong, Kurt Keutzer, and Shanghang Zhang. 2022. Domain-adaptive text classification with Structured Knowledge from Unlabeled Data. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI’22). Vienna, Austria, ijcai.org, 4216–4222.
[136]
W. Li, L. Niu, and D. Xu. 2014. Exploiting privileged information from web data for image categorization. In Proceedings of the European Conference on Computer Vision. Springer, 437–452.
[137]
W. Li and M. Sun. 2006. Automatic image annotation based on WordNet and hierarchical ensembles. In Proceedings of CICLing. Alexander F. Gelbukh (Ed.), Lecture Notes in Computer Science, Vol. 3878, Springer, 417–428. Retrieved from http://dblp.uni-trier.de/db/conf/cicling/cicling2006.html#LiM06.
[138]
Y. Li, Q. Pan, T. Yang, S. Wang, J. Tang, and E. Cambria. 2017. Learning word representations for sentiment analysis. Cognitive Computation 9, 6 (2017), 843–851.
[139]
Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. 2016. Gated graph sequence neural networks. In 4th International Conference on Learning Representations (ICLR’16, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings).
[140]
Y. H. Li and A. K. Jain. 1998. Classification of text documents. The Computer Journal 41, 8 (1998), 537–546.
[141]
H. Liu and P. Singh. 2004. ConceptNet – a practical commonsense reasoning toolkit. BT Technology Journal 22, 4 (2004), 211–226.
[142]
Yingying Liu, Peipei Li, and Xuegang Hu. 2022. Combining context-relevant features with multi-stage attention network for short text classification. Computer Speech and Language 71 (2022), 101268.
[143]
Y. Liu, Z. Liu, T. Chua, and M. Sun. 2015. Topical word embeddings. In Proceedings of the 29th AAAI Conference on Artificial Intelligence. Citeseer.
[144]
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10012–10022.
[145]
D. G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 2 (2004), 91–110.
[146]
D. Lu and Q. Weng. 2007. A survey of image classification methods and techniques for improving classification performance. International Journal of Remote Sensing 28, 5 (2007), 823–870.
[147]
J. Luo and A. Savakis. 2001. Indoor vs outdoor classification of consumer photographs using low-level and semantic features. In Proceedings of the 2001 International Conference on Image Processing. IEEE, 745–748.
[148]
B. S. Manjunath and W. Ma. 1996. Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence 18, 8 (1996), 837–842.
[149]
C. D. Manning, P. Raghavan, and H. Schütze. 2008. Scoring, Term Weighting, and the Vector Space Model. Cambridge University Press, 100–123.
[150]
Kenneth Marino, Ruslan Salakhutdinov, and Abhinav Gupta. 2017. The more you know: Using knowledge graphs for image classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'17, Honolulu, HI, USA, July 21-26, 2017). IEEE Computer Society, 20–28.
[151]
Ladislav Maršík, J. Pokornyy, and Martin Ilcík. 2014. Improving music classification using harmonic complexity. In Proceedings of the 14th Conference Information Technologies-Applications and Theory. 13–17.
[152]
M. Marszalek and C. Schmid. 2007. Semantic hierarchies for visual object recognition. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–7.
[153]
S. Martinčić-Ipšić, T. Miličić, and L. Todorovski. 2019. The influence of feature representation of text on the performance of document classification. Applied Sciences 9, 4 (2019), 743.
[154]
L. R. Medsker and L. Jain. 2001. Recurrent Neural Networks: Design and Applications. CRC Press.
[155]
R. Mehrotra, K. R. Namuduri, and N. Ranganathan. 1992. Gabor filter-based edge detection. Pattern Recognition 25, 12 (1992), 1479–1494.
[156]
Julia A. Meister, Khuong An Nguyen, and Zhiyuan Luo. 2022. Audio feature ranking for sound-based COVID-19 patient detection. In Progress in Artificial Intelligence - 21st EPIA Conference on Artificial Intelligence (EPIA'22, Lisbon, Portugal, August 31 - September 2, 2022, Proceedings) (Lecture Notes in Computer Science), Vol. 13566. Springer, 146–158.
[157]
Cui Menglong, Ji Detao, Zeng Ting, Zhang Dehai, Xie Cheng, Chen Zhibo, and Xia Xiaoqiang. 2019. Image classification based on image knowledge graph and semantics. In Proceedings of the 2019 IEEE 23rd International Conference on Computer Supported Cooperative Work in Design. IEEE, 81–86.
[158]
N. Mesgarani, M. Slaney, and S. A. Shamma. 2006. Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations. IEEE Transactions on Audio, Speech, and Language Processing 14, 3 (2006), 920–930.
[159]
P. Miettinen. 2009. Matrix Decomposition Methods for Data Mining: Computational Complexity and Algorithms. Ph.D. Dissertation, University of Helsinki.
[160]
K. Mikolajczyk, B. Leibe, and B. Schiele. 2005. Local features for object class recognition. In Proceedings of the 10th IEEE International Conference on Computer Vision. IEEE, 1792–1799.
[161]
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems. Curran Associates, Inc., 3111–3119.
[162]
G. A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM 38, 11 (1995), 39–41.
[163]
S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, and J. Gao. 2021. Deep learning–based text classification: A comprehensive review. ACM Computing Surveys 54, 3 (2021), 40 pages.
[164]
M. Mohri, A. Rostamizadeh, and A. Talwalkar. 2018. Foundations of Machine Learning. MIT press.
[165]
Seyyed Hamid Samareh Moosavi and Vahid Khatibi Bardsiri. 2019. Poor and rich optimization algorithm: A new human-based and multi populations algorithm. Engineering Applications of Artificial Intelligence 86 (2019), 165–181.
[166]
Loris Nanni, Gianluca Maguolo, Sheryl Brahnam, and Michelangelo Paci. 2021. An ensemble of convolutional neural networks for audio classification. Applied Sciences 11, 13 (2021), 5796.
[167]
R. Navigli and S. P. Ponzetto. 2010. BabelNet: Building a very large multilingual semantic network. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 216–225.
[168]
H. Nezreg, H. Lehbab, and H. Belbachir. 2014. Conceptual representation using wordnet for text categorization. International Journal of Computer and Communication Engineering 3, 1 (2014), 27.
[169]
A. D. Ningtyas, E. B. Nababan, and S. Efendi. 2022. Performance analysis of local binary pattern and k-nearest neighbor on image classification of fingers leaves. International Journal of Nonlinear Analysis and Applications 13, 1 (2022), 1701–1708.
[170]
Hyeonwoo Noh, Taehoon Kim, Jonghwan Mun, and Bohyung Han. 2019. Transfer learning via unsupervised task discovery for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8385–8394.
[171]
T. Ojala, M. Pietikainen, and T. Maenpaa. 2002. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 7 (2002), 971–987.
[172]
A. Oliva and A. Torralba. 2001. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision 42, 3 (2001), 145–175.
[173]
A. Oliva and A. Torralba. 2006. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research 155 (2006), 23–36.
[174]
Bruno A. Olshausen and David J. Field. 1996. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 6583 (1996), 607–609.
[175]
F. Pachet and P. Roy. 2009. Analytical features: A knowledge-based approach to audio feature generation. EURASIP Journal on Audio, Speech, and Music Processing 2009 (2009), 1–23.
[176]
Lin Pan, Chung-Wei Hang, Avirup Sil, and Saloni Potdar. 2022. Improved text classification via contrastive adversarial training. In Proceedings of the AAAI Conference on Artificial Intelligence. 11130–11138.
[177]
S. Pei and C. Lin. 1995. Image normalization for pattern recognition. Image and Vision Computing 13, 10 (1995), 711–723.
[178]
J. Pennington, R. Socher, and C. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 1532–1543.
[179]
F. Perronnin and C. Dance. 2007. Fisher kernels on visual vocabularies for image categorization. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–8.
[180]
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (NAACL-HLT’18, New Orleans, Louisiana, USA, June 1-6, 2018) Volume 1 (Long Papers), Association for Computational Linguistics, 2227–2237.
[181]
Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge Enhanced Contextual Word Representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, (EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019), Association for Computational Linguistics, 43–54.
[182]
N. Pittaras, G. Giannakopoulos, G. Papadakis, and V. Karkaletsis. 2020. Text classification with semantically enriched word embeddings. Natural Language Engineering 27, 4 (2020), 1–35.
[183]
N. Pittaras, F. Markatopoulou, V. Mezaris, and I. Patras. 2017. Comparison of fine-tuning and extension strategies for deep convolutional neural networks. In Proceedings of the International Conference on Multimedia Modeling. Springer, 102–114.
[184]
S. Anuja Prasad and Leena Mary. 2019. A comparative study of different features for vehicle classification. In Proceedings of the 2019 International Conference on Computational Intelligence in Data Science. IEEE, 1–5.
[185]
Y. Raimond, S. A. Abdallah, M. B. Sandler, and F. Giasson. 2007. The music ontology. In Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR).
[186]
A. Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
[187]
M. I. Razzak, S. Naz, and A. Zaib. 2018. Deep learning for medical image processing: Overview, challenges and the future. Classification in BioApps (2018), 323–350.
[188]
Douglas A. Reynolds. 2009. Gaussian Mixture Models. In Encyclopedia of Biometrics. Springer, 659–663.
[189]
Seungmin Rho, Seheon Song, Eenjun Hwang, and Minkoo Kim. 2009. COMUS: Ontological and rule-based reasoning for music recommendation system. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 859–866.
[190]
Vladimir Risojević, Snježana Momić, and Zdenka Babić. 2011. Gabor descriptors for aerial image classification. In Proceedings of the International Conference on Adaptive and Natural Computing Algorithms. Springer, 51–60.
[191]
L. Rokach and O. Maimon. 2005. Clustering methods. In Proceedings of the Data Mining and Knowledge Discovery Handbook. Springer, 321–352.
[192]
E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. 2011. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision. IEEE, 2564–2571.
[193]
D. Rumelhart, G. Hinton, and R. Williams. 1986. Learning representations by back-propagating errors. Nature 323 (1986), 533–536.
[194]
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
[195]
J. A. Russell. 1980. A circumplex model of affect. Journal of Personality and Social Psychology 39, 6 (1980), 1161.
[196]
G. Salton and C. Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24, 5 (1988), 513–523.
[197]
G. Salton, A. Wong, and C. Yang. 1975. A vector space model for automatic indexing. Communications of the ACM 18, 11 (1975), 613–620.
[198]
H. Schmid. 1994. TreeTagger - a language independent part-of-speech tagger. Retrieved from http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/.
[199]
F. Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys 34, 1 (2002), 1–47.
[200]
E. Sejdić, I. Djurović, and J. Jiang. 2009. Time–frequency feature representation using energy concentration: An overview of recent advances. Digital Signal Processing 19, 1 (2009), 153–183.
[201]
I. K. Sethi, I. L. Coman, and D. Stan. 2001. Mining association rules between low-level image features and high-level concepts. In Proceedings of the Data Mining and Knowledge Discovery: Theory, Tools, and Technology III, Vol. 4384. International Society for Optics and Photonics, 279–290.
[202]
Weijia Shi, Muhao Chen, Pei Zhou, and Kai-Wei Chang. 2019. Retrofitting Contextualized Word Embeddings with Paraphrases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP’19, Hong Kong, China, November 3-7, 2019), Association for Computational Linguistics, 1198–1203.
[203]
C. Shorten and T. M. Khoshgoftaar. 2019. A survey on image data augmentation for deep learning. Journal of Big Data 6 (2019).
[204]
Leslie F. Sikos. 2017. The Semantic Gap. Description Logics in Multimedia Reasoning (2017), 51–66.
[205]
C. N. Silla and A. A. Freitas. 2011. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery 22, 1 (2011), 31–72.
[206]
C. Silva and B. Ribeiro. 2003. The importance of stop word removal on recall values in text categorization. In Proceedings of the International Joint Conference on Neural Networks, Vol. 3. IEEE, 1661–1666.
[207]
Mattia Silvestri, Michele Lombardi, and Michela Milano. 2021. Injecting domain knowledge in neural networks: a controlled experiment on a constrained problem. In Integration of Constraint Programming, Artificial Intelligence, and Operations Research: 18th International Conference (CPAIOR’21, Vienna, Austria, July 5–8, 2021, Proceedings 18). Springer, 266–282.
[208]
Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations (ICLR’15. San Diego, CA, USA, May 7-9, 2015), Conference Track Proceedings.
[209]
S. Singh. 2013. Optical character recognition techniques: A survey. Journal of Emerging Trends in Computing and Information Sciences 4, 6 (2013), 545–550.
[210]
B. Škrlj, M. Martinc, J. Kralj, N. Lavrač, and S. Pollak. 2020. tax2vec: Constructing interpretable features from taxonomies for short text classification. Computer Speech and Language 65 (2020), 101104.
[211]
Blaž Škrlj, Matej Martinc, Nada Lavrač, and Senja Pollak. 2021. autoBOT: Evolving neuro-symbolic representations for explainable low resource text classification. Machine Learning 110, 5 (2021), 989–1028.
[212]
M. Slaney. 1998. Auditory toolbox. Interval Research Corporation, Technical Report 10 (1998).
[213]
M. Sokolova and G. Lapalme. 2009. A systematic analysis of performance measures for classification tasks. Information Processing and Management 45, 4 (2009), 427–437.
[214]
S. S. Sonawane and P. A. Kulkarni. 2014. Graph based representation and analysis of text document: A survey of techniques. International Journal of Computer Applications 96, 19 (2014), 1–8.
[215]
B. J. Sowmya, Chetan, and K. G. Srinivasa. 2016. Large scale multi-label text classification of a hierarchical dataset using Rocchio algorithm. In Proceedings of the 2016 International Conference on Computation System and Information Technology for Sustainable Solutions. IEEE, 291–296.
[216]
M. Srikanth, J. Varner, M. Bowden, and D. I. Moldovan. 2005. Exploiting ontologies for automatic image annotation. In Proceedings of SIGIR. Ricardo A. Baeza-Yates, Nivio Ziviani, Gary Marchionini, Alistair Moffat, and John Tait (Eds.), ACM, 552–558. Retrieved from http://dblp.uni-trier.de/db/conf/sigir/sigir2005.html#SrikanthVBM05.
[217]
Divya Srivastava, Rajitha Bakthula, and Suneeta Agarwal. 2019. Image classification using SURF and bag of LBP features constructed by clustering with fixed centers. Multimedia Tools and Applications 78, 11 (2019), 14129–14153.
[218]
N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[219]
D. Storcheus, A. Rostamizadeh, and S. Kumar. 2015. A survey of modern questions and challenges in feature extraction. In Proceedings of the Feature Extraction: Modern Questions and Challenges. PMLR, 1–18.
[220]
Carlo Strapparava and Alessandro Valitutti. 2004. WordNet Affect: An affective extension of WordNet. In Proceedings of LREC, Vol. 4. Lisbon, 40.
[221]
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL’19, Florence, Italy, July 28- August 2, 2019) Volume 1: Long Papers, Association for Computational Linguistics, 3645–3650.
[222]
B. L. Sturm. 2012. A survey of evaluation in music genre recognition. In Proceedings of the International Workshop on Adaptive Multimedia Retrieval. Springer, 29–66.
[223]
T. Subramaniam, H. A. Jalab, and A. Y. Taqa. 2010. Overview of textual anti-spam filtering techniques. International Journal of Physical Sciences 5, 12 (2010), 1869–1882.
[224]
F. M. Suchanek, G. Kasneci, and G. Weikum. 2007. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web. 697–706.
[225]
C. Sun, X. Qiu, Y. Xu, and X. Huang. 2019a. How to fine-tune BERT for text classification? In Proceedings of the China National Conference on Chinese Computational Linguistics. Springer, 194–206.
[226]
Y. Sun and S. Ghaffarzadegan. 2020. An ontology-aware framework for audio event classification. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 321–325.
[227]
Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced representation through knowledge integration. CoRR abs/1904.09223 (2019).
[228]
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[229]
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2023. Efficient Transformers: A Survey. ACM Comput. Surv. 55, 6 (2023), 109:1–109:28.
[230]
T. Theodorou, I. Mporas, and N. Fakotakis. 2014. An overview of automatic audio segmentation. International Journal of Information Technology and Computer Science 6, 11 (2014), 1.
[231]
Thirumoorthy Karpagalingam and Muneeswaran Karuppiah. 2021. Feature selection using hybrid poor and rich optimization algorithm for text classification. Pattern Recognition Letters 147 (2021), 63–70.
[232]
Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, et al. 2022. ResMLP: Feedforward Networks for Image Classification With Data-Efficient Training. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022), 1–9.
[233]
Lloyd N. Trefethen and David Bau III. 1997. Numerical Linear Algebra. Vol. 50. Siam.
[234]
B. Trstenjak, S. Mikac, and D. Donko. 2014. KNN with TF-IDF based framework for text categorization. Procedia Engineering 69 (2014), 1356–1364.
[235]
P. D. Turney and P. Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37 (2010), 141–188.
[236]
T. Tuytelaars and K. Mikolajczyk. 2008. Local invariant feature detectors: A survey. Foundations and Trends® in Computer Graphics and Vision 3, 3 (2008), 177–280.
[237]
J. Uys, N. Du Preez, and E. Uys. 2008. Leveraging unstructured information using topic modelling. In Proceedings of the PICMET’08-2008 Portland International Conference on Management of Engineering & Technology. IEEE, 955–961.
[238]
Xavier Valero and Francesc Alias. 2012. Gammatone cepstral coefficients: Biologically inspired features for non-speech audio classification. IEEE Transactions on Multimedia 14, 6 (2012), 1684–1689.
[239]
D. A. Van Dyk and X. Meng. 2001. The art of data augmentation. Journal of Computational and Graphical Statistics 10, 1 (2001), 1–50.
[240]
V. Vapnik and A. Vashist. 2009. A new learning paradigm: Learning using privileged information. Neural Networks 22, 5–6 (2009), 544–557.
[241]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017), 5998–6008.
[242]
S. Vijayarani, M. J. Ilamathi, and M. Nithya. 2015. Preprocessing techniques for text mining-an overview. International Journal of Computer Science & Communication Networks 5, 1 (2015), 7–16.
[243]
J. Vogel and B. Schiele. 2007. Semantic modeling of natural scenes for content-based image retrieval. International Journal of Computer Vision 72, 2 (2007), 133–157.
[244]
D. Vrandečić and M. Krötzsch. 2014. Wikidata: A free collaborative knowledgebase. Communications of the ACM 57, 10 (2014), 78–85.
[245]
Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, and Yihong Gong. 2010. Locality-constrained linear coding for image classification. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 3360–3367.
[246]
Luyu Wang and Aaron van den Oord. 2021. Multi-format contrastive learning of audio representations. CoRR abs/2103.06508 (2021).
[247]
Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F. Lyon, and Rif A. Saurous. 2017. Trainable frontend for robust and far-field keyword spotting. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 5670–5674.
[248]
A. B. Warriner, V. Kuperman, and M. Brysbaert. 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods 45, 4 (2013), 1191–1207.
[249]
C. Whittaker, B. Ryner, and M. Nazif. 2010. Large-scale automatic classification of phishing pages. In Proceedings of the Network and Distributed System Security Symposium, (NDSS’10, San Diego, California, USA, 28th February - 3rd March 2010), The Internet Society.
[250]
L. Wu, S. C. Hoi, and N. Yu. 2010. Semantics-preserving bag-of-words models and applications. IEEE Transactions on Image Processing 19, 7 (2010), 1908–1920.
[251]
Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q. Zhu. 2012. Probase: A probabilistic taxonomy for text understanding. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. 481–492.
[252]
Chun Yang, Chang Liu, and Xu-Cheng Yin. 2022. Weakly correlated knowledge integration for few-shot image classification. Machine Intelligence Research 19, 1 (2022), 24–37.
[253]
X. Yang, C. Macdonald, and I. Ounis. 2018. Using word embeddings in twitter election classification. Information Retrieval Journal 21, 2–3 (2018), 183–207.
[254]
Jingyi Ye, Xiaojun Jing, and Jia Li. 2017. Sentiment analysis using modified LDA. In Proceedings of the International Conference on Signal and Information Processing, Networking and Computers. Springer, 205–212.
[255]
Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. 2015. Understanding neural networks through deep visualization. CoRR abs/1506.06579 (2015).
[256]
L. Younes, B. Romaniuk, and E. Bittar. 2012. A comprehensive and comparative survey of the SIFT algorithm-feature detection, description, and characterization. In Proceedings of the International Conference on Computer Vision Theory and Applications, Vol. 2. SCITEPRESS, 467–474.
[257]
L. Yu, J. Wang, K. R. Lai, and X. Zhang. 2017. Refining word embeddings for sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 534–539.
[258]
Saadia Zahid, Fawad Hussain, Muhammad Rashid, Muhammad Haroon Yousaf, and Hafiz Adnan Habib. 2015. Optimized audio classification and segmentation algorithm by using ensemble methods. Mathematical Problems in Engineering 2015 (2015), 209814–209825.
[259]
N. M. Zaitoun and M. J. Aqel. 2015. Survey on image segmentation techniques. Procedia Computer Science 65 (2015), 797–806.
[260]
Masoumeh Zareapoor and K. R. Seeja. 2015. Feature extraction or feature selection for text classification: A case study on phishing email detection. International Journal of Information Engineering and Electronic Business 7, 2 (2015), 60.
[261]
Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry, and Marco Tagliasacchi. 2021. LEAF: A learnable frontend for audio classiffication. In 9th International Conference on Learning Representations (ICLR’21). Virtual Event, Austria, OpenReview.net.
[262]
M. D. Zeiler and R. Fergus. 2014. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision. Springer, 818–833.
[263]
D. Zhang, M. Cui, Y. Yang, P. Yang, C. Xie, D. Liu, B. Yu, and Z. Chen. 2019a. Knowledge graph-based image classification refinement. IEEE Access 7 (2019), 57678–57690.
[264]
Hongming Zhang, Xin Liu, Haojie Pan, Yangqiu Song, and Cane Wing-Ki Leung. 2020. ASER: A large-scale eventuality knowledge graph. In Proceedings of the Web Conference 2020. 201–211.
[265]
J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid. 2007. Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision 73, 2 (2007), 213–238.
[266]
Q. Zhang and S. Zhu. 2018. Visual interpretability for deep learning: A survey. Frontiers of Information Technology & Electronic Engineering 19, 1 (2018), 27–39.
[267]
S. Zhang, H. Tong, J. Xu, and R. Maciejewski. 2019. Graph convolutional networks: A comprehensive review. Computational Social Networks 6, 1 (2019), 1–23.
[268]
T. Zhang and C. J. Kuo. 1998. Hierarchical system for content-based audio classification and retrieval. In Proceedings of the Multimedia Storage and Archiving Systems III, Vol. 3527. International Society for Optics and Photonics, 398–409.
[269]
W. Zhang, T. Yoshida, and X. Tang. 2011. A comparative study of TF* IDF, LSI and multi-words for text classification. Expert Systems with Applications 38, 3 (2011), 2758–2765.
[270]
Xinwei Zhang and Bin Wu. 2015. Short text classification based on feature extension using the n-gram model. In Proceedings of the 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery. IEEE, 710–716.
[271]
Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL’19) 1 (2019), 1441–1451.
[272]
Zhiling Zhang, Zelin Zhou, Haifeng Tang, Guangwei Li, Mengyue Wu, and Kenny Q. Zhu. 2021. Enriching ontology with temporal commonsense for low-resource audio tagging. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management. 3652–3656.
[273]
Arman Zharmagambetov, Qingming Tang, Chieh-Chi Kao, Qin Zhang, Ming Sun, Viktor Rozgic, Jasha Droppo, and Chao Wang. 2022. Improved representation learning for acoustic event classification using tree-structured ontology. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 321–325.
[274]
A. Zheng and A. Casari. 2018. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. “O’Reilly Media, Inc.”
[275]
Yaguang Zhu, Chaoyu Jia, Chao Ma, and Qiong Liu. 2019. SURF-BRISK–based image infilling method for terrain classification of a legged robot. Applied Sciences 9, 9 (2019), 1779.
[276]
F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He. 2020. A comprehensive survey on transfer learning. Proc. IEEE 109, 1 (2020), 43–76.
[277]
J. Zou, W. Li, C. Chen, and Q. Du. 2016. Scene classification using local and global features with collaborative representation fusion. Information Sciences 348 (2016), 209–226.

