
CARIn: Constraint-Aware and Responsive Inference on Heterogeneous Devices for Single- and Multi-DNN Workloads

Published: 29 June 2024

Abstract

The relentless expansion of deep learning applications in recent years has prompted a pivotal shift toward on-device execution, driven by the urgent need for real-time processing, heightened privacy concerns, and reduced latency across diverse domains. This article addresses the challenges inherent in optimising the execution of deep neural networks (DNNs) on mobile devices, with a focus on device heterogeneity, multi-DNN execution, and dynamic runtime adaptation. We introduce CARIn, a novel framework designed for the optimised deployment of both single- and multi-DNN applications under user-defined service-level objectives. Leveraging an expressive multi-objective optimisation framework and a runtime-aware sorting and search algorithm (RASS) as the MOO solver, CARIn facilitates efficient adaptation to dynamic conditions while addressing resource contention issues associated with multi-DNN execution. Notably, RASS generates a set of configurations, anticipating subsequent runtime adaptation, ensuring rapid, low-overhead adjustments in response to environmental fluctuations. Extensive evaluation across diverse tasks, including text classification, scene recognition, and face analysis, showcases the versatility of CARIn across various model architectures, such as Convolutional Neural Networks and Transformers, and realistic use cases. We observe a substantial enhancement in the fair treatment of the problem’s objectives, reaching 1.92× when compared to single-model designs and up to 10.69× in contrast to the state-of-the-art OODIn framework. Additionally, we achieve a significant gain of up to 4.06× over hardware-unaware designs in multi-DNN applications. Finally, our framework sustains its performance while effectively eliminating the time overhead associated with identifying the optimal design in response to environmental challenges.

1 Introduction

In recent years, the pervasive growth of deep learning (DL) applications has catalysed a paradigm shift in the field of artificial intelligence, rendering on-device execution a critical imperative [69]. The burgeoning demand for sophisticated deep neural networks (DNNs) spans a myriad of domains, from computer vision to natural language processing, necessitating the deployment of these models directly on mobile devices. This shift from centralised to decentralised computation arises from the intrinsic requirements of real-time processing, enhanced privacy concerns, and the need for reduced latency in diverse applications. As a consequence, the optimisation of executing deep neural networks on-device has emerged as a paramount research frontier.
While the shift toward on-device execution of DNNs represents a pivotal advancement, it is not without its formidable challenges. Device heterogeneity, characterised by the diverse array of hardware and computational capabilities across mobile devices, remains a persistent hurdle. Moreover, emerging challenges, such as the simultaneous execution of multiple DNNs on a single device [60] and the need for dynamic runtime adaptation [64] to evolving environmental conditions, add layers of complexity to the optimisation landscape. Multi-DNN execution introduces intricate dependencies and resource contention issues, necessitating sophisticated orchestration strategies. Runtime adaptation, in turn, mandates the development of intelligent mechanisms capable of dynamically adjusting model parameters and system configurations to optimise performance in real-time scenarios. Addressing these challenges is paramount to unlocking the full potential of on-device deep learning, as it paves the way for the seamless integration of advanced artificial intelligence (AI) capabilities into the fabric of our interconnected devices.
Enhancing the work presented in Reference [61], namely OODIn, this article presents CARIn, a novel framework for the optimised deployment of both single- and multi-DNN applications on mobile devices. The initial work in Reference [61] focused primarily on presenting a new highly parametrised software architecture for DL mobile apps, optimising single-DNN applications and evaluating solely on the image classification task. In this article, we build upon the architecture of OODIn that allows us to efficiently modify model and system parameters and introduce two novel components to meet the new demands of model multi-tenancy and efficient runtime adaptability. First, we develop an expressive multi-objective optimisation (MOO) framework that allows us to capture both single- and multi-DNN workloads and to formally model the performance requirements and constraints of DL applications. Second, we present RASS, a runtime-aware MOO solver that enables rapid, low-overhead adaptation while sustaining high performance under dynamic conditions. Contrary to existing optimisers that yield a single execution plan for a given device [28, 62], our solver generates a set of configurations to accommodate potential variations in resource availability. This eliminates the necessity to continually adjust and resolve the MOO problem whenever a runtime issue arises. Additionally, we broaden the scope of targeted tasks and further augment this by conducting a comprehensive evaluation spanning various model architectures, including Convolutional Neural Networks (CNNs) and Transformers, across a spectrum of realistic scenarios characterised by diverse performance demands.

2 Background and Related Work

2.1 Problem Statement

The primary aim of a DL application is to consistently uphold its performance goals or service-level objectives (SLOs), often referred to as quality of service targets. These SLOs encompass a multifaceted range of critical metrics, including but not confined to accuracy, latency, throughput, memory utilisation, and energy consumption. Achieving and sustaining these objectives requires careful consideration of the specific demands inherent to a given application or system. It is crucial to recognise that this challenge is compounded by two primary factors: (a) the inherent heterogeneity and (b) the dynamic nature of mobile and embedded devices. Adding to this complexity is the increasingly prevalent use of multiple models within DL applications, which places additional demands on the already intricate ecosystem of these devices.

2.1.1 Device Heterogeneity.

Compact devices feature a wide array of hardware configurations, characteristics, and capabilities, leading to a significant level of diversity [1, 2, 22, 66]. This diversity manifests not only across distinct devices, which is referred to as “inter-device heterogeneity,” but also within individual devices, a concept known as “intra-device heterogeneity.” Inter-device heterogeneity reflects the variations in size, processing power, memory capacity, energy efficiency, and more, across different devices. For example, a high-end smartphone will have markedly different hardware specifications than a low-cost IoT sensor node, illustrating the extent of diversity that exists across various devices. Intra-device heterogeneity, on the other hand, arises from the presence of multiple hardware components and subsystems within a device, each with its distinct attributes. In a standard smartphone setup, for instance, one may encounter a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and a Neural Processing Unit (NPU), all with varying clock speeds, energy consumption profiles, memory requirements, and parallelisation capabilities. Due to device heterogeneity, it is very challenging to design a universal DL model that performs efficiently across devices. For example, a model meticulously crafted and optimised to run seamlessly on Google’s Edge Tensor Processing Unit (TPU) may encounter performance issues and inefficiencies when deployed on a different processor, such as a conventional mobile GPU.
As a result, rather than pursuing a one-size-fits-all approach, where a single model is expected to excel universally, the focus shifts toward creating models that are fine-tuned and optimised for the unique features of each target device. These device-specific models are designed to leverage the strengths of the hardware and maximise performance, thereby addressing the inherent challenges associated with device heterogeneity.

2.1.2 Dynamic Environment.

In contemporary mobile and embedded computing ecosystems, the concurrent execution of multiple applications and processes is commonplace. This inherent multi-tasking characteristic introduces notable fluctuations in resource availability and workload demands, thereby rendering the acquisition of sufficient resources for performant task execution a challenging endeavour. Due to environment dynamicity, it is very challenging for a static execution configuration to consistently satisfy the application’s SLOs at any given time. For instance, if a user runs the application outdoors on a hot day, then the device’s temperature may rise and thermal throttling mechanisms can be triggered, causing the CPU or GPU to reduce their clock speeds to prevent overheating [54], resulting in reduced throughput or execution slowdown.
Such scenarios necessitate the ability to dynamically adapt to changing conditions and varying resource availability in real time. This adaptive behaviour is vital to ensure that the application consistently maintains satisfactory performance levels despite the variability in its operational environment.

2.1.3 Multiple DNNs.

Today’s growing demand for more advanced and intelligent systems has given rise to scenarios that mandate the simultaneous utilisation of multiple models, often referred to as “multi-DNN” configurations [60]. This paradigm shift is mainly driven by the need to address specific tasks or solve complex problems that demand a diversified approach, benefiting from the combined expertise of multiple specialised models. Multi-DNN applications showcase a high degree of adaptability in employing multiple models, as they can be harnessed to address a singular, intricate task [24, 73] or several distinct, autonomous tasks [8, 36]. In the former scenario, models typically exhibit interdependence and may require sequential execution, while in the latter scenario, models operate independently, affording them the capability to run in parallel. The efficient deployment of multi-DNN configurations introduces intricacies that pertain to resource allocation and load distribution. Particularly, parallel model execution presents a notably more intricate challenge compared to sequential execution, as models compete for the device’s finite resources. This concurrent operation, coupled with the simultaneous management of multiple tasks, amplifies the overall workload and poses new challenges to resource allocation and coordination.
The attainment of seamless orchestration and effective collaboration among multiple specialised models while upholding stringent performance and quality benchmarks stands as a substantial and multifaceted challenge within the domain of multi-DNN applications.

2.2 Related Work

The field of on-device deep learning has witnessed significant advancements in addressing the challenges posed by device heterogeneity and environment dynamicity in both single- and multi-DNN use cases. This progress reflects a profound shift in the landscape of deep learning research, where a growing emphasis has been placed on ensuring that DL models not only function effectively but also meet specific SLOs across a spectrum of computational environments, ranging from powerful high-end devices to resource-constrained edge computing platforms. Moreover, recent efforts have explored the integration of MOO techniques to further enhance the adaptability and efficiency of on-device DL solutions.

2.2.1 Service-level Objectives.

Most of the prior works directed toward achieving SLOs have predominantly concentrated on scenarios involving multiple DNNs, primarily investigating the tradeoff between accuracy and latency. Within this domain, the majority is focused on system development for edge servers [13, 29, 53, 83, 84], and only a limited number of studies have been devoted to on-device execution, specifically by orchestrating multiple inference requests across heterogeneous processors [24, 51, 73]. In contrast, our proposed framework demonstrates the capability to accommodate a diverse array of SLOs, including but not limited to accuracy, latency, memory footprint, size, and energy consumption, tailored for both single- and multi-DNN on-device applications.

2.2.2 Multi-objective Optimisation.

MOO has been widely employed in conjunction with neural architecture search (NAS) methodologies, culminating in the development of Multi-Objective Neural Architecture Search. This approach is particularly valuable for (a) designing DNNs that optimise not only accuracy but also resource consumption [11, 23, 56] and (b) compressing pretrained models [18, 40, 63]. Notably, our framework represents one of the pioneering efforts to formulate and address device-specific MOO problems to achieve specific SLOs at the system level.

2.2.3 Device-specific Solutions.

The majority of endeavours aimed at addressing device heterogeneity predominantly concentrate on the model level, i.e., by identifying the most fitting DL architecture tailored to a specific hardware platform. Among the prominent model-level methodologies, NAS and model scaling have a central role.
Hardware-aware NAS (HW-NAS) approaches seek to optimise DNN architectures both for high predictive accuracy and for efficient execution on a target deployment platform. Their most prominent premise is the inclusion of (a) hardware constraints, and (b) latency, energy, and other system metrics, as objectives during the search process [3, 4, 65, 80]. HW-NAS usually involves performance prediction to guide the search algorithm. Nonetheless, estimating precise latency, memory, or energy figures can be challenging, and the method’s effectiveness heavily relies on the accuracy of these estimates. Such approaches can also be computationally intensive due to the need to train and evaluate a large number of candidate architectures.
Supernet-based NAS, also known as One-shot NAS, is an approach that leverages a supernet along with weight sharing to facilitate efficient architecture search [6, 30, 35, 64]. A supernet is a network containing all possible architectural choices of a given search space, and it enables the exploration of diverse neural architectures while significantly reducing computational overhead. While this approach reduces the training-time computational requirements, it may not be as effective at tailoring architectures to specific hardware constraints and weight sharing may restrict fine-grained control over architectural decisions.
Last, model scaling involves adjusting parameters such as the depth, width, and input size of a DNN to strike a balance between accuracy and efficiency [10, 58, 68]. This technique is often applied along with NAS or knowledge distillation methods to also accommodate resource constraints. However, model scaling might not fully exploit the unique hardware characteristics of specific devices, potentially leading to sub-optimal performance.

2.2.4 Runtime Adaptation.

In the context of runtime adaptation, research efforts span both the model and system levels. On the model level, the primary focus revolves around the development of techniques designed to dynamically adjust the model’s architecture in response to fluctuations in resource availability. These adaptive models possess the capability to modify their architecture and parameters in real time during inference, effectively responding to the evolving constraints of the computing environment. Prominent examples of such models comprise adaptive supernets [7, 16, 43, 64], adaptive model scaling [20, 72, 77], multi-branch networks [17], early-exit models [4, 34], and a variety of other innovative approaches. However, crafting adaptive mechanisms that seamlessly function across a diverse range of devices can pose significant technical challenges and the adaptability of these networks may introduce certain computational overhead, potentially impacting the performance of real-time applications. At the system level, complementary methods come into play, including dynamic compression [38], adaptive model selection [31], and efficient scheduling on available hardware [70], among others.

2.2.5 Multi-DNN Inference.

To facilitate multi-DNN inference, researchers have also explored solutions at both the model and system levels [60]. From the model perspective, the execution of multiple DNNs aligns closely with the principles of multi-task learning [21, 49, 78, 82], a technique that trains a single model to perform multiple related tasks simultaneously. Consequently, using a single multi-task model for inference can replace the need for concurrent inferences from multiple models. At the system level, most research efforts leverage the heterogeneous processors available on devices and aim to identify the highest-performing mapping strategy. This is typically achieved through approaches that partition the model at the layer level [24, 26, 27, 67] or by introducing task-level priorities [37]. Additionally, there exists a body of research focused on multi-tenant inference systems [12, 75], albeit predominantly concentrating on server-based configurations [71, 74] rather than on-device implementations [60].

3 Overview of CARIn

3.1 Proposed Solution

CARIn addresses the main challenges of on-device DL inference (Section 2.1) in two ways. First, we introduce a novel approach of modelling DL applications, utilising a MOO framework to encapsulate their characteristics (Section 4). Given the rising number and diversity of DL applications, CARIn is able to analytically represent their various performance requirements and constraints, with the required expressivity to support both single- and multi-DNN scenarios. Second, to enable runtime adaptation, we introduce RASS, a runtime-aware MOO solver that allows for low-overhead and effective dynamic adjustment of the execution. The key principle behind RASS’s design is to explicitly consider during the MOO solution stage that adaptation may subsequently be required at deployment time. As such, RASS operates in two steps: (i) It generates a set of alternative execution configurations with diverse tradeoffs prior to deployment and (ii) configures the inference engine with a policy of switching among them.
Toward alleviating the impact of device heterogeneity and resource fluctuation, CARIn operates exclusively at the system level, bypassing the need to produce an optimal model for each target device. Model-level solutions typically include the design, exploration, training, and adaptation of a DNN’s architecture to specific target devices and resource availability changes. These procedures can be cumbersome, time-consuming, and lead to complex pipelines. Instead, our framework employs a repository of pre-trained models with varying architectures and complexities. The singular requisite action in relation to the models entails the application of post-training quantisation (Section 6.1).
The design of our framework was driven by the fact that satisfying SLOs depends not only on the target model but also on the specific target device, especially the processor in use. Consequently, CARIn’s primary objective is to determine, at any given time, the most suitable model-processor pair (or pairs) for a specified device. Internally, our MOO framework expresses this as a DL-based, device-centric problem, to effectively capture both the application’s SLOs and the unique characteristics of the target device. Given the device-specific nature of our MOO formulation, a distinct optimisation problem is formed for each given device, effectively circumventing the challenge posed by device heterogeneity. Additionally, to facilitate real-time adaptation, CARIn leverages the device’s intra-device heterogeneity, specifically the array of available processors, as well as the range of solutions offered by the RASS solver, which allow the adoption of a swift and efficient switching mechanism between execution plans.

3.2 Workflow

Figure 1 depicts CARIn’s operational flow, which is divided into the sequential offline and online phases. The offline component is responsible for constructing and resolving the device-specific MOO problem. Then, at runtime, the online component’s Runtime Manager (RM) constantly monitors the application’s dynamic behaviour, ensuring real-time adaptation to emergent changes. Algorithm 1 presents a comprehensive top-level overview of our framework, delineating its primary components and illustrating their main operations. These operations will be thoroughly elucidated in Section 4. The input parameters of our framework are as follows: (a) the designated DL task(s) associated with the application, (b) the stipulated SLOs, and (c) the target device’s characteristics, while the outputs consist of (a) the set of solutions (designs \(\mathcal {D}\)) and (b) the switching policy (SP).
Fig. 1. High-level workflow of CARIn.
The specified DL task or tasks dictate the set of models to be considered during the optimisation process. A model in CARIn is represented by the following tuple:
\[\begin{gather*} m = (arch, params, s_{\text{in}}, task, ds, pr), \end{gather*}\]
where arch is the model’s architecture (i.e., layers and connections), params are the model’s trained parameters, \(s_{\text{in}}\) is the input size, task is the target DL problem, ds is the name of the corresponding DL testing dataset, and pr is the numerical precision to account for quantised models.
The target device defines the hardware resources at the system’s disposal, which are represented by the tuple:
\[\begin{gather*} hw = (ce, op(ce)), \end{gather*}\]
where \(ce \in\mathcal {CE}\) is the compute engine (i.e., processor) performing the inference computations and \(op(ce)\) is a set of options tied to the given processor, e.g., the number of CPU threads or the GPU’s numerical precision. The tuple of tunable system parameters can be extended to capture a more detailed space, e.g., by including the DVFS governor selection that determines the dynamic voltage and frequency scaling policy of the device [61].
An individual model m running under the selected system parameters hw represents a single execution configuration:
\begin{equation} e = \left\lt m, hw\right\gt \in \mathcal {E}. \tag{1} \end{equation}
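To make the notation concrete, the following Python sketch encodes the m, hw, and e tuples as plain dataclasses; the field values in the example are illustrative and do not correspond to CARIn’s actual implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Model:              # m = (arch, params, s_in, task, ds, pr)
    arch: str             # architecture (layers and connections)
    params: str           # trained parameters, e.g., a checkpoint file
    s_in: tuple           # input size
    task: str             # target DL problem
    ds: str               # testing dataset
    pr: str               # numerical precision (FP32, FP16, INT8, ...)

@dataclass(frozen=True)
class Hardware:           # hw = (ce, op(ce))
    ce: str               # compute engine, e.g., "CPU", "GPU", "NPU"
    op: tuple             # engine-specific options, e.g., number of CPU threads

@dataclass(frozen=True)
class ExecutionConfig:    # e = <m, hw>, Equation (1)
    m: Model
    hw: Hardware

# Illustrative configuration: an 8-bit MobileNet V2 running on a 4-thread CPU.
e = ExecutionConfig(
    Model("MobileNetV2", "mobilenet_v2.tflite", (224, 224, 3),
          "image_classification", "ImageNet-1k", "FFX8"),
    Hardware("CPU", ("num_threads=4",)),
)
```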
During the MOO Problem Formulation stage (lines 1–7), CARIn considers every generated space of execution configurations, \(\mathcal {E}_i\), to form the problem’s decision space, \(\mathcal {X}\), depending on whether the application requires single- or multi-DNN execution. At the same time, the application’s SLOs delineate the MOO problem’s objective functions and constraints, denoted as \(f_i\) and \(g_j\), respectively. Once the problem is formulated for the target device, the Objective Function Evaluation stage evaluates each function for every \(x \in \mathcal {X}\) (line 8). Following this, CARIn’s MOO Problem Solver solves the MOO problem (lines 9–12). The functions CalculateOptimality, Sort, and Search shown in Algorithm 1, which constitute the three stages of the solver, are discussed in detail in Section 4.3.
In order for CARIn to accommodate runtime adaptation, it is important to establish a robust system for perpetually monitoring the dynamic aspects of the executing application and the state of the device itself. This ongoing vigilance enables timely recognition of abrupt alterations in operational conditions, thereby facilitating immediate corrective measures. We call this subsystem the RM. The output of CARIn’s solving algorithm consists of a set \(\mathcal {D}\) of highest-performing solutions, called designs, which are passed to RM along with the appropriate SP. Leveraging a collection of periodically captured statistics s from the Application’s runtime, the RM module has the ability to discern dynamic changes in resource allocation (c in Algorithm 1) and rapidly switch to an alternative design \(d_{\text{new}}\) to effectively and robustly meet the application-level SLOs (lines 13–18).

4 Multi-Objective Optimisation Framework

MOO constitutes a mathematical and computational approach employed to find the best solutions or tradeoffs in scenarios that involve multiple interrelated and, at times, antagonistic objectives [15, 44]. The appropriateness of a MOO framework for our problem is underscored by (a) the inherent nature of DL application SLOs, which typically comprise objectives that exhibit conflicts, and (b) the inherent attribute of MOO to yield a solution space rich in diversity, which, in turn, can enable dynamic adaptation.

4.1 MOO Problem Formulation

For CARIn’s DL-based MOO formulation, we adopt the following mathematical description:
\[\begin{gather*} \begin{aligned}& \text{min/max} && f_i(x), && 1 \le i \le N \\ & \text{subject to} && g_j(x) = g_j(h_j(x)) \le 0, && 1 \le j \le P, \end{aligned} \end{gather*}\]
where x denotes the decision variable, N is the number of objective functions, \(f_i(x)\) is the ith objective function, P is the number of inequality constraints, and \(g_j(x)\) is the jth inequality constraint, which is always a composite function of a given inner function \(h_j(x)\). Note that when there is only a single objective function (\(N\!=\!1\)), then the problem is reduced to single-objective optimisation (SOO). The problem’s objective functions and constraints are extracted from the application’s SLOs, which can be split into two categories:
Broad SLOs: Such objectives define the problem’s objective functions and come in the form of \(\left\lt min/max, p \right\gt\), where p is a DL-related performance metric. For instance, \(\left\lt max, mIoU\right\gt\) means that the mean Intersection-over-Union (mIoU) accuracy metric should be maximised for an image segmentation task. For CARIn, this objective translates to the maximisation of the objective function \(f(x) = A(x) = mIoU(x)\).
Narrow SLOs: These objectives define the problem’s constraints and come in the form of \(\left\lt min/max/avg/std/n{\text{th}}, p, v \right\gt\), which means that the minimum, maximum, average, standard deviation or nth percentile value of p is bounded by a target value v. For instance, \(\left\lt avg, L, 15\right\gt\) means that the average latency needs to be less than 15 ms, which translates to the constraint \(g(x) \le 0\), where \(g(x) = g(h(x)) = g(L(x)) = \overline{L}(x) - 15\).
Given that both types of objectives concern the same set of performance metrics, it follows that both the objective and inner functions, \(f_i(x)\) and \(h_j(x)\), share a common function space, denoted by \(\mathcal {F}\), which encompasses the entirety of available functions associated with various DL performance metrics. For this reason, in cases where the application defines constraints without explicitly specifying objective functions, CARIn can duly regard all specified inner functions \(h_j(x)\) as objective functions as well.
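To illustrate how the two SLO categories map onto \(f_i\) and \(g_j\), the short sketch below encodes the examples above (maximise mIoU; average latency below 15 ms) as Python functions over a configuration’s profiled metrics; the dictionary-based metric lookup is an assumption made purely for this example.

```python
import statistics

# Hypothetical profiled metrics for one execution configuration x:
# per-inference latency samples (ms) and the measured mIoU.
x = {"latency_ms": [12.1, 13.4, 14.0, 12.8], "mIoU": 0.71}

# Broad SLO <max, mIoU>  ->  objective function f(x) = A(x) = mIoU(x).
def f_accuracy(cfg):
    return cfg["mIoU"]

# Narrow SLO <avg, L, 15>  ->  constraint g(x) = mean(L(x)) - 15 <= 0.
def g_latency(cfg):
    return statistics.mean(cfg["latency_ms"]) - 15.0

print(f_accuracy(x))        # 0.71, to be maximised
print(g_latency(x) <= 0)    # True: x satisfies the latency constraint
```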

4.1.1 Single-DNN Setting.

When there is only one DL task to optimise, the decision variable x is a single execution configuration e, as defined in Equation (1). Therefore, the execution configuration space \(\mathcal {E}\) effectively transforms into the decision space \(\mathcal {X}\):
\[\begin{gather*} x_{\text{single}} = e = \left\lt m, hw\right\gt \in \mathcal {X}_{\text{single}} = \mathcal {E}. \end{gather*}\]
For the objective functions, CARIn leverages the following DNN-specific performance metrics:
Size (S): Size is conventionally represented by either the total count of parameters within the neural network or the physical file size of the model stored in memory.
Workload (W): This metric is typically measured in terms of numerical operations, such as floating-point operations (FLOPs) or multiply-accumulate operations.
Accuracy (A): Accuracy is contingent upon the specific DL task in question, e.g., top-1 accuracy for classification tasks or exact match for question answering tasks.
Latency (L): Latency delineates the temporal lag between the transmission of input data to the DNN model and the reception of the corresponding output. It is quantified in units of milliseconds or seconds.
Throughput (TP): Throughput provides an indication of the model’s real-time processing capabilities and is computed as the total number of input samples (batch size), divided by the total inference latency. This metric is denominated in samples per second (e.g., images per second when images constitute the inputs).
Energy Consumption (E): This metric is of paramount importance for the evaluation of the energy efficiency of DNN applications in resource-constrained environments and is measured in energy units, such as watt-hours or joules.
Memory Footprint (MF): Memory footprint encapsulates the extent of random-access memory (RAM) required for the loading and execution of a DNN. It is traditionally assessed in terms of memory size units, such as megabytes (MB) or gigabytes (GB).
Overall, the set of potential objective functions in single-DNN cases is denoted as follows:
\[\begin{gather*} \mathcal {F}_{\text{single}} = \lbrace S, W, A, L, TP, E, MF \rbrace , \end{gather*}\]
collectively empowering a multifaceted assessment of DNN models and providing a holistic understanding of their performance across diverse dimensions.
It is important to recognise that the latency and energy consumption metrics are subject to inherent fluctuations when executing DNNs on mobile devices. These fluctuations can arise due to various factors, including device load, temperature, input values, and other environmental variables (see Section 4.3.2). As a result, relying on a single, instantaneous value may not provide a robust and representative assessment of system performance. To account for these fluctuations, CARIn considers statistical measures, such as the average or maximum energy consumption or the variance of the latency, as objective functions.
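As a minimal illustration of such statistical objectives, the following snippet aggregates hypothetical per-inference measurements into the average, maximum, and variance values that would serve as objective functions.

```python
import statistics

latency_ms = [11.8, 12.3, 15.1, 12.0, 13.7]   # hypothetical per-inference latencies
energy_mj  = [41.0, 43.5, 48.2, 40.9, 44.1]   # hypothetical per-inference energy (mJ)

avg_latency = statistics.mean(latency_ms)      # average-latency objective
var_latency = statistics.variance(latency_ms)  # latency-variance objective
max_energy  = max(energy_mj)                   # worst-case energy objective
```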

4.1.2 Multi-DNN Setting.

When there are M independent DNNs to optimise jointly, the decision variable x comprises M distinct execution configurations \(e_i\), \(1 \le i \le M\). Hence, the decision space \(\mathcal {X}\) is an M-dimensional space, where each component of the decision variable can separately take values in the corresponding execution configuration space \(\mathcal {E}_i\):
\[\begin{gather*} x_{\text{multi}} = \lbrace e_1, \ldots , e_M \rbrace = \lbrace \left\lt m, hw\right\gt _1, \ldots , \left\lt m, hw\right\gt _M \rbrace \in \mathcal {X}_{\text{multi}} = \mathcal {E}_1 \times \cdots \times \mathcal {E}_M. \end{gather*}\]
The array of potential objective functions is expansively broadened to encompass an additional triad of performance metrics pertaining to parallel execution [12, 60]:
Normalised Turnaround Time (NTT): NTT serves as a quantifier of the perceived execution slowdown during multi-DNN execution. The NTT for the ith DNN is computed as follows:
\[\begin{gather*} NTT_i = \frac{L_i^{\text{M}}}{L_i^{\text{S}}}, \end{gather*}\]
where \(L_i^{\text{S}}\) and \(L_i^{\text{M}}\) are the average latencies of the ith DNN under the single- and multi-DNN modes. \(NTT_i\) is a value greater than or equal to 1, with lower values indicating superior performance. For the sake of standardisation across models, it is common practice to calculate the average or maximum NTT.
System Throughput (STP): STP quantifies the accumulated single-DNN progress under multi-DNN execution and is computed as follows:
\[\begin{gather*} STP = \sum _{i=1}^{M} NP_i = \sum _{i=1}^{M} \frac{1}{NTT_i} = \sum _{i=1}^{M} \frac{L_i^{\text{S}}}{L_i^{\text{M}}}, \end{gather*}\]
where \(NP_i\) is the normalised progress of the ith DNN. Its maximum magnitude is M, with higher values signifying enhanced performance.
Fairness (F): The concept of fairness in a multi-DNN execution environment is contingent upon the equitable relative progress experienced by co-executing DNNs, in comparison to their single-DNN execution counterparts. Fairness, as denoted herein, is quantified as the minimum ratio of normalised progress rates observed among any two DNNs concurrently operating within the system:
\[\begin{gather*} F = \min _{i, j} \frac{NP_i}{NP_j}. \end{gather*}\]
This metric adheres to a higher-is-better paradigm with values within the range \([0, 1]\), where 0 signifies an absence of fairness and 1 perfect fairness.
As a consequence, we augment CARIn’s objective function set to encompass both single-DNN metrics, which pertain to individual tasks or DNNs, and multi-DNN metrics, which characterise the collective performance of the entire system during concurrent execution:
\[\begin{gather*} \mathcal {F}_{\text{multi}} = \lbrace S_i, W_i, A_i, L_i, TP_i, E_i, MF_i \rbrace \cup \lbrace STP, NTT, F \rbrace , 1 \le i \le M. \end{gather*}\]
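The sketch below computes NTT, STP, and F directly from the formulas above, using hypothetical single- and multi-DNN average latencies for a two-model workload.

```python
# Average latencies (ms) in isolation (single-DNN) and under co-execution (multi-DNN).
L_single = [20.0, 50.0]     # hypothetical values for DNN 1 and DNN 2
L_multi  = [28.0, 90.0]

ntt = [lm / ls for lm, ls in zip(L_multi, L_single)]   # NTT_i = L_i^M / L_i^S
np_ = [1.0 / t for t in ntt]                           # normalised progress NP_i = 1 / NTT_i
stp = sum(np_)                                         # STP, upper-bounded by M
fairness = min(np_[i] / np_[j]                         # F = min_{i,j} NP_i / NP_j
               for i in range(len(np_)) for j in range(len(np_)))

print(ntt)                                 # [1.4, 1.8]
print(round(stp, 2), round(fairness, 2))   # 1.27 0.78
```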

4.2 Objective Function Evaluation

Upon the formulation of the device-specific MOO problem, it becomes necessary to assess each objective function across the entire set of decision variables \(x \in \mathcal {X}\). Assessing these functions is straightforward for certain objectives; however, it presents challenges for device-dependent functions like E and MF, and those relying on latency, including L, TP, STP, NTT, and F. The approach adopted by CARIn for this evaluation involves the profiling of functions on individual target devices. In practical terms, this entails the deployment of all candidate models on each target device, followed by the measurement of each device-reliant objective function for all feasible model-processor combinations. We acknowledge that this procedure, albeit comprehensive, is inherently time-consuming and, in many instances, such as in multi-DNN cases, infeasible for seamless integration into real-world scenarios and practical applications. However, the optimisation of the evaluation process itself does not constitute a primary objective of this work. We extensively discuss potential enhancements of this aspect in Section 8.

4.3 MOO Problem Solver

Following the formulation of the problem and the evaluation of objective functions, the conclusive stage involves resolving the optimisation problem. The initial step of the optimisation process is to apply the problem’s constraints. Consequently, the decision variables are bound to the constrained decision space \(\mathcal {X}^{\prime }\), defined as:
\[\begin{gather*} \mathcal {X}^{\prime } = \lbrace x\,|\, g_j(x) \le 0, \forall j \rbrace . \end{gather*}\]
MOO problems are frequently addressed using evolutionary algorithms, such as NSGA-II, SMS-EMOA, and MOEAD, or swarm-based algorithms, such as Ant Colony Optimisation and Particle Swarm Optimisation [52]. These algorithms systematically explore the decision variable space to discover the Pareto frontier, which represents the optimal tradeoffs among conflicting objectives. While these algorithms excel in identifying the optimal execution configuration, we acknowledge that potential runtime issues may either alter the solution space, consequently affecting the Pareto frontier of a MOO problem, or introduce new constraints that were not considered during the problem’s formulation. Consequently, to address the potential decline in performance, it becomes imperative to rerun these algorithms whenever a runtime issue arises; however, such repetitive executions are impractical for real-life applications and systems.
To address this challenge, we introduce a runtime-aware sorting and search algorithm, denoted as RASS, whose primary goal is to solve a device-specific MOO problem once, while concurrently addressing potential future runtime challenges. To achieve this, RASS considers both non-dominated and dominated solutions in a predictive manner, estimating the impact of possible runtime issues. In addition to providing the initial solution \(d_0\), RASS also yields a set of supplementary runtime designs \(d_i\), which serve as a proactive measure for runtime adaptation, i.e., in instances where the currently employed design encounters performance issues. This approach alleviates the need for repetitive executions of optimisation algorithms.
The operation of RASS involves a sorting stage followed by a search stage. To accommodate both non-dominated and dominated solutions, our solving algorithm initially sorts candidate solutions according to their optimality (Section 4.3.1), a metric quantifying the distance from the problem’s utopia point. Subsequently, based on this sorting, RASS identifies a set of solutions (Section 4.3.4) representing the various execution plans of the application that correspond to possible runtime issues (Section 4.3.2), along with a switching policy facilitating prompt transitioning between them (Section 4.3.3) for the RM module.

4.3.1 Optimality.

To quantify optimality for a given candidate solution \(x \in \mathcal {X}^{\prime }\), we first calculate the weighted Mahalanobis distance between the solution’s objective vector, which is defined as \(f(x) = [ f_{1}(x), f_{2}(x), \ldots, f_{n}(x) ]\), and the utopia point, represented as \(up = [ up_{1}, up_{2}, \ldots, up_{n} ]\):
\[\begin{gather*} d(x) = \sqrt {\sum _{i=1}^{n} w^2_i \frac{\left[ f_{i}(x) - up_i \right]^2}{s^2_i} }, \end{gather*}\]
where \(w_i\) is the user-supplied weight for the ith objective, \(s^2_i\) is the calculated variance of the ith objective, and each component of the utopia point depends on the corresponding objective function:
\[\begin{gather*} up_{i} = {\left\lbrace \begin{array}{l@{\quad}l} \max f_{i}, & \text{if } f_i \in \lbrace A, TP, STP, F\rbrace \\ \min f_{i}, & \text{if } f_i \in \lbrace S, W, L, E, MF, NTT\rbrace . \end{array}\right.} \end{gather*}\]
By utilising the Mahalanobis distance, we effectively accommodate the disparate scales of the diverse objectives. Consequently, optimality could also be regarded as a metric of fairness for the problem’s objective functions. However, it is important to acknowledge that these functions may carry distinct significance for the problem, and, hence, we afford users the opportunity to define weights, thereby introducing a formal mechanism for enabling tailored optimisation strategies. Notably, the calculated distances range within the interval \([0, d_{\text{max}}]\), where the maximum distance is as follows:
\[\begin{gather*} d_{\text{max}} = \sqrt {\sum _{i=1}^{n} w^2_i \frac{\left(\max f_{i} - \min f_i \right)^2}{s^2_i} }. \end{gather*}\]
This factor necessitates the use of normalisation, which results in the distance being confined to the \([0, 1]\) range:
\[\begin{gather*} d_{\text{s}}(x) = \frac{d(x)}{d_{\text{max}}}. \end{gather*}\]
The optimality metric for each \(x \in \mathcal {X}^{\prime }\) can then formally be defined as the reciprocal of the scaled weighted Mahalanobis distance; thus, its range is \([1, +\infty)\):
\[\begin{gather*} opt(x) = \frac{1}{d_{\text{s}}(x)}. \end{gather*}\]
Utilising these values, the candidate solutions are sorted in descending order, resulting in the creation of the sorted decision space \(\mathcal {X}_{\text{s}}\).
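The following sketch reproduces this sorting step for a toy two-objective problem (maximise accuracy, minimise average latency); the candidate values and the equal weights are illustrative, and the variance-scaled distance follows the formulas above.

```python
import math
import statistics

# Hypothetical candidates remaining after the constraints: (accuracy %, avg latency ms).
candidates = {"cfg_a": (75.2, 12.0), "cfg_b": (78.3, 30.0), "cfg_c": (71.9, 8.0)}
weights = (1.0, 1.0)                                  # user-supplied objective weights

acc = [v[0] for v in candidates.values()]
lat = [v[1] for v in candidates.values()]
var = (statistics.variance(acc), statistics.variance(lat))   # s_i^2 per objective
utopia = (max(acc), min(lat))                         # max for accuracy, min for latency

def distance(obj):   # weighted, variance-scaled distance to the utopia point
    return math.sqrt(sum(w * w * (o - u) ** 2 / s2
                         for w, o, u, s2 in zip(weights, obj, utopia, var)))

d_max = math.sqrt(sum(w * w * (max(vals) - min(vals)) ** 2 / s2
                      for w, vals, s2 in zip(weights, (acc, lat), var)))

optimality = {name: d_max / distance(obj)             # opt(x) = 1 / d_s(x) = d_max / d(x)
              for name, obj in candidates.items()}
X_sorted = sorted(optimality, key=optimality.get, reverse=True)
print(X_sorted)
```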

4.3.2 Runtime Challenges.

During the runtime of the application, a multitude of dynamic alterations in the device’s resource availability may occur. These fluctuations impact our problem formulation in different ways, thus necessitating targeted approaches for their management. CARIn focuses on addressing two main challenges, regarding the processors and memory of the target device:
Processor Overload or Overheating: Processor-related concerns manifest when the processor in use is continuously subjected to sustained processing demands exceeding its peak processing capacity, primarily due to resource-intensive computational tasks. The protracted imposition of such an overload condition may subsequently lead to overheating, i.e., the escalation of the System-on-Chip (SoC) temperature to a critical and potentially harmful level. Overheating may also result from insufficient cooling mechanisms or other impediments hindering the effective dissipation of heat by the SoC. As a protective measure against potential harm, mobile SoCs are equipped with thermal throttling capabilities, which are activated when temperatures exceed predefined thresholds. Thermal throttling encompasses the deliberate reduction of the processor’s clock speed and performance to mitigate heat generation and maintain a safe temperature range. The intricate interplay between processor overload and overheating significantly impacts performance and power consumption, underscoring the significance of diligent management and effective mitigation strategies.
Variability in RAM Utilisation: Owing to the multifaceted nature of mobile devices, the utilisation of RAM is also characterised by dynamic fluctuations. Within the execution scope of an application, numerous ancillary applications, processes, or services continually initiate and terminate in the background, potentially culminating in an unforeseen saturation of RAM capacity. Consequently, this phenomenon may precipitate performance-related challenges, encompassing lag, application crashes, and an overall deceleration of device functionality. Furthermore, the perpetual management of excessive RAM consumption may also entail elevated power consumption, thereby engendering consequential ramifications.

4.3.3 Model/Processor Switching.

In response to runtime fluctuations, CARIn’s RM adopts a strategic approach that involves altering either the model (change model, CM), the processor (change processor, CP), or both (change both, CB) within the current execution plan. These three fundamental adjustments serve as effective measures for mitigating the challenges encountered during runtime. To this end, we introduce a prioritisation scheme. In the case of processor-related phenomena, CARIn prioritises transferring DNN execution from the currently used processors to inactive ones (CP or CB). This transition allows the overloaded or overheated processor to dissipate excess heat and gradually restore its performance. In cases where migration is not a viable option, such as in devices limited solely to CPU usage or multi-DNN scenarios where all processors are occupied, CARIn employs an alternative approach that involves replacing the current model with one of reduced computational workload (CM). Conversely, addressing the memory-related issue involves transitioning to a more compact model either on the same (CM) or a different processor (CB).

4.3.4 Design Selection and Switching Policy.

A primary principle guiding the design of RASS is to ensure low complexity to facilitate rapid switching. This objective manifests in the generation of a relatively small number of designs, which in turn offers two additional distinct benefits: First, minimise storage requirements for the models and, second, maintain a concise switching policy comprising only a limited number of transition rules. For RM to determine the appropriate timing to transition to a new execution plan, several system parameters, related to (a) the workload distribution across processors and (b) the aggregate memory utilisation, need to be continuously monitored. These parameters are represented by the Boolean variables \(c_{ce}\) and \(c_{\text{m}}\), indicating the presence of issues pertaining to a processor ce and the memory, respectively.
The first step toward identifying the solutions to the problem is to determine the sets of different model-to-processor mappings viable for processor switching, i.e., for reallocating DL execution to idle processors. Symbolising the number of these sets as T, in consideration of RASS’s need for simplicity, if \(T\gt 3\), then we retain only the top three sets, corresponding to the highest attained optimality scores. Next, we partition our sorted decision space \(\mathcal {X}_s\) into T distinct subspaces \(\mathcal {X}_i\), each corresponding to specific model-to-processor mappings and arranged in descending order of observed optimality. Regarding processor-related phenomena, we select designs associated with the highest optimality score within each set:
\[\begin{gather*} d_i = \mathcal {X}_i[0], i=0,\ldots ,T-1. \end{gather*}\]
For the memory-related issue, we extract the solution with the smallest memory footprint:
\[\begin{gather*} d_{\text{m}} = \operatorname*{argmin}_{\substack{x}} MF(x), \, \, x \in \mathcal {X}_i, \, \, i=0,\ldots ,T-1. \end{gather*}\]
Last, we extract complementary designs for two extreme (highly improbable) scenarios. The first one arises when all related processors present an issue, while the memory does not, prompting the extraction of the solution with the lightest workload:
\[\begin{gather*} d_{\text{w}} = \operatorname*{argmin}_{\substack{x}} W(x), \, \, x \in \mathcal {X}_i, \, \, i=0,\ldots ,T-1, \end{gather*}\]
and the second scenario surfaces when both the processors and memory encounter issues simultaneously, necessitating the identification of the solution that strikes the optimal balance between memory usage and workload among \(d_m\) and \(d_w\):
\[\begin{gather*} d_{\text{wm}} = {\left\lbrace \begin{array}{l@{\quad}l} d_{\text{w}}, & \text{if } C\left(MF(d_{\text{w}}), W(d_{\text{w}})\right) \lt C\left(MF(d_{\text{m}}), W(d_{\text{m}})\right) \\ d_{\text{m}}, & \text{else}, \end{array}\right.} \end{gather*}\]
where we use the normalised sum to compute the cost function C. Collectively, the set of designs is denoted as:
\[\begin{gather*} \mathcal {D} = \lbrace d_i, d_{\text{m}}, d_{\text{w}} \rbrace , i=0, \ldots , T-1, \end{gather*}\]
and therefore RASS can generate a maximum of five designs for a MOO problem, since \(T\le 3\).
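Under simplified assumptions (candidates already sorted by optimality, each tagged with its model-to-processor mapping and its profiled memory footprint and workload), the sketch below mirrors this selection procedure, using a min-max-normalised sum as the cost function C; the candidate values are made up.

```python
# Hypothetical candidates in descending order of optimality, each tagged with its
# model-to-processor mapping and its profiled memory footprint (MB) and workload (GFLOPs).
X_sorted = [
    {"name": "x1", "mapping": "GPU", "mf": 160.0, "w": 4.1},
    {"name": "x2", "mapping": "CPU", "mf":  95.0, "w": 1.2},
    {"name": "x3", "mapping": "GPU", "mf":  55.0, "w": 0.8},
    {"name": "x4", "mapping": "CPU", "mf":  60.0, "w": 0.6},
]

mappings = list(dict.fromkeys(x["mapping"] for x in X_sorted))[:3]   # keep the top T <= 3 sets
pool = [x for x in X_sorted if x["mapping"] in mappings]

# d_i: the highest-optimality candidate within each mapping set.
designs = [next(x for x in X_sorted if x["mapping"] == m) for m in mappings]

d_m = min(pool, key=lambda x: x["mf"])     # smallest memory footprint
d_w = min(pool, key=lambda x: x["w"])      # lightest workload

def norm(v, lo, hi):                       # min-max normalisation for the cost function C
    return (v - lo) / (hi - lo) if hi > lo else 0.0

mf_lo, mf_hi = min(x["mf"] for x in pool), max(x["mf"] for x in pool)
w_lo, w_hi = min(x["w"] for x in pool), max(x["w"] for x in pool)
cost = lambda x: norm(x["mf"], mf_lo, mf_hi) + norm(x["w"], w_lo, w_hi)
d_wm = d_w if cost(d_w) < cost(d_m) else d_m

print([d["name"] for d in designs], d_m["name"], d_w["name"], d_wm["name"])
# ['x1', 'x2'] x3 x4 x4
```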
After establishing the set of designs, RASS’s final step involves crafting the rule-based switching policy, which serves as a reference for the RM module, guiding its decision-making process each time the Boolean variables \(c_{ce}\) or \(c_{\text{m}}\) undergo a change in value. With the aim of ensuring simplicity and conciseness in the rule set, we ensure that the selection of a new design is contingent solely upon the state of the environmental variables and independent of the presently employed design. The rationale behind the construction of the rules is deliberately straightforward, as demonstrated in Section 7.2, where two representative use cases are presented and analysed.
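For illustration only (the actual rules for the evaluated use cases are presented in Section 7.2), the sketch below shows the kind of flag-to-design lookup the RM performs: the new design depends solely on the current values of the Boolean issue variables, never on the design currently in use. The flag combinations and design identifiers are hypothetical.

```python
# Illustrative switching policy for a device exposing a CPU and a GPU.
# Keys are (c_cpu, c_gpu, c_mem) issue flags; values name the design to switch to,
# following the d_i / d_m / d_w / d_wm notation of Section 4.3.4.
SWITCHING_POLICY = {
    (False, False, False): "d0",    # no issue: initial, highest-optimality design
    (True,  False, False): "d0",    # CPU issue: stay on the (e.g., GPU-mapped) design d0
    (False, True,  False): "d1",    # GPU issue: switch to the alternative mapping d1
    (False, False, True):  "d_m",   # memory pressure: smallest-footprint design
    (True,  True,  False): "d_w",   # all processors affected: lightest-workload design
    (True,  True,  True):  "d_wm",  # processors and memory: memory/workload compromise
}

def runtime_manager_step(c_cpu: bool, c_gpu: bool, c_mem: bool) -> str:
    """Return the design the RM should switch to for the current flag values."""
    return SWITCHING_POLICY.get((c_cpu, c_gpu, c_mem), "d_wm")

print(runtime_manager_step(c_cpu=False, c_gpu=True, c_mem=False))   # -> d1
```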

5 Implementation

CARIn is implemented in Java for the Android operating system. Its primary integration leverages the TensorFlow Lite (TFLite) package in its nightly build to facilitate on-device DNN execution, as well as its delegates to access mobile accelerators. The concurrent execution of multiple DNNs is achieved through the utilisation of the java.util.concurrent Java package.
To create the model suite used for our framework’s evaluation, i.e., for model retrieval, training, and preparation, TensorFlow (v2.12.0) was employed in Python. Our framework seamlessly interfaces with TensorFlow Hub and Hugging Face model repositories, which allows researchers to easily access and experiment with a wide range of pre-trained models on publicly available datasets for their specific use cases. Additionally, the TFLite Converter’s optimisation module was utilised to apply post-training quantisation to the models to enhance inference speed, efficiency and accelerator compatibility.
Regarding the objective function evaluation process, a diverse set of tools and libraries is employed. For accuracy assessments, we use custom evaluation scripts, as well as TFLite’s image classification evaluation tool for the ImageNet ILSVRC 2012 task. Furthermore, to comprehensively capture the model’s computational complexity and resource requirements, we use the tflite Python package to count model parameters and FLOPs. Last, to assess the on-device performance of the models, we employ the C++-based TFLite benchmark tool. This tool offers a comprehensive suite of measurements, encompassing execution time, memory utilisation, and other pertinent metrics, thereby providing a robust evaluation of the models’ real-world performance characteristics.

6 Experimental Methodology

In this section, we present the experimental methodology that underpins our research, offering insight into the comprehensive approach we have undertaken to investigate our study’s core objectives. We have structured this methodology into several key subsections, each addressing a crucial aspect of our experimental design. Figure 2 depicts the toolflow used to conduct our experiments.
Fig. 2. Toolflow for the evaluation of CARIn.

6.1 Quantisation

CARIn embraces post-training quantisation as one of the simplest and most mobile-friendly compression methods presently available, with benefits not only in model size but also in latency and memory requirements. Additionally, quantisation becomes indispensable for the execution of DNNs within Digital Signal Processors (DSPs) or NPUs designed to primarily support integer models [22], thus unlocking complete compatibility with mobile accelerators. Notably, additional methods that also introduce tradeoffs between accuracy and complexity, such as weight pruning or clustering, are orthogonal to our framework and amenable to integration. The potential synergy resulting from the combined application of various compression techniques merits further investigation.
Driven by the capabilities of the TFLite Converter, CARIn currently incorporates four distinct quantisation techniques, namely half-precision floating-point (FP16), 8-bit dynamic range (DR8), 8-bit fixed-point with float fallback (FX8) and full 8-bit fixed-point (FFX8). Table 1 enumerates the numerical types associated with inputs, outputs, weights, and activations for both the 32-bit floating-point (FP32) and the quantised models. It is important to note that the data type of the weights defines the storage requirements of the model. Specifically, FP16 quantisation leads to a 2× reduction in model size, while the remaining schemes (DR8, FX8, and FFX8) yield a 4× reduction in size. The operational procedures of these quantisation schemes are elucidated as follows:
Table 1. Quantisation Schemes

Scheme | Inputs & Outputs  | Weights | Activations
FP32   | fp32/int32/int64  | fp32    | fp32
FP16   | fp32/int32/int64  | fp16    | fp16/fp32
DR8    | fp32/int32/int64  | int8    | fp32
FX8    | fp32/int32/int64  | int8    | int8/fp32
FFX8   | int8/int32        | int8    | int8
FP16, by default, employs 16-bit floating-point computations, yet it possesses the flexibility to revert (fall back) to FP32 calculations when the hardware lacks support for 16-bit arithmetic. In such instances, the weights undergo a dequantisation process to 32-bit before the first inference. Concurrently, the activations are stored in 32-bit format. The most common processors with native support for FP16 operations are mobile GPUs.
In the case of DR8, weights are represented with 8 bits, while activations persistently remain in FP32. Nevertheless, certain activations may undergo dynamic quantisation during inference, utilising quantised kernels for faster execution. The utilisation of fixed-point arithmetic, whenever feasible, may result in reduced computation times compared to relying solely on floating-point arithmetic, contingent on the specific model’s characteristics.
FX8, analogous to FP16, represents an 8-bit equivalent and operates with integer kernels as the default mode of execution. However, it retains the ability to utilise 32-bit operators when integer implementations are unavailable on the given hardware (floating-point fallback). Importantly, in this scheme, the converted model maintains inputs and outputs in floating-point format, allowing the model itself to determine the quantisation parameters to minimise accuracy loss.
FFX8 enforces full integer quantization for all components of the model, encompassing weights, activations, operations, inputs, and outputs. This stringent quantisation scheme guarantees compatibility with integer-only devices and accelerators, such as microcontrollers, DSPs, and NPUs.
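As a sketch of how the four schemes can be produced with the TFLite Converter (mirroring the descriptions above), the helper below applies the corresponding converter settings; the saved-model path and the representative dataset are placeholders.

```python
import tensorflow as tf

def convert(saved_model_dir, scheme, representative_dataset=None):
    """Post-training quantisation with the TFLite Converter (illustrative helper)."""
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    if scheme == "FP16":        # fp16 weights, fp32 fallback where unsupported
        converter.target_spec.supported_types = [tf.float16]
    elif scheme == "DR8":       # 8-bit weights, dynamically quantised activations
        pass                    # Optimize.DEFAULT alone yields dynamic-range quantisation
    elif scheme == "FX8":       # int8 kernels with float fallback and float I/O
        converter.representative_dataset = representative_dataset
    elif scheme == "FFX8":      # full int8, integer-only operations and I/O
        converter.representative_dataset = representative_dataset
        converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
        converter.inference_input_type = tf.int8
        converter.inference_output_type = tf.int8
    return converter.convert()

# Hypothetical calibration generator yielding batches of preprocessed inputs.
def rep_data():
    for _ in range(100):
        yield [tf.random.uniform((1, 224, 224, 3))]

tflite_model = convert("saved_model/", "FFX8", rep_data)
```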

6.2 Application Scenarios, Models, and Tasks

In the following part, we outline four discrete application use cases, which form the bedrock of our experimental evaluation. These application scenarios represent diverse real-world settings in which our research findings will be tested and validated, providing valuable insights into the effectiveness of our proposed formulation and methodology.
Notably, the first two scenarios pertain to the execution of a single DNN, while the latter two involve the execution of multiple DNNs in parallel, affording us the opportunity to assess performance outcomes in instances where dependencies among multiple DNNs exist. This dichotomy is instrumental in affording a comprehensive evaluation of our methodology’s versatility, applicability, and scalability. We define specific SLOs for each use case and showcase the list of models to be considered during evaluation, along with their device-independent evaluation, which includes (a) the accuracy of both the original and quantised variants, (b) the computational workload in FLOPs, and (c) the model size in terms of parameters.

6.2.1 Use Case #1 (UC1).

In our first single-DNN scenario, we examine the practical application of real-time image classification. In this setting, the camera of a mobile device continuously captures frames that require prompt and accurate recognition. The term “real time” is qualified by a temporal restriction mandating that the maximum permissible latency is 41.67 ms, underscoring the necessity to uphold a recognition rate of no less than 24 frames per second. The principal objectives of this use case encompass the joint maximisation of accuracy and throughput. Mathematically, this MOO problem comprises two objective functions and a single constraint as follows:
\[\begin{gather*} \begin{aligned}&& \max &&& A(x), TP(x) \\ && \text{subject to} &&& \text{max} \, L(x) \le 41.67\ \text{ms.} \end{aligned} \end{gather*}\]
For UC1 we used the ImageNet-1k dataset [47]. Table 2 lists the eight models under consideration, which are drawn from four distinct families: MobileNets [48], EfficientNets [57], RegNets [46], and MobileViTs [39]. The rationale behind this extensive model selection is to ensure a well-rounded exploration of compact and mobile-friendly architectures that span a broad spectrum, encompassing both conventional CNNs and emerging Transformer-based models. Each of these architectural paradigms exhibits unique characteristics and design principles.
Table 2. UC1 Models (Image Classification on ImageNet-1k)

Architecture       | Image Size | FLOPs  | #Params | Top-1 Accuracy (%): FP32 / FP16 / DR8 / FX8 / FFX8
MobileNet V2 1.0   | 224 × 224  | 0.60 G | 3.49 M  | 71.92 / 71.96 / 71.65 / 71.28 / 71.26
RegNetY 008        | 224 × 224  | 1.60 G | 6.25 M  | 74.28 / 74.28 / 74.18 / 74.45 / 74.47
MobileViT XS       | 256 × 256  | 2.10 G | 2.31 M  | 74.61 / 74.61 / – / – / –
EfficientNet Lite0 | 224 × 224  | 0.77 G | 4.63 M  | 75.19 / 75.23 / 75.14 / 75.09 / 75.11
MobileNet V2 1.4   | 224 × 224  | 1.16 G | 6.09 M  | 75.66 / 75.68 / 75.47 / 75.41 / 75.45
RegNetY 016        | 224 × 224  | 3.23 G | 11.18 M | 76.76 / 76.76 / 76.62 / 76.92 / 76.84
MobileViT S        | 256 × 256  | 4.06 G | 5.57 M  | 78.31 / 78.30 / – / – / –
EfficientNet Lite4 | 300 × 300  | 5.11 G | 12.95 M | 80.81 / 80.80 / 80.78 / 80.69 / 80.71
It is worth noting that we also contemplated the inclusion of higher-accuracy models, such as NASNet and ConvNeXt, in our analysis. However, our assessment revealed that these models failed to meet the stipulated latency constraint. Consequently, they were excluded from our study to maintain adherence to the predefined performance criteria.

6.2.2 Use Case #2 (UC2).

In our second single-DNN scenario, we study the task of text classification, with a particular emphasis on the memory requirements of the models. To this end, we impose a memory constraint, stipulating that the executing DNN’s maximum memory footprint must not exceed 90 MB. The objectives of this use case revolve around three critical factors: minimising the average latency, reducing the model size, and maximising accuracy. Mathematically, this MOO problem encompasses three objective functions and a singular constraint:
\[\begin{gather*} \begin{aligned}&& \min &&& \overline{L}(x), S(x) \\ && \max &&& A(x) \\ && \text{subject to} &&& MF(x) \le 90\ \text{MB}. \end{aligned} \end{gather*}\]
For UC2, we obtained three Transformer models pre-trained on various large datasets, including Reddit comments and S2ORC citation pairs, and subsequently fine-tuned them on Emotions [50], a dataset comprising English Twitter messages that is employed for the task of classifying input sequences into six distinct emotions. We adopted the dataset’s split configuration, which allocated 16k samples for training, 2k for validation, and 2k for testing. The reported top-1 accuracy corresponds to the dataset’s test set. The selected models, detailed in Table 3, encompass the traditional BERT architecture in a lightweight version, alongside two mobile-grade models: XtremeDistil [41] and MobileBERT [55]. The letter “L” in each model’s name stands for the number of Transformer layers and “H” stands for the hidden dimension. In preparation for training, we further optimised BERT and XtremeDistil to enhance mobile-friendliness by replacing the GELU activation function with ReLU and substituting Layer Normalisation with Batch Normalisation [42].

6.2.3 Use Case #3 (UC3).

In our first multi-DNN scenario, we employ two DNNs for the purpose of scene recognition. One DNN processes and classifies images, while the other processes audio data to identify sounds from the device’s surroundings. The two models run concurrently, and their outputs are jointly used to determine the specific scene within which the mobile device is situated.
In this scenario, we seek to minimise both the average latency and its standard deviation, while simultaneously maximising the attained accuracy. We impose two latency constraints on each task, mandating that (a) the average latency remains consistently below 100 ms to ensure near-real-time responsiveness and (b) the standard deviation of latency stays below 10 ms for minimal fluctuations. The inclusion of the latency’s standard deviation aims to minimise performance variability, which remains an open challenge for on-device inference [66]. Mathematically, this MOO problem is formulated as follows:
\[\begin{gather*} \begin{aligned}&& \min &&& \overline{L}_i(x), \sigma _{L_{i}}(x) \\ && \max &&& A_i(x), & i=1,2 \\ && \text{subject to} &&& \overline{L}_i(x) \le 100\ \text{ms}, \sigma _{L_{i}}(x) \le 10\ \text{ms.} \end{aligned} \end{gather*}\]
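In the multi-DNN setting the constraints apply to every model of the joint design. The sketch below illustrates how a joint design could be screened for feasibility before its objectives are compared; the latency samples and helper names are illustrative assumptions rather than CARIn's implementation.

```python
import statistics

def model_stats(latencies_ms):
    """Summarise one model's profiled latencies (mean and standard deviation)."""
    return statistics.mean(latencies_ms), statistics.stdev(latencies_ms)

def feasible_joint_design(latencies_per_model, avg_budget_ms=100.0, std_budget_ms=10.0):
    """A joint design is feasible only if *every* model meets both latency constraints."""
    for lat in latencies_per_model:
        avg, std = model_stats(lat)
        if avg > avg_budget_ms or std > std_budget_ms:
            return False
    return True

# Illustrative profiled latencies (ms) for the vision and audio models of one joint design.
vision_lat = [62.0, 64.1, 61.5, 63.8, 62.7]
audio_lat  = [88.3, 90.2, 87.9, 89.5, 91.0]
print(feasible_joint_design([vision_lat, audio_lat]))  # True under the assumed numbers
```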
Table 4 presents the models for each task. For the vision task, we fine-tuned three EfficientNet Lite models on the MIT Indoor Scenes dataset [45], which includes 67 classes and 100 images per class (80 for training and 20 for testing). We report the top-1 accuracy on the test set. For the audio task, we use YAMNet, which is trained on the AudioSet dataset [14] for multi-label classification. The dataset consists of 521 sound events (classes) and 18k samples. We report the mean average precision on the validation set. YAMNet’s input waveform can vary in length. In our experiments, we use the model’s minimum possible length of 975 ms, which corresponds to 15,600 input samples and a total workload of 0.14 GFLOPs.
DL Task: Text Classification on Emotions. The accuracy columns report Top-1 Accuracy (%) per model variant.
| Architecture         | Sequence Length | FLOPs  | #Params | FP32  | FP16  | DR8   | FX8   | FFX8  |
| BERT-L2-H128         | 64              | 0.05 G | 4.31 M  | 92.10 | 92.10 | 91.90 | 91.75 | 91.75 |
| XtremeDistil-L6-H256 | 64              | 0.63 G | 12.57 M | 93.30 | 93.30 | 93.20 | 93.15 | 93.20 |
| MobileBERT-L24-H512  | 64              | 2.66 G | 24.33 M | 93.80 | 93.80 | 93.80 | 93.65 | 94.10 |
Table 3. UC2 Models
Accuracy per model variant: Top-1 Accuracy (%) for scene classification, mean average precision for audio classification.
Scene Classification on MIT Indoor Scenes:
| Architecture       | Input Size | FLOPs  | #Params | FP32   | FP16   | DR8    | FX8   | FFX8  |
| EfficientNet Lite0 | 224 × 224  | 0.59 G | 3.44 M  | 69.78  | 69.70  | 68.96  | 69.18 | 69.18 |
| EfficientNet Lite2 | 260 × 260  | 1.51 G | 4.87 M  | 76.72  | 76.72  | 77.16  | 77.69 | 77.54 |
| EfficientNet Lite4 | 300 × 300  | 4.57 G | 11.76 M | 79.33  | 79.33  | 79.18  | 79.78 | 79.48 |
Audio Classification on AudioSet:
| YAMNet             | 15,600     | 0.14 G | 3.75 M  | 0.3756 | 0.3757 | 0.3620 | –     | –     |
Table 4. UC3 Models

6.2.4 Use Case #4 (UC4).

In our second multi-DNN scenario, we deploy three distinct models designed for facial attribute prediction tasks, namely gender, age, and ethnicity estimation. These models are conceptualised as the second stage of a face detection and attribute prediction pipeline, wherein they operate concurrently on the same set of input images. As such, it is imperative for these models to adhere to stringent latency constraints to ensure minimal impact on the overall pipeline. UC4’s objectives revolve around the collective optimisation of five key metrics for each model, specifically average latency, standard deviation of latency, size, memory footprint, and accuracy, all while adhering to a maximum latency threshold of 10 ms. Formally:
\[\begin{gather*} \begin{aligned}&& \min &&& \overline{L}_i(x), \sigma _{L_{i}}(x), S_i(x), MF_i(x) \\ && \max &&& A_i(x), & i=1,2,3 \\ && \text{subject to} &&& \text{max} \, {L}_i(x) \le 10\ \text{ms.} \end{aligned} \end{gather*}\]
In UC4, the training data are sourced from the UTKFace dataset [81]. To ensure relevance to real-time applications, the dataset is filtered to retain samples corresponding to the age range of 18–75. Consequently, the utilised dataset comprises 18.6k facial images, partitioned into training, validation, and testing sets with a ratio of 72/8/20, respectively. The employed models leverage MobileNetV2 as the backbone architecture, extracting 576 features of size 4×4, which are used for predicting the outcomes across the three distinct facial attribute prediction tasks. Notably, UC4 stands as the sole task within our study that incorporates batching during inference. Specifically, the models are configured with a batch size of 4, a choice motivated by the common case where the preceding face detection component identifies multiple faces within a single image. Table 5 details the attained accuracy metrics for each task on the filtered dataset’s test set: binary accuracy for gender recognition, mean absolute error for age recognition, and top-1 accuracy for ethnicity recognition across 5 output classes.
DL Task: Facial Attribute Prediction on UTKFace. Accuracy per model variant: binary accuracy (%) for gender, mean absolute error for age, Top-1 accuracy (%) for ethnicity.
| Architecture   | Image Size | FLOPs  | #Params | FP32  | FP16  | DR8   | FX8   | FFX8  |
| GenderNet-MNV2 | 62 × 62    | 0.04 G | 0.66 M  | 95.12 | 94.95 | 94.90 | 94.79 | 94.90 |
| AgeNet-MNV2    | 62 × 62    | 0.04 G | 0.66 M  | 5.976 | 5.974 | 5.964 | 5.947 | 5.923 |
| EthniNet-MNV2  | 62 × 62    | 0.04 G | 0.66 M  | 78.17 | 78.04 | 78.55 | 79.30 | 79.14 |
Table 5. UC4 Models

6.3 Mobile Devices

In our study, we have selected three smartphones for our evaluation: Google Pixel 7, Samsung Galaxy S20 FE, and Samsung Galaxy A71. These devices have been deliberately chosen to represent distinct categories within the modern mobile phone landscape. A71 serves as an archetype of a mid-tier device, while S20 and P7 exemplify the high-end category, showcasing state-of-the-art features and cutting-edge technology. A detailed overview of the specifications and processing capabilities of these smartphones is shown in Table 6.
| Device | Google Pixel 7           | Samsung Galaxy S20 FE    | Samsung Galaxy A71           |
| Launch | 2022, October            | 2020, October            | 2020, January                |
| SoC    | Tensor G2                | Exynos 990               | Snapdragon 730               |
| CPU    | 2 × 2.85 GHz Cortex-X1,  | 2 × 2.73 GHz Exynos M5,  | 2 × 2.20 GHz Kryo 470 Gold,  |
|        | 2 × 2.35 GHz Cortex-A76, | 2 × 2.50 GHz Cortex-A76, | 6 × 1.80 GHz Kryo 470 Silver |
|        | 4 × 1.80 GHz Cortex-A55  | 4 × 2.00 GHz Cortex-A55  |                              |
| GPU    | Mali-G710 MP7 @850 MHz   | Mali-G77 MP11 @800 MHz   | Adreno 618 @700 MHz          |
| NPU    | Tensor Processing Unit   | \(\checkmark\)           | Hexagon Tensor Accelerator   |
| RAM    | 8 GB @3200 MHz           | 6 GB @2750 MHz           | 6 GB @1866 MHz               |
| TDP    | 7 W                      | 9 W                      | 5 W                          |
Table 6. Target Devices
Each of the three devices is equipped with its own NPU. Concretely, P7 incorporates a custom mobile-oriented TPU; S20 features the EDEN API, which grants access to the Exynos NPU for fixed-point models and specialised GPU kernels for floating-point models; and, last, A71 hosts the Hexagon Tensor Accelerator, a dedicated compute engine for fixed-point CNNs. Additionally, among these three devices, only A71 offers access to the device’s DSP for DNN inference. This results in the following compute engine sets for each device:
\begin{equation*} \mathcal {CE}_{\text{P7}} = \mathcal {CE}_{\text{S20}} = \lbrace \text{CPU}, \text{GPU}, \text{NPU}\rbrace \end{equation*}
\begin{equation*} \mathcal {CE}_{\text{A71}} = \lbrace \text{CPU}, \text{GPU}, \text{NPU}, \text{DSP} \rbrace . \end{equation*}

6.4 Profiling Details

In this section, we present the available configuration options for each compute engine, \(op(ce)\), within the context of an execution plan’s tunable hardware parameters, which are employed by CARIn during the profiling phase of the device-specific objective functions. In the case of CPUs, we have the capability to tune the number of threads employed for multithreading and utilise the XNNPACK delegate, which serves as a back-end for the CPU, leveraging the XNNPACK library to provide highly optimised implementations for 32- and 16-bit floating-point computations, as well as symmetrically quantised DNN operations. Since all the devices under consideration are equipped with eight CPU cores, the set of tunable options can be defined as follows:
\begin{equation*} op(\text{CPU}) = \lbrace N_{\text{threads}}, \text{XNNPACK} \rbrace , \end{equation*}
where \(N_{\text{threads}} = \lbrace 1, 2, 4, 8\rbrace\) and \(\text{XNNPACK} = \lbrace \text{TRUE}, \text{FALSE}\rbrace\), resulting in eight distinct CPU execution combinations. However, for GPUs and NPUs, CARIn exclusively employs fp16 arithmetic when feasible, as it offers reduced latency without compromising accuracy:
\begin{equation*} op(\text{GPU}) = op(\text{NPU}) = \lbrace \text{precision = fp16} \rbrace . \end{equation*}
Last, it should be noted that the DSP does not expose any configurable parameters, and thus its set of options can be defined as an empty set:
\begin{equation*} op(\text{DSP}) = \lbrace \rbrace . \end{equation*}
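Because the tunable space is small, it can be enumerated exhaustively. The following sketch expands the per-engine options above into the concrete execution configurations that would be profiled on each device; the dictionary encoding is an assumed representation for illustration, not CARIn's API.

```python
from itertools import product

# Tunable options per compute engine, mirroring the sets defined above.
OPTIONS = {
    "CPU": [{"threads": n, "xnnpack": x} for n, x in product([1, 2, 4, 8], [True, False])],
    "GPU": [{"precision": "fp16"}],
    "NPU": [{"precision": "fp16"}],
    "DSP": [{}],  # no configurable parameters
}

# Compute engines available on each target device, as listed above.
ENGINES = {
    "P7":  ["CPU", "GPU", "NPU"],
    "S20": ["CPU", "GPU", "NPU"],
    "A71": ["CPU", "GPU", "NPU", "DSP"],
}

def execution_configs(device):
    """Enumerate all (engine, options) pairs that would be profiled on a device."""
    for ce in ENGINES[device]:
        for opt in OPTIONS[ce]:
            yield ce, opt

print(sum(1 for _ in execution_configs("A71")))  # 8 CPU + 1 GPU + 1 NPU + 1 DSP = 11
```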
In terms of the profiling process, we initiate each execution configuration with five warm-up runs to stabilise the target processor’s performance and reduce variability. Subsequently, to gather statistically significant latency and energy consumption values, we execute each experiment 100 times. Last, to maintain consistent device temperatures and mitigate the risk of overheating, we incorporate a device idle period of 2 minutes prior to commencing the next set of runs.
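Expressed as code, the measurement protocol could take the following form, where run_inference is a placeholder callable that executes one inference under the configuration being profiled; energy measurement is omitted for brevity, and the function is a sketch of the protocol rather than CARIn's profiler.

```python
import time

def profile_latency(run_inference, warmup=5, runs=100, cooldown_s=120):
    """Measure per-inference latency following the protocol described above."""
    for _ in range(warmup):          # stabilise the target processor
        run_inference()
    latencies_ms = []
    for _ in range(runs):            # gather a statistically significant sample
        t0 = time.perf_counter()
        run_inference()
        latencies_ms.append((time.perf_counter() - t0) * 1e3)
    time.sleep(cooldown_s)           # idle period before the next configuration
    return latencies_ms
```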

7 Results

This section presents the outcomes of our comprehensive evaluation of CARIn. Our findings provide valuable insights into the effectiveness of our framework in mitigating the challenges stemming from device heterogeneity and runtime fluctuations across both single- and multi-DNN scenarios, while concurrently meeting predetermined SLOs.

7.1 Designs

Our initial assessment focuses on evaluating the performance of CARIn’s designs within each available state, i.e., a single processor for single-DNN applications or a combination of processors for multi-DNN applications.

7.1.1 Comparison Methods.

To comprehensively evaluate CARIn’s performance against existing methodologies, we employ three simple empirical baselines and additionally compare against our earlier work, OODIn. The baselines reflect common deployment practices and serve to establish a minimum performance expectation for real-world applications.
Single-architecture baseline: The effectiveness of CARIn is contrasted with the traditional approach of adopting a single model architecture, possibly accompanied by its quantised variants. This paradigm typically selects the model that excels under a single criterion, such as highest accuracy, lowest memory footprint, or smallest size.
Transferred baseline: To assess the extent to which CARIn addresses device heterogeneity, we utilise the transferred baseline, where the MOO problem is solved on a specific device, and the resultant designs are then applied to different devices. This baseline, being device-agnostic, overlooks the inherent characteristics and limitations of individual devices.
Multi-DNN-unaware baseline: The third baseline assesses the efficacy of our framework in handling concurrent model executions, particularly its capability to generate optimal model-to-processor mappings for multi-DNN workloads. The multi-DNN-unaware baseline decomposes a multi-DNN MOO problem into M independent single-DNN problems, solves each one separately, and then combines the resulting solutions.
OODIn [61]: In our prior research, we utilised the weighted sum method as a means to address MOO problems. More precisely, OODIn aims to maximise the weighted sum obtained from the normalised objective functions. This approach fails to account for the inherent scale discrepancies among the diverse objective functions, particularly evident in DL metrics. While the utilisation of assigned weights may potentially mitigate this limitation, it necessitates prior knowledge of the statistical characteristics of the functions involved. When dealing with multi-DNN configurations, OODIn would operate as the multi-DNN-unaware baseline presented above, differing only in its utilisation of the weighted sum method instead of computing optimalities.
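To illustrate the scale-discrepancy issue in isolation, the toy sketch below scalarises two objectives with equal weights and deliberately omits any normalisation; it is not OODIn's actual formulation, which normalises the objectives first, but it shows how a metric with a larger numeric range can dominate the ranking unless the weights encode prior knowledge of each metric's statistics.

```python
def weighted_sum(objectives, weights):
    """Scalarise a design: score = sum_i w_i * f_i(x)."""
    return sum(w * f for w, f in zip(weights, objectives))

# Illustrative values: accuracy (%) to maximise, latency (ms) to minimise (negated).
design_a = (75.1, -30.0)   # more accurate but slower
design_b = (71.9, -9.5)    # less accurate but faster
equal_w = (0.5, 0.5)
print(weighted_sum(design_a, equal_w), weighted_sum(design_b, equal_w))
# 22.55 vs 31.2: latency's larger numeric range decides the ranking
# despite the 3.2-point accuracy gap.
```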

7.1.2 Single-DNN Execution.

Figures 3 and 4 delineate the benefits of CARIn in relation to the optimality metric for the two single-DNN use cases. We compare against two single-architecture baselines, specifically using the model with the highest accuracy (best accuracy, B-A) and the model with the smallest size (best size, B-S), the transferred baselines from the other two devices, collectively designated as \(T_{A71}\), \(T_{S20}\), and \(T_{P7}\), and OODIn. The initial designs \(d_0\) for each device are prominently indicated, affirming the presence of device heterogeneity. Patterned bars in the figures highlight instances where certain baselines fail to yield a solution due to non-compliance with the problem’s constraints (denoted by !) or inapplicability to different devices (denoted by N/A).
Fig. 3. UC1 evaluation.
Fig. 4. UC2 evaluation.
Takeaways: Our framework achieves a substantial improvement, with an average gain of 1.19× and 1.57× (up to 1.46× and 1.92×) over the B-A and B-S baselines, respectively. It is noteworthy that these baselines, primarily designed for SOO problems, prove inadequate in capturing the multi-objective nature inherent in DL applications. Regarding the transferred baselines, CARIn achieves an average improvement of 1.17× in optimality (up to 1.84×). Importantly, it not only enhances overall optimality but also exhibits improvements across all considered objective functions. Specifically, for UC1, we observe an average increase of 0.156 units in accuracy and a 32.7% boost in throughput, and for UC2, observable improvements include an average reduction of 2.8 MB in model size and a notable 19.9% latency speedup at the same accuracy level. Compared to OODIn, an optimality increase of 1.5× is achieved on average (up to 1.99×).

7.1.3 Multi-DNN Execution.

Figures 5 and 6 show the benefits of the CARIn framework concerning the optimality metric in the context of the two multi-DNN use cases. In these scenarios, we compare against the multi-DNN-unaware baseline, the transferred baselines from other devices, and OODIn. The horizontal axis illustrates combinations of processors for each device. In the case of UC3, all possible combinations are presented, while for UC4, due to the considerable number of combinations, we organise and display them based on optimality, showcasing the top 5 for each device.
Fig. 5. UC3 evaluation.
Fig. 6. UC4 evaluation.
Takeaways: In the context of UC3, CARIn delivers a significant average optimality improvement of 1.47× across devices (up to 3.24×) over the multi-DNN-unaware baseline and an even more substantial gain of 1.87× (up to 4.06×) over the transferred baselines. Notably, these enhancements extend across all specified objectives. Compared to OODIn, we observe a 2.83× improvement in optimality (up to 10.69×). Meanwhile, UC4 poses a distinctive challenge, where the majority of baselines struggle to produce a viable solution, primarily due to their inability to satisfy the stringent latency constraints inherent in this use case, underscoring its intricacy. Since only a single model is employed per task in UC4, the baselines that do not fail reach performance parity with our framework, which emphasises the importance of accommodating a diverse pool of models for each task.

7.2 Runtime Adaptation

In this section, we assess the responsiveness of the RM and its adept utilisation of designs generated by RASS to dynamically adapt to a series of runtime fluctuations. For our evaluation, we target the UC1 single-DNN scenario on S20 and the UC3 multi-DNN scenario on A71. Through this experiment, we aim to demonstrate the efficacy of the RM module to seamlessly respond to dynamic runtime conditions, thereby validating its role in enhancing the adaptability and performance of CARIn across diverse use cases and devices.

7.2.1 Single-DNN Execution.

Table 7 presents the selected designs and switching policy, while Figure 7 depicts the behaviour of RM in the single-DNN scenario. The initial design for UC1 on S20, \(d_0\), involves the utilisation of EfficientNet Lite0 FFX8 on the CPU with four threads and the enabled XNNPACK library, resulting in 75.11% accuracy and a 16-MB memory footprint. As the CPU gradually becomes overloaded, the throughput experiences a decline until RM identifies an alternative design as the current highest-performing solution. The new configuration, \(d_1\), entails the use of EfficientNet Lite0 FP16 on the GPU. Following further inferences, RM triggers another switch due to an impending memory issue. In this instance, RASS has identified the memory-efficient design, \(d_{\text{m}}\), to involve the device’s CPU.
| \(\boldsymbol {c_{\text{CPU}}}\) | \(\boldsymbol {c_{\text{GPU}}}\) | \(\boldsymbol {c_{\text{NPU}}}\) | \(\boldsymbol {c_{\text{m}}}\) | \(\boldsymbol {d_{\text{new}}}\) |
| F | – | – | F | \(d_0 = \left\lt \text{EfficientNet Lite0 FFX8}, \text{CPU}_{4,\text{T}} \right\gt\) |
| T | F | – | F | \(d_1 = \left\lt \text{EfficientNet Lite0 FP16}, \text{GPU} \right\gt\) |
| T | T | F | F | \(d_2 = \left\lt \text{MobileNet V2 1.4 FP16}, \text{NPU} \right\gt\) |
| T | T | T | F | \(d_{\text{w}} = \left\lt \text{MobileNet V2 1.0 FX8}, \text{CPU}_{4,\text{T}} \right\gt\) |
| T | T | T | T | \(d_{\text{wm}} \equiv d_{\text{w}}\) |
| – | – | – | T | \(d_{\text{m}} = \left\lt \text{EfficientNet Lite0 FX8}, \text{CPU}_{8,\text{F}} \right\gt\) |
Table 7. Selected Designs and Switching Policy for the Single-DNN UC1 Scenario on S20 (“–” denotes a condition that does not participate in the corresponding rule)
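Because the designs and the switching policy are precomputed, the Runtime Manager's decision reduces to a lookup over the condition flags of Table 7. The sketch below is one possible encoding of that lookup, assuming the ordering of checks shown and reading CPU_{4,T} as “4 threads, XNNPACK enabled”; it is illustrative rather than CARIn's exact implementation.

```python
# Predetermined designs produced by RASS for UC1 on S20 (names only; see Table 7).
DESIGNS = {
    "d0": "EfficientNet Lite0 FFX8 on CPU (4 threads, XNNPACK on)",
    "d1": "EfficientNet Lite0 FP16 on GPU",
    "d2": "MobileNet V2 1.4 FP16 on NPU",
    "dw": "MobileNet V2 1.0 FX8 on CPU (4 threads, XNNPACK on)",
    "dm": "EfficientNet Lite0 FX8 on CPU (8 threads, XNNPACK off)",
}

def select_design(cpu_loaded, gpu_loaded, npu_loaded, mem_issue):
    """Runtime switch as a pure lookup over the condition flags (assumed check order)."""
    if mem_issue and not (cpu_loaded and gpu_loaded and npu_loaded):
        return DESIGNS["dm"]      # memory-efficient design
    if not cpu_loaded:
        return DESIGNS["d0"]      # highest-performing design
    if not gpu_loaded:
        return DESIGNS["d1"]
    if not npu_loaded:
        return DESIGNS["d2"]
    return DESIGNS["dw"]          # worst-case design (d_wm coincides with d_w)

print(select_design(cpu_loaded=True, gpu_loaded=False, npu_loaded=False, mem_issue=False))
```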
Fig. 7. CARIn’s runtime behaviour targeting the single-DNN UC1 scenario on S20.
Takeaways: It is worth highlighting that despite modifications in the execution plan, our framework consistently upholds accuracy levels, even when employing the memory-efficient design. This steadfast commitment to preserving user Quality of Experience (QoE) underscores CARIn’s resilience in the face of dynamic alterations.

7.2.2 Multi-DNN Execution.

Table 8 and Figure 8 correspond to the multi-DNN scenario. In the context of UC3, where two models with distinct workloads are employed, CARIn recognises the heavier workload associated with the second task and acknowledges that this specific task is primarily responsible for triggering the switching mechanisms. The figure illustrates the average latency, standard deviation of latency, and accuracy for the second task, as well as the combined memory footprint of both models. UC3 involves the processing of audio data, introducing the potential use of the device’s DSP for data capture and processing. Given the likelihood of DSP overload during DNN inference, suppose that the highest-performing GPU-based design, \(d_1\), is currently employed with EfficientNet Lite2 FX8. However, due to the impending threat of a memory issue arising from this design’s memory footprint, RM opts to switch to the memory-efficient design, \(d_{\text{m}}\), resulting in a saving of 92 MB of RAM. Subsequently, as RM observes a reduction in DSP overload, it triggers a switch to the highest-performing design, \(d_0\), characterised by lower latency and reduced memory requirements. In the event of a potential DSP overload resurgence, RM strategically avoids reverting to the GPU-based design to mitigate previous concerns of excessive memory usage. Instead, it selects the next design in line, transferring the second model to the CPU while maintaining accuracy levels.
| \(\boldsymbol {c_{\text{DSP}}}\) | \(\boldsymbol {c_{\text{GPU}}}\) | \(\boldsymbol {c_{\text{CPU}}}\) | \(\boldsymbol {c_{\text{m}}}\) | \(\boldsymbol {d_{\text{new}}}\) |
| F | – | – | F | \(d_0 = \lbrace \left\lt \text{YAMNet FP16}, \text{CPU}_{2,\text{F}} \right\gt , \left\lt \text{EfficientNet Lite2 FFX8}, \text{DSP} \right\gt \rbrace\) |
| T | F | – | F | \(d_1 = \lbrace \left\lt \text{YAMNet FP16}, \text{CPU}_{2,\text{F}} \right\gt , \left\lt \text{EfficientNet Lite2 FX8}, \text{GPU} \right\gt \rbrace\) |
| T | T | F | F | \(d_2 = \lbrace \left\lt \text{YAMNet FP16}, \text{CPU}_{4,\text{F}} \right\gt , \left\lt \text{EfficientNet Lite2 FFX8}, \text{CPU}_{1,\text{T}} \right\gt \rbrace\) |
| T | T | T | F | \(d_{\text{w}} = \lbrace \left\lt \text{YAMNet DR8}, \text{CPU}_{2,\text{F}} \right\gt , \left\lt \text{EfficientNet Lite0 FFX8}, \text{CPU}_{4,\text{F}} \right\gt \rbrace\) |
| T | T | T | T | \(d_{\text{wm}} \equiv d_{\text{w}}\) |
| – | – | – | T | \(d_{\text{m}} \equiv d_{\text{w}}\) |
Table 8. Selected Designs and Switching Policy for the Multi-DNN UC3 Scenario on A71 (“–” denotes a condition that does not participate in the corresponding rule)
Fig. 8. CARIn’s runtime behaviour targeting the multi-DNN UC3 scenario on A71.
Takeaways: It is important to acknowledge that CARIn may not always maintain predefined metric levels. As demonstrated in this instance, transitioning to the memory-efficient design resulted in an 8.5% decrease in accuracy and an increase in jitter. However, such occurrences are considered temporary states of urgency, with a firm expectation that they will be swiftly rectified, thereby minimising impact on user QoE. Notably, the rise in average latency or the standard deviation of latency does not significantly affect user QoE, as these metrics already meet the specified latency constraints, which precede the optimisation of the objectives.

7.2.3 Comparison with OODIn.

In our previous work, we introduced the model/processor switching technique to mitigate runtime fluctuations. However, OODIn lacks the ability to anticipate forthcoming changes in resource availability: upon detecting such an event, the MOO problem must be readjusted to the new conditions and solved again to determine the new highest-performing solution. CARIn instead solves the specified MOO problem once, prior to application initialisation, so switching to a new execution plan at runtime is instantaneous and relies on the predetermined designs and switching policy. Table 9 presents the average and maximum observed solution times of OODIn across diverse applications and devices. The solution time primarily hinges on the number of objectives and the dimensionality of the decision space \(\mathcal {X}\), which in turn depends on the number of DL tasks, the models utilised per task, the compression techniques, and the adjustable system parameters. Given that the time required for the TFLite interpreter to load on the CPU is typically around 3–4 ms, it becomes evident that re-solving the MOO problem can become a bottleneck for the application, impacting the user’s QoE.
| Decision Space Dimension | A71 Average | A71 Maximum | S20 Average | S20 Maximum | P7 Average | P7 Maximum |
| 500   | 1.45  | 2.12  | 0.55  | 1.55  | 3.64  | 7.99  |
| 2000  | 2.80  | 5.94  | 1.70  | 3.04  | 4.94  | 9.38  |
| 5000  | 6.56  | 10.46 | 4.98  | 15.97 | 7.06  | 10.09 |
| 10000 | 12.14 | 16.07 | 11.09 | 34.25 | 10.41 | 13.38 |
Table 9. OODIn’s Solving Time in Milliseconds
Since this solving time is incurred whenever a runtime issue occurs, it can become a recurring bottleneck that impedes the seamless execution of a DL application.
Aside from the time overhead incurred by repeated problem solving, OODIn also requires constant access to the entire array of considered models, necessitating their storage on the user’s device, which can impose limitations on the assortment of models and compression techniques initially considered. Our framework obviates the necessity to store all model variants, requiring only those selected by RASS. Table 10 elucidates this contrast in terms of model storage requirements for every examined use case.
| Use Case | A71 CARIn | A71 OODIn | A71 Reduction | S20 CARIn | S20 OODIn | S20 Reduction | P7 CARIn | P7 OODIn | P7 Reduction |
| UC1 | 13.83 | 276.36 | 19.98× | 34.37 | 443.10 | 12.89× | 34.19 | 443.10 | 12.96× |
| UC2 | 48.64 | 311.45 | 6.40×  | 40.98 | 311.45 | 7.60×  | 52.96 | 311.45 | 5.88×  |
| UC3 | 25.74 | 205.22 | 7.97×  | 58.70 | 205.22 | 3.50×  | 52.81 | 205.22 | 3.89×  |
| UC4 | 2.65  | 6.56   | 2.48×  | 3.95  | 6.56   | 1.66×  | 3.95  | 6.56   | 1.66×  |
Table 10. Storage Requirements of CARIn and OODIn in MB

8 Limitations and Future Directions

In spite of the challenges mitigated by CARIn, our system exhibits limitations that impede its performance when deployed in practical scenarios. First, as mentioned in Section 4.2, the computation of device-dependent metrics associated with objective functions or constraints across all candidate solutions is unsuitable for realistic mobile applications due to its substantial time requirements and the necessity of deploying entire models onto target devices, particularly within expansive decision spaces. Within the broader landscape of related studies, numerous works have harnessed performance prediction methodologies to estimate such metrics when executing DNNs on specific hardware platforms, without resorting to direct measurements [9, 19, 25, 79]. These models consider a range of inputs, encompassing (a) architectural characteristics of the DNN model such as network topology, layer configurations, and overall complexity; (b) hardware specifications including compute architecture, memory hierarchy, interconnectivity, and support for parallelism; and (c) environmental parameters like batch size, input data characteristics, runtime conditions, and temperature/power conditions. Such approaches are orthogonal to our framework and can be integrated within CARIn to provide a more expedient alternative to exhaustive profiling. In the future, the exploration of such methods is envisioned to furnish a comprehensive assessment of our framework’s performance and suitability for real-world scenarios.
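As an illustration of how such a predictor could slot into CARIn in place of exhaustive profiling, the sketch below exposes a minimal interface that maps model and hardware descriptors to an estimated latency. The linear form and all descriptor names and coefficients are assumptions for illustration only; real predictors, such as kernel-level latency models, are considerably richer.

```python
from typing import Mapping

def predict_latency_ms(model_features: Mapping[str, float],
                       hw_features: Mapping[str, float],
                       coef: Mapping[str, float],
                       bias: float = 0.0) -> float:
    """Deliberately simple stand-in for a learned latency predictor: descriptors in,
    estimated latency out, so it can replace direct measurement during design search."""
    feats = {**model_features, **hw_features}
    return bias + sum(coef.get(k, 0.0) * v for k, v in feats.items())

# Illustrative descriptors and coefficients (not fitted to any real device).
est = predict_latency_ms(
    model_features={"gflops": 0.77, "params_m": 4.63},
    hw_features={"cpu_ghz": 2.85, "threads": 4},
    coef={"gflops": 30.0, "params_m": 0.5, "cpu_ghz": -2.0, "threads": -1.0},
    bias=20.0,
)
print(f"{est:.1f} ms")
```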
An additional limitation arises from the selection of models for evaluation. In the contemporary landscape of generative AI [5, 59], the inclusion of generative models, such as autoregressive language models, becomes paramount. These models, characterised by their ability to generate outputs sequentially based on previously generated tokens, impose heightened demands [32, 76], particularly within the context of mobile environments [33]. Therefore, it is imperative to account for such intricacies when assessing the efficacy of AI frameworks intended for deployment in resource-constrained settings.

9 Conclusion

This research underscores the paramount significance of optimising the on-device execution of DNNs to meet the evolving demands of artificial intelligence applications. Building upon the foundational work of Reference [61], the presented framework, CARIn, aims to spearhead progress in this direction. While the challenges of device heterogeneity, runtime adaptation, and multi-DNN execution persist, CARIn provides a novel and comprehensive solution toward alleviating them. The integration of an expressive MOO framework and the introduction of RASS as a runtime-aware MOO solver manage to enable efficient adaptation to dynamic conditions while adhering to user-specified SLOs. RASS stands out for its ability to foresee upcoming runtime issues and generate a set of configurations that enable rapid, low-overhead adjustments in response to environmental fluctuations.

Footnotes

1. Processor here refers also to the exact configuration of a given processor, e.g., threads in a CPU or precision in a GPU.

References

[1]
Mario Almeida, Stefanos Laskaridis, Ilias Leontiadis, Stylianos I. Venieris, and Nicholas D. Lane. 2019. EmBench: Quantifying performance variations of deep neural networks across modern commodity devices. In Proceedings of the 3rd International Workshop on Deep Learning for Mobile Systems and Applications (EMDL’19). ACM, 1--6.
[2]
Mario Almeida, Stefanos Laskaridis, Abhinav Mehrotra, Lukasz Dudziak, Ilias Leontiadis, and Nicholas D. Lane. 2021. Smart at what cost? Characterising mobile deep neural networks in the wild. In Proceedings of the 21st Internet Measurement Conference (IMC). ACM, 658–672.
[3]
Maxim Berman, Leonid Pishchulin, Ning Xu, Matthew B. Blaschko, and Gérard G. Medioni. 2020. AOWS: Adaptive and optimal network width search with latency constraints. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). Computer Vision Foundation/IEEE, 11214–11223.
[4]
Halima Bouzidi, Mohanad Odema, Hamza Ouarnoughi, Mohammad Abdullah Al Faruque, and Smaïl Niar. 2023. HADAS: Hardware-aware dynamic neural architecture search for edge performance scaling. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE’23). IEEE, 1–6.
[5]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020 (NeurIPS'20). Curran Associates, Inc., 1877–1901.
[6]
Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. 2020. Once-for-all: Train one network and specialize it for efficient deployment. In Proceedings of the 8th International Conference on Learning Representations (ICLR’20). OpenReview.net.
[7]
Bohong Chen, Mingbao Lin, Rongrong Ji, and Liujuan Cao. 2023. Prioritized subnet sampling for resource-adaptive supernet training. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 9 (2023), 11108–11119.
[8]
Bart Cox, Jeroen Galjaard, Amirmasoud Ghiassi, Robert Birke, and Lydia Y Chen. 2021. Masa: Responsive multi-DNN inference on the edge. In Proceedings of the IEEE International Conference on Pervasive Computing and Communications (PerCom’21). IEEE, 1--10.
[9]
Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan, Yunqing Hu, Yiming Wu, Yangqing Jia, Peter Vajda, Matt Uyttendaele, and Niraj K. Jha. 2019. ChamNet: Towards efficient network design through platform-aware model adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’19). Computer Vision Foundation/IEEE, 11398–11407.
[10]
Piotr Dollár, Mannat Singh, and Ross B. Girshick. 2021. Fast and accurate model scaling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’21). Computer Vision Foundation/IEEE, 924–932.
[11]
Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2019. Efficient multi-objective neural architecture search via lamarckian evolution. In Proceedings of the 7th International Conference on Learning Representations (ICLR’19). OpenReview.net.
[12]
Stijn Eyerman and Lieven Eeckhout. 2008. System-level performance metrics for multiprogram workloads. IEEE Micro 28, 3 (2008), 42–53.
[13]
Hongxiang Fan, Stylianos I. Venieris, Alexandros Kouris, and Nicholas D. Lane. 2023. Sparse-DySta: Sparsity-aware dynamic and static scheduling for sparse multi-DNN workloads. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’23). ACM, 353--366.
[14]
Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio set: An ontology and human-labeled dataset for audio events. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’17). IEEE, 776–780.
[15]
Nyoman Gunantara. 2018. A review of multi-objective optimization: Methods and its applications. Cogent Engineering 5, 1 (2018), 1502242.
[16]
Junpeng Guo, Shengqing Xia, and Chunyi Peng. 2023. OPA: One-predict-all for efficient deployment. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM’23). IEEE, 1–10.
[17]
Peizhen Guo, Bo Hu, and Wenjun Hu. 2021. Mistify: Automating DNN model porting for on-device inference at the edge. In Proceedings of the 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI’21), James Mickens and Renata Teixeira (Eds.). USENIX Association, 705–719.
[18]
Hai Victor Habi, Roy H. Jennings, and Arnon Netzer. 2020. HMQ: Hardware friendly mixed precision quantization block for CNNs. In Proceedings of the 16th European Conference on Computer Vision (ECCV’20), Part XXVI, Lecture Notes in Computer Science, Vol. 12371, Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer, 448–463.
[19]
Myeonggyun Han and Woongki Baek. 2021. HERTI: A reinforcement learning-augmented system for efficient real-time inference on heterogeneous embedded systems. In Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques (PACT’21), Jaejin Lee and Albert Cohen (Eds.). IEEE, 90–102.
[20]
Rui Han, Qinglong Zhang, Chi Harold Liu, Guoren Wang, Jian Tang, and Lydia Y. Chen. 2021. LegoDNN: Block-grained scaling of deep neural networks for mobile vision. In Proceedings of the 27th Annual International Conference on Mobile Computing and Networking (MobiCom’21). ACM, 406–419.
[21]
Xiaoxi He, Xu Wang, Zimu Zhou, Jiahang Wu, Zheng Yang, and Lothar Thiele. 2023. On-device deep multi-task inference via multi-task zipping. IEEE Transactions on Mobile Computing 22, 5 (2023), 2878–2891.
[22]
Andrey Ignatov et al. 2019. AI benchmark: All about deep learning on smartphones in 2019. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW’19). IEEE, 3617--3635.
[23]
Md Shahriar Iqbal, Jianhai Su, Lars Kotthoff, and Pooyan Jamshidi. 2023. FlexiBO: A decoupled cost-aware multi-objective optimization approach for deep neural networks. Journal of Artificial Intelligence Research 77 (2023), 645–682.
[24]
Joo Seong Jeong, Jingyu Lee, Donghyun Kim, Changmin Jeon, Changjin Jeong, Youngki Lee, and Byung-Gon Chun. 2022. Band: Coordinated multi-DNN inference on heterogeneous mobile processors. In Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services (MobiSys’22), Nirupama Bulusu, Ehsan Aryafar, Aruna Balasubramanian, and Junehwa Song (Eds.). ACM, 235–247.
[25]
Fucheng Jia, Deyu Zhang, Ting Cao, Shiqi Jiang, Yunxin Liu, Ju Ren, and Yaoxue Zhang. 2022. CoDL: Efficient CPU-GPU co-execution for deep learning inference on mobile devices. In Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services (MobiSys’22), Nirupama Bulusu, Ehsan Aryafar, Aruna Balasubramanian, and Junehwa Song (Eds.). ACM, 209–221.
[26]
Andreas Karatzas and Iraklis Anagnostopoulos. 2023. OmniBoost: Boosting throughput of heterogeneous embedded devices under multi-DNN workload. In Proceedings of the 60th ACM/IEEE Design Automation Conference (DAC'23). IEEE, 1--6.
[27]
Jangryul Kim and Soonhoi Ha. 2023. Energy-aware scenario-based mapping of deep learning applications onto heterogeneous processors under real-time constraints. IEEE Transactions on Computers 72, 6 (2023), 1666–1680.
[28]
Youngsok Kim, Joonsung Kim, Dongju Chae, Daehyun Kim, and Jangwoo Kim. 2019. \(\mu\)layer: Low latency on-device inference using cooperative single-layer acceleration and processor-friendly quantization. In Proceedings of the 14th EuroSys Conference (EuroSys’19). 45:1--45:15.
[29]
Alexandros Kouris, Stylianos I. Venieris, Stefanos Laskaridis, and Nicholas D. Lane. 2023. Fluid batching: Exit-aware preemptive serving of early-exit neural networks on edge NPUs. In Proceedings of the International Conference on Computer-Aided Design (ICCAD’23).
[30]
Achintya Kundu, Laura Wynter, Rhui Dih Lee, and Luis Angel D. Bathen. 2023. Transfer-once-for-all: AI model optimization for edge. In Proceedings of the IEEE International Conference on Edge Computing and Communications (EDGE’23), Claudio A. Ardagna, Feras M. Awaysheh, Hongyi Bian, Carl K. Chang, Rong N. Chang, Flávia Coimbra Delicato, Nirmit Desai, Jing Fan, Geoffrey C. Fox, Andrzej Goscinski, Zhi Jin, Anna Kobusinska, and Omer F. Rana (Eds.). IEEE, 26–35.
[31]
Basar Kütükçü, Sabur Baidya, Anand Raghunathan, and Sujit Dey. 2022. Contention grading and adaptive model selection for machine vision in embedded systems. ACM Transactions on Embedded Computing Systems 21, 5 (2022), 55:1–55:29.
[32]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP’23). ACM, 611–626.
[33]
Stefanos Laskaridis, Kleomenis Kateveas, Lorenzo Minto, and Hamed Haddadi. 2024. MELTing point: Mobile evaluation of language transformers. Retrieved from https://arxiv.org/abs/2403.12844
[34]
Stefanos Laskaridis, Stylianos I. Venieris, Hyeji Kim, and Nicholas D. Lane. 2020. HAPI: Hardware-aware progressive inference. In Proceedings of the International Conference on Computer-Aided Design (ICCAD’20). ACM, 91:1--91:9.
[35]
Jaeseong Lee, Jungsub Rhim, Duseok Kang, and Soonhoi Ha. 2022. SNAS: Fast hardware-aware neural architecture search methodology. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, 11 (2022), 4826–4836.
[36]
Seulki Lee and Shahriar Nirjon. 2020. Fast and scalable in-memory deep multitask learning via neural weight virtualization. In Proceedings of the 18th International Conference on Mobile Systems, Applications, and Services (MobiSys’20). ACM, 175--190.
[37]
Neiwen Ling, Kai Wang, Yuze He, Guoliang Xing, and Daqi Xie. 2021. RT-mDL: Supporting real-time mixed deep learning tasks on edge platforms. In Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems (SenSys’21), Jorge Sá Silva, Fernando Boavida, André Rodrigues, Andrew Markham, and Rong Zheng (Eds.). ACM, 1–14.
[38]
Sicong Liu, Bin Guo, Ke Ma, Zhiwen Yu, and Junzhao Du. 2021. AdaSpring: Context-adaptive and runtime-evolutionary deep model compression for mobile applications. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 5, 1 (2021), 24:1–24:22.
[39]
Sachin Mehta and Mohammad Rastegari. 2022. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. In Proceedings of the 10th International Conference on Learning Representations (ICLR’22). OpenReview.net.
[40]
Santiago Miret, Vui Seng Chua, Mattias Marder, Mariano Phiellip, Nilesh Jain, and Somdeb Majumdar. 2022. Neuroevolution-enhanced multi-objective optimization for mixed-precision quantization. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO’22), Jonathan E. Fieldsend and Markus Wagner (Eds.). ACM, 1057–1065.
[41]
Subhabrata Mukherjee, Ahmed Hassan Awadallah, and Jianfeng Gao. 2021. XtremeDistilTransformers: Task transfer for task-agnostic distillation. arXiv:2106.04563 [cs.CL]. Retrieved from https://arxiv.org/abs/2106.04563
[42]
Ioannis Panopoulos, Sokratis Nikolaidis, Stylianos I. Venieris, and Iakovos S. Venieris. 2023. Exploring the performance and efficiency of transformer models for NLP on mobile devices. In Proceedings of the IEEE Symposium on Computers and Communications (ISCC’23). IEEE, 1--4.
[43]
Hishan Parry, Lei Xun, Amin Sabet, Jia Bi, Jonathon S. Hare, and Geoff V. Merrett. 2021. Dynamic transformer for efficient machine translation on embedded devices. In Proceedings of the 3rd ACM/IEEE Workshop on Machine Learning for CAD (MLCAD’21). IEEE, 1–6.
[44]
João Luiz Junho Pereira, Guilherme Antônio Oliver, Matheus Brendon Francisco, Sebastião Simões Cunha, and Guilherme Ferreira Gomes. 2022. A review of multi-objective optimization: Methods and algorithms in mechanical engineering problems. Archives of Computational Methods in Engineering 29, 4 (2022), 2285–2308.
[45]
Ariadna Quattoni and Antonio Torralba. 2009. Recognizing indoor scenes. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’09). IEEE Computer Society, 413–420.
[46]
Ilija Radosavovic, Raj Prateek Kosaraju, Ross B. Girshick, Kaiming He, and Piotr Dollár. 2020. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). Computer Vision Foundation / IEEE, 10425–10433.
[47]
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
[48]
Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18). Computer Vision Foundation/IEEE Computer Society, 4510–4520.
[49]
Victor Sanh, Thomas Wolf, and Sebastian Ruder. 2019. A hierarchical multi-task approach for learning embeddings from semantic tasks. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI’19), the 31st Innovative Applications of Artificial Intelligence Conference (IAAI’19), the 9th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI’19). AAAI Press, 6949–6956.
[50]
Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. 2018. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 3687–3697.
[51]
Wonik Seo, Sanghoon Cha, Yeonjae Kim, Jaehyuk Huh, and Jongse Park. 2021. SLO-aware inference scheduler for heterogeneous processors in edge platforms. ACM Transactions on Architecture and Code Optimization 18, 4 (2021), 43:1–43:26.
[52]
Shubhkirti Sharma and Vijay Kumar. 2022. A comprehensive review on multi-objective optimization techniques: Past, present and future. Archives of Computational Methods in Engineering 29, 7 (2022), 5605–5633.
[53]
Yechao She, Minming Li, Yang Jin, Meng Xu, Jianping Wang, and Bin Liu. 2023. On-demand edge inference scheduling with accuracy and deadline guarantee. In Proceedings of the 31st IEEE/ACM International Symposium on Quality of Service (IWQoS’23). IEEE, 1–10.
[54]
Amit Kumar Singh, Somdip Dey, Klaus McDonald-Maier, Karunakar Reddy Basireddy, Geoff V. Merrett, and Bashir M. Al-Hashimi. 2020. Dynamic energy and thermal management of multi-core mobile platforms: A survey. IEEE Design & Test 37, 5 (2020), 25–33.
[55]
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: A compact task-agnostic BERT for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20), Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (Eds.). Association for Computational Linguistics, 2158–2170.
[56]
Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. 2019. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’19). Computer Vision Foundation/IEEE, 2820–2828.
[57]
Mingxing Tan and Quoc V. Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML’19), Proceedings of Machine Learning Research, Vol. 97, Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, 6105–6114. http://proceedings.mlr.press/v97/tan19a.html
[58]
Mingxing Tan and Quoc V. Le. 2021. EfficientNetV2: Smaller models and faster training. In Proceedings of the 38th International Conference on Machine Learning (ICML'21), Proceedings Machine Learning Research, Marina Meila and Tong Zhang (Eds.). Vol. 139, PMLR, 10096–10106. http://proceedings.mlr.press/v139/tan21a.html
[59]
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv:2302.13971. Retrieved from https://arxiv.org/abs/2302.13971
[60]
Stylianos I. Venieris, Christos-Savvas Bouganis, and Nicholas D. Lane. 2023. Multiple-deep neural network accelerators for next-generation artificial intelligence systems. Computer 56, 3 (2023), 70–79.
[61]
Stylianos I. Venieris, Ioannis Panopoulos, and Iakovos S. Venieris. 2021. OODIn: An optimised on-device inference framework for heterogeneous mobile devices. In Proceedings of the IEEE International Conference on Smart Computing (SMARTCOMP’21). IEEE, 1--8.
[62]
Siqi Wang, Gayathri Ananthanarayanan, Yifan Zeng, Neeraj Goel, Anuj Pathania, and Tulika Mitra. 2019. High-throughput CNN inference on embedded ARM big.LITTLE multi-core processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39, 10 (2019), 2254–2267.
[63]
Zhehui Wang, Tao Luo, Miqing Li, Joey Tianyi Zhou, Rick Siow Mong Goh, and Liangli Zhen. 2021. Evolutionary multi-objective model compression for deep neural networks. IEEE Computational Intelligence Magazine 16, 3 (2021), 10–21.
[64]
Hao Wen, Yuanchun Li, Zunshuai Zhang, Shiqi Jiang, Xiaozhou Ye, Ye Ouyang, Ya-Qin Zhang, and Yunxin Liu. 2023. AdaptiveNet: Post-deployment neural architecture adaptation for diverse edge environments. In Proceedings of the 29th Annual International Conference on Mobile Computing and Networking (MobiCom'23). ACM, 28:1--28:17.
[65]
Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. 2019. FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’19). Computer Vision Foundation/IEEE, 10734–10742.
[66]
Carole-Jean Wu et al. 2019. Machine learning at facebook: Understanding inference at the edge. In Proceedings of the 25th IEEE International Symposium on High Performance Computer Architecture (HPCA’19). IEEE, 331–344.
[67]
Xiaofeng Wu, Jia Rao, Wei Chen, Hang Huang, Chris H. Q. Ding, and Heng Huang. 2021. SwitchFlow: Preemptive multitasking for deep learning. In Proceedings of the 22nd International Middleware Conference (Middleware’21), Kaiwen Zhang, Abdelouahed Gherbi, Nalini Venkatasubramanian, and Luís Veiga (Eds.). ACM, 146–158.
[68]
Jiyang Xie, Xiu Su, Shan You, Zhanyu Ma, Fei Wang, and Chen Qian. 2022. ScaleNet: Searching for the model to scale. In Proceedings of the 17th European Conference on Computer Vision (ECCV’22), Part XXI, Lecture Notes in Computer Science, Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (Eds.). Vol. 13681, Springer, 104–120.
[69]
Mengwei Xu et al. 2019. A first look at deep learning apps on smartphones. In Proceedings of the Conference on the World Wide Web (WWW’19). IEEE, 2125--2136.
[70]
Zhiyuan Xu, Dejun Yang, Chengxiang Yin, Jian Tang, Yanzhi Wang, and Guoliang Xue. 2023. A co-scheduling framework for DNN models on mobile and edge devices with heterogeneous hardware. IEEE Transactions on Mobile Computing 22, 3 (2023), 1275–1288.
[71]
Qizheng Yang, Tianyi Yang, Mingcan Xiang, Lijun Zhang, Haoliang Wang, Marco Serafini, and Hui Guan. 2024. GMorph: Accelerating multi-DNN inference via model fusion. In Proceedings of the 19th Conference on Computer Systems (EuroSys’24). ACM, 505--523.
[72]
Taojiannan Yang, Sijie Zhu, Chen Chen, Shen Yan, Mi Zhang, and Andrew R. Willis. 2020. MutualNet: Adaptive ConvNet via mutual learning from network width and resolution. In Proceedings of the 16th European Conference on Computer Vision (ECCV’20), Part I, Lecture Notes in Computer Science, Vol. 12346, Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer, 299–315.
[73]
Juheon Yi and Youngki Lee. 2020. Heimdall: Mobile GPU coordination platform for augmented reality applications. In Proceedings of the Annual International Conference on Mobile Computing and Networking (MobiCom’20). ACM, 35:1--35:14.
[74]
Fuxun Yu, Shawn Bray, Di Wang, Longfei Shangguan, Xulong Tang, Chenchen Liu, and Xiang Chen. 2021. Automated runtime-aware scheduling for multi-tenant DNN inference on GPU. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design (ICCAD’21). IEEE, 1--9.
[75]
Fuxun Yu, Di Wang, Longfei Shangguan, Minjia Zhang, Chenchen Liu, and Xiang Chen. 2022. A survey of multi-tenant deep learning inference on GPU. MLSys'22 Workshop on Cloud Intelligence/AIOps. arXiv:2203.09040. Retrieved from https://arxiv.org/abs/2203.09040
[76]
Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. ORCA: A distributed serving system for transformer-based generative models. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI’22). USENIX Association, 521–538.
[77]
Jiahui Yu and Thomas S. Huang. 2019. Universally slimmable networks and improved training techniques. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’19). IEEE, 1803–1811.
[78]
Mu Yuan, Lan Zhang, Zimu Zheng, Yi-Nan Zhang, and Xiang-Yang Li. 2023. MLink: Linking black-box models from multiple domains for collaborative inference. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 10 (2023), 12085–12097.
[79]
Li Lyna Zhang, Shihao Han, Jianyu Wei, Ningxin Zheng, Ting Cao, Yuqing Yang, and Yunxin Liu. 2021. nn-Meter: Towards accurate latency prediction of deep-learning model inference on diverse edge devices. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys’21), Suman Banerjee, Luca Mottola, and Xia Zhou (Eds.). ACM, 81–93.
[80]
Li Lyna Zhang, Yuqing Yang, Yuhang Jiang, Wenwu Zhu, and Yunxin Liu. 2020. Fast hardware-aware neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR Workshops’20). Computer Vision Foundation/IEEE, 2959–2967.
[81]
Zhifei Zhang, Yang Song, and Hairong Qi. 2017. Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). IEEE, 4352--4360.
[82]
Yu Zhang and Qiang Yang. 2022. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering 34, 12 (2022), 5586–5609.
[83]
Ziyang Zhang, Huan Li, Yang Zhao, Changyao Lin, and Jie Liu. 2023. BCEdge: SLO-aware DNN inference services with adaptive batching on edge platforms. arXiv:2305.01519. Retrieved from https://arxiv.org/abs/2305.01519
[84]
Ziyang Zhang, Yang Zhao, and Jie Liu. 2023. Octopus: SLO-aware progressive inference serving via deep reinforcement learning in multi-tenant edge cluster. In Proceedings of the 21st International Conference on Service-Oriented Computing (ICSOC’23), Flavia Monti, Stefanie Rinderle-Ma, Antonio Ruiz Cortés, Zibin Zheng, and Massimo Mecella (Eds.). Lecture Notes in Computer Science, Vol. 14420, Springer, 242--258.

Published In

ACM Transactions on Embedded Computing Systems, Volume 23, Issue 4, July 2024, 333 pages
EISSN: 1558-3465
DOI: 10.1145/3613607
Editor: Tulika Mitra
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 29 June 2024
Online AM: 23 May 2024
Accepted: 07 May 2024
Revised: 09 April 2024
Received: 14 November 2023
Published in TECS Volume 23, Issue 4


Author Tags

  1. Deep neural networks
  2. on-device inference
  3. service-level objectives
  4. heterogeneity
  5. runtime adaptation
  6. multi-DNN execution


Funding Sources

  • Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI PhD Fellowships
