
CARIn: Constraint-Aware and Responsive Inference on Heterogeneous Devices for Single- and Multi-DNN Workloads

Published: 29 June 2024

Abstract

The relentless expansion of deep learning applications in recent years has prompted a pivotal shift toward on-device execution, driven by the urgent need for real-time processing, heightened privacy concerns, and reduced latency across diverse domains. This article addresses the challenges inherent in optimising the execution of deep neural networks (DNNs) on mobile devices, with a focus on device heterogeneity, multi-DNN execution, and dynamic runtime adaptation. We introduce CARIn, a novel framework designed for the optimised deployment of both single- and multi-DNN applications under user-defined service-level objectives. Leveraging an expressive multi-objective optimisation framework and a runtime-aware sorting and search algorithm (RASS) as the MOO solver, CARIn facilitates efficient adaptation to dynamic conditions while addressing resource contention issues associated with multi-DNN execution. Notably, RASS generates a set of configurations, anticipating subsequent runtime adaptation, ensuring rapid, low-overhead adjustments in response to environmental fluctuations. Extensive evaluation across diverse tasks, including text classification, scene recognition, and face analysis, showcases the versatility of CARIn across various model architectures, such as Convolutional Neural Networks and Transformers, and realistic use cases. We observe a substantial enhancement in the fair treatment of the problem’s objectives, reaching 1.92× when compared to single-model designs and up to 10.69× in contrast to the state-of-the-art OODIn framework. Additionally, we achieve a significant gain of up to 4.06× over hardware-unaware designs in multi-DNN applications. Finally, our framework sustains its performance while effectively eliminating the time overhead associated with identifying the optimal design in response to environmental challenges.

1 Introduction

In recent years, the pervasive growth of deep learning (DL) applications has catalysed a paradigm shift in the field of artificial intelligence, rendering on-device execution a critical imperative [69]. The burgeoning demand for sophisticated deep neural networks (DNNs) spans a myriad of domains, from computer vision to natural language processing, necessitating the deployment of these models directly on mobile devices. This shift from centralised to decentralised computation arises from the intrinsic requirements of real-time processing, enhanced privacy concerns, and the need for reduced latency in diverse applications. As a consequence, the optimisation of executing deep neural networks on-device has emerged as a paramount research frontier.
While the shift toward on-device execution of DNNs represents a pivotal advancement, it is not without its formidable challenges. Device heterogeneity, characterised by the diverse array of hardware and computational capabilities across mobile devices, remains a persistent hurdle. Moreover, emerging challenges, such as the simultaneous execution of multiple DNNs on a single device [60] and the need for dynamic runtime adaptation [64] to evolving environmental conditions, add layers of complexity to the optimisation landscape. Multi-DNN execution introduces intricate dependencies and resource contention issues, necessitating sophisticated orchestration strategies. Runtime adaptation, in turn, mandates the development of intelligent mechanisms capable of dynamically adjusting model parameters and system configurations to optimise performance in real-time scenarios. Addressing these challenges is paramount to unlocking the full potential of on-device deep learning, as it paves the way for the seamless integration of advanced artificial intelligence (AI) capabilities into the fabric of our interconnected devices.
Enhancing the work presented in Reference [61], namely OODIn, this article presents CARIn, a novel framework for the optimised deployment of both single- and multi-DNN applications on mobile devices. The initial work in Reference [61] focused primarily on presenting a new highly parametrised software architecture for DL mobile apps, optimising single-DNN applications and evaluating solely on the image classification task. In this article, we build upon the architecture of OODIn that allows us to efficiently modify model and system parameters and introduce two novel components to meet the new demands of model multi-tenancy and efficient runtime adaptability. First, we develop an expressive multi-objective optimisation (MOO) framework that allows us to capture both single- and multi-DNN workloads and to formally model the performance requirements and constraints of DL applications. Second, we present RASS, a runtime-aware MOO solver that enables rapid, low-overhead adaptation while sustaining high performance under dynamic conditions. Contrary to existing optimisers that yield a single execution plan for a given device [28, 62], our solver generates a set of configurations to accommodate potential variations in resource availability. This eliminates the necessity to continually adjust and resolve the MOO problem whenever a runtime issue arises. Additionally, we broaden the scope of targeted tasks and further augment this by conducting a comprehensive evaluation spanning various model architectures, including Convolutional Neural Networks (CNNs) and Transformers, across a spectrum of realistic scenarios characterised by diverse performance demands.

2 Background and Related Work

2.1 Problem Statement

The primary aim of a DL application is to consistently uphold its performance goals or service-level objectives (SLOs), often referred to as quality of service targets. These SLOs encompass a multifaceted range of critical metrics, including but not confined to accuracy, latency, throughput, memory utilisation, and energy consumption. Achieving and sustaining these objectives requires careful consideration of the specific demands inherent to a given application or system. It is crucial to recognise that this challenge is compounded by two primary factors: (a) the inherent heterogeneity and (b) the dynamic nature of mobile and embedded devices. Adding to this complexity is the increasingly prevalent use of multiple models within DL applications, which places additional demands on the already intricate ecosystem of these devices.

2.1.1 Device Heterogeneity.

Compact devices feature a wide array of hardware configurations, characteristics, and capabilities, leading to a significant level of diversity [1, 2, 22, 66]. This diversity manifests not only across distinct devices, which is referred to as “inter-device heterogeneity,” but also within individual devices, a concept known as “intra-device heterogeneity.” Inter-device heterogeneity reflects the variations in size, processing power, memory capacity, energy efficiency, and more, across different devices. For example, a high-end smartphone will have markedly different hardware specifications than a low-cost IoT sensor node, illustrating the extent of diversity that exists across various devices. Intra-device heterogeneity, on the other hand, arises from the presence of multiple hardware components and subsystems within a device, each with its distinct attributes. In a standard smartphone setup, for instance, one may encounter a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and a Neural Processing Unit (NPU), all with varying clock speeds, energy consumption profiles, memory requirements, and parallelisation capabilities. Due to device heterogeneity, it is very challenging to design a universal DL model that performs efficiently across devices. For example, a model meticulously crafted and optimised to run seamlessly on Google’s Edge Tensor Processing Unit (TPU) may encounter performance issues and inefficiencies when deployed on a different processor, such as a conventional mobile GPU.
As a result, rather than pursuing a one-size-fits-all approach, where a single model is expected to excel universally, the focus shifts toward creating models that are fine-tuned and optimised for the unique features of each target device. These device-specific models are designed to leverage the strengths of the hardware and maximise performance, thereby addressing the inherent challenges associated with device heterogeneity.

2.1.2 Dynamic Environment.

In contemporary mobile and embedded computing ecosystems, the concurrent execution of multiple applications and processes is commonplace. This inherent multi-tasking characteristic introduces notable fluctuations in resource availability and workload demands, thereby rendering the acquisition of sufficient resources for performant task execution a challenging endeavour. Due to environment dynamicity, it is very challenging for a static execution configuration to consistently satisfy the application’s SLOs at any given time. For instance, if a user runs the application outdoors on a hot day, then the device’s temperature may rise and thermal throttling mechanisms can be triggered, causing the CPU or GPU to reduce their clock speeds to prevent overheating [54], resulting in reduced throughput or execution slowdown.
Such scenarios necessitate the ability to dynamically adapt to changing conditions and varying resource availability in real time. This adaptive behaviour is vital to ensure that the application consistently maintains satisfactory performance levels despite the variability in its operational environment.

2.1.3 Multiple DNNs.

Today’s growing demand for more advanced and intelligent systems has given rise to scenarios that mandate the simultaneous utilisation of multiple models, often referred to as “multi-DNN” configurations [60]. This paradigm shift is mainly driven by the need to address specific tasks or solve complex problems that demand a diversified approach, benefiting from the combined expertise of multiple specialised models. Multi-DNN applications showcase a high degree of adaptability in employing multiple models, as they can be harnessed to address a singular, intricate task [24, 73] or several distinct, autonomous tasks [8, 36]. In the former scenario, models typically exhibit interdependence and may require sequential execution, while in the latter scenario, models operate independently, affording them the capability to run in parallel. The efficient deployment of multi-DNN configurations introduces intricacies that pertain to resource allocation and load distribution. Particularly, parallel model execution presents a notably more intricate challenge compared to sequential execution, as models compete for the device’s finite resources. This concurrent operation, coupled with the simultaneous management of multiple tasks, amplifies the overall workload and poses new challenges to resource allocation and coordination.
The attainment of seamless orchestration and effective collaboration among multiple specialised models while upholding stringent performance and quality benchmarks stands as a substantial and multifaceted challenge within the domain of multi-DNN applications.

2.2 Related Work

The field of on-device deep learning has witnessed significant advancements in addressing the challenges posed by device heterogeneity and environment dynamicity in both single- and multi-DNN use cases. This progress reflects a profound shift in the landscape of deep learning research, where a growing emphasis has been placed on ensuring that DL models not only function effectively but also meet specific SLOs across a spectrum of computational environments, ranging from powerful high-end devices to resource-constrained edge computing platforms. Moreover, recent efforts have explored the integration of MOO techniques to further enhance the adaptability and efficiency of on-device DL solutions.

2.2.1 Service-level Objectives.

Most of the prior works directed toward achieving SLOs have predominantly concentrated on scenarios involving multiple DNNs, primarily investigating the tradeoff between accuracy and latency. Within this domain, the majority is focused on system development for edge servers [13, 29, 53, 83, 84], and only a limited number of studies have been devoted to on-device execution, specifically by orchestrating multiple inference requests across heterogeneous processors [24, 51, 73]. In contrast, our proposed framework demonstrates the capability to accommodate a diverse array of SLOs, including but not limited to accuracy, latency, memory footprint, size, and energy consumption, tailored for both single- and multi-DNN on-device applications.

2.2.2 Multi-objective Optimisation.

MOO has been widely employed in conjunction with neural architecture search (NAS) methodologies, culminating in the development of Multi-Objective Neural Architecture Search. This approach is particularly valuable for (a) designing DNNs that optimise not only accuracy but also resource consumption [11, 23, 56] and (b) compressing pretrained models [18, 40, 63]. Notably, our framework represents one of the pioneering efforts to formulate and address device-specific MOO problems to achieve specific SLOs at the system level.

2.2.3 Device-specific Solutions.

The majority of endeavours aimed at addressing device heterogeneity predominantly concentrate on the model level, i.e., by identifying the most fitting DL architecture tailored to a specific hardware platform. Among the prominent model-level methodologies, NAS and model scaling have a central role.
Hardware-aware NAS (HW-NAS) approaches seek to optimise DNN architectures both for high predictive accuracy and for efficient execution on a target deployment platform. Their most prominent premise is the inclusion of (a) hardware constraints, and (b) latency, energy, and other system metrics, as objectives during the search process [3, 4, 65, 80]. HW-NAS usually involves performance prediction to guide the search algorithm. Nonetheless, estimating precise latency, memory, or energy figures can be challenging, and the method’s effectiveness heavily relies on the accuracy of these estimates. Such approaches can also be computationally intensive due to the need to train and evaluate a large number of candidate architectures.
Supernet-based NAS, also known as One-shot NAS, is an approach that leverages a supernet along with weight sharing to facilitate efficient architecture search [6, 30, 35, 64]. A supernet is a network containing all possible architectural choices of a given search space, and it enables the exploration of diverse neural architectures while significantly reducing computational overhead. While this approach reduces the training-time computational requirements, it may not be as effective at tailoring architectures to specific hardware constraints and weight sharing may restrict fine-grained control over architectural decisions.
Last, model scaling involves adjusting parameters such as the depth, width, and input size of a DNN to strike a balance between accuracy and efficiency [10, 58, 68]. This technique is often applied along with NAS or knowledge distillation methods to also accommodate resource constraints. However, model scaling might not fully exploit the unique hardware characteristics of specific devices, potentially leading to sub-optimal performance.

2.2.4 Runtime Adaptation.

In the context of runtime adaptation, research efforts span both the model and system levels. On the model level, the primary focus revolves around the development of techniques designed to dynamically adjust the model’s architecture in response to fluctuations in resource availability. These adaptive models possess the capability to modify their architecture and parameters in real time during inference, effectively responding to the evolving constraints of the computing environment. Prominent examples of such models comprise adaptive supernets [7, 16, 43, 64], adaptive model scaling [20, 72, 77], multi-branch networks [17], early-exit models [4, 34], and a variety of other innovative approaches. However, crafting adaptive mechanisms that seamlessly function across a diverse range of devices can pose significant technical challenges and the adaptability of these networks may introduce certain computational overhead, potentially impacting the performance of real-time applications. At the system level, complementary methods come into play, including dynamic compression [38], adaptive model selection [31], and efficient scheduling on available hardware [70], among others.

2.2.5 Multi-DNN Inference.

To facilitate multi-DNN inference, researchers have also explored solutions at both the model and system levels [60]. From the model perspective, the execution of multiple DNNs aligns closely with the principles of multi-task learning [21, 49, 78, 82], a technique that trains a single model to perform multiple related tasks simultaneously. Consequently, using a single multi-task model for inference can replace the need for concurrent inferences from multiple models. At the system level, most research efforts leverage the heterogeneous processors available on devices and aim to identify the highest-performing mapping strategy. This is typically achieved through approaches that partition the model at the layer level [24, 26, 27, 67] or by introducing task-level priorities [37]. Additionally, there exists a body of research focused on multi-tenant inference systems [12, 75], albeit predominantly concentrating on server-based configurations [71, 74] rather than on-device implementations [60].

3 Overview of CARIn

3.1 Proposed Solution

CARIn addresses the main challenges of on-device DL inference (Section 2.1) in two ways. First, we introduce a novel approach of modelling DL applications, utilising a MOO framework to encapsulate their characteristics (Section 4). Given the rising number and diversity of DL applications, CARIn is able to analytically represent their various performance requirements and constraints, with the required expressivity to support both single- and multi-DNN scenarios. Second, to enable runtime adaptation, we introduce RASS, a runtime-aware MOO solver that allows for low-overhead and effective dynamic adjustment of the execution. The key principle behind RASS’s design is to explicitly consider during the MOO solution stage that adaptation may subsequently be required at deployment time. As such, RASS operates in two steps: (i) It generates a set of alternative execution configurations with diverse tradeoffs prior to deployment and (ii) configures the inference engine with a policy of switching among them.
Toward alleviating the impact of device heterogeneity and resource fluctuation, CARIn operates exclusively at the system level, bypassing the need to produce an optimal model for each target device. Model-level solutions typically include the design, exploration, training, and adaptation of a DNN’s architecture to specific target devices and resource availability changes. These procedures can be cumbersome, time-consuming, and lead to complex pipelines. Instead, our framework employs a repository of pre-trained models with varying architectures and complexities. The singular requisite action in relation to the models entails the application of post-training quantisation (Section 6.1).
The design of our framework was driven by the fact that satisfying SLOs depends not only on the target model but also on the specific target device, especially the processor in use. Consequently, CARIn’s primary objective is to determine, at any given time, the most suitable model-processor pair (or pairs) for a specified device. Internally, our MOO framework expresses this as a DL-based, device-centric problem, to effectively capture both the application’s SLOs and the unique characteristics of the target device. Given the device-specific nature of our MOO formulation, a distinct optimisation problem is formed for each given device, effectively circumventing the challenge posed by device heterogeneity. Additionally, to facilitate real-time adaptation, CARIn leverages the device’s intra-device heterogeneity, specifically the array of available processors, as well as the range of solutions offered by the RASS solver, which allow the adoption of a swift and efficient switching mechanism between execution plans.

3.2 Workflow

Figure 1 depicts CARIn’s operational flow, which is divided into the sequential offline and online phases. The offline component is responsible for constructing and resolving the device-specific MOO problem. Then, at runtime, the online component’s Runtime Manager (RM) constantly monitors the application’s dynamic behaviour, ensuring real-time adaptation to emergent changes. Algorithm 1 presents a comprehensive top-level overview of our framework, delineating its primary components and illustrating their main operations. These operations will be thoroughly elucidated in Section 4. The input parameters of our framework are as follows: (a) the designated DL task(s) associated with the application, (b) the stipulated SLOs, and (c) the target device’s characteristics, while the outputs consist of (a) the set of solutions (designs \(\mathcal {D}\)) and (b) the switching policy (SP).
Fig. 1. High-level workflow of CARIn.
The specified DL task or tasks dictate the set of models to be considered during the optimisation process. A model in CARIn is represented by the following tuple:
\[\begin{gather*} m = (arch, params, s_{\text{in}}, task, ds, pr), \end{gather*}\]
where arch is the model’s architecture (i.e., layers and connections), params are the model’s trained parameters, \(s_{\text{in}}\) is the input size, task is the target DL problem, ds is the name of the corresponding DL testing dataset, and pr is the numerical precision to account for quantised models.
The target device defines the hardware resources at the system’s disposal, which are represented by the tuple:
\[\begin{gather*} hw = (ce, op(ce)), \end{gather*}\]
where \(ce \in\mathcal {CE}\) is the compute engine (i.e., processor) performing the inference computations and \(op(ce)\) is a set of options tied to the given processor, e.g., the number of CPU threads or the GPU’s numerical precision. The tuple of tunable system parameters can be extended to capture a more detailed space, e.g., by including the DVFS governor selection that determines the dynamic voltage and frequency scaling policy of the device [61].
An individual model m running under the selected system parameters hw represents a single execution configuration:
\begin{equation} e = \left\lt m, hw\right\gt \in \mathcal {E}. \tag{1} \end{equation}
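To make the notation concrete, the following Python sketch encodes the m, hw, and e tuples as plain dataclasses; the field values in the example are illustrative and do not correspond to CARIn’s actual implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Model:              # m = (arch, params, s_in, task, ds, pr)
    arch: str             # architecture (layers and connections)
    params: str           # trained parameters, e.g., a checkpoint file
    s_in: tuple           # input size
    task: str             # target DL problem
    ds: str               # testing dataset
    pr: str               # numerical precision (FP32, FP16, INT8, ...)

@dataclass(frozen=True)
class Hardware:           # hw = (ce, op(ce))
    ce: str               # compute engine, e.g., "CPU", "GPU", "NPU"
    op: tuple             # engine-specific options, e.g., number of CPU threads

@dataclass(frozen=True)
class ExecutionConfig:    # e = <m, hw>, Equation (1)
    m: Model
    hw: Hardware

# Illustrative configuration: an 8-bit MobileNet V2 running on a 4-thread CPU.
e = ExecutionConfig(
    Model("MobileNetV2", "mobilenet_v2.tflite", (224, 224, 3),
          "image_classification", "ImageNet-1k", "FFX8"),
    Hardware("CPU", ("num_threads=4",)),
)
```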
During the MOO Problem Formulation stage (lines 1–7), CARIn considers every generated space of execution configurations, \(\mathcal {E}_i\), to form the problem’s decision space, \(\mathcal {X}\), depending on whether the application requires single- or multi-DNN execution. At the same time, the application’s SLOs delineate the MOO problem’s objective functions and constraints, denoted as \(f_i\) and \(g_j\), respectively. Once the problem is formulated for the target device, the Objective Function Evaluation stage evaluates each function for every \(x \in \mathcal {X}\) (line 8). Following this, CARIn’s MOO Problem Solver solves the MOO problem (lines 9–12). The functions CalculateOptimality, Sort, and Search shown in Algorithm 1, which constitute the three stages of the solver, are discussed in detail in Section 4.3.
In order for CARIn to accommodate runtime adaptation, it is important to establish a robust system for perpetually monitoring the dynamic aspects of the executing application and the state of the device itself. This ongoing vigilance enables timely recognition of abrupt alterations in operational conditions, thereby facilitating immediate corrective measures. We call this subsystem the RM. The output of CARIn’s solving algorithm consists of a set \(\mathcal {D}\) of highest-performing solutions, called designs, which are passed to RM along with the appropriate SP. Leveraging a collection of periodically captured statistics s from the Application’s runtime, the RM module has the ability to discern dynamic changes in resource allocation (c in Algorithm 1) and rapidly switch to an alternative design \(d_{\text{new}}\) to effectively and robustly meet the application-level SLOs (lines 13–18).

4 Multi-Objective Optimisation Framework

MOO constitutes a mathematical and computational approach employed to find the best solutions or tradeoffs in scenarios that involve multiple interrelated and, at times, antagonistic objectives [15, 44]. The appropriateness of a MOO framework for our problem is underscored by (a) the inherent nature of DL application SLOs, which typically comprise objectives that exhibit conflicts, and (b) the inherent attribute of MOO to yield a solution space rich in diversity, which, in turn, can enable dynamic adaptation.

4.1 MOO Problem Formulation

For CARIn’s DL-based MOO formulation, we adopt the following mathematical description:
\[\begin{gather*} \begin{aligned}& \text{min/max} && f_i(x), && 1 \le i \le N \\ & \text{subject to} && g_j(x) = g_j(h_j(x)) \le 0, && 1 \le j \le P, \end{aligned} \end{gather*}\]
where x denotes the decision variable, N is the number of objective functions, \(f_i(x)\) is the ith objective function, P is the number of inequality constraints, and \(g_j(x)\) is the jth inequality constraint, which is always a composite function of a given inner function \(h_j(x)\). Note that when there is only a single objective function (\(N\!=\!1\)), then the problem is reduced to single-objective optimisation (SOO). The problem’s objective functions and constraints are extracted from the application’s SLOs, which can be split into two categories:
Broad SLOs: Such objectives define the problem’s objective functions and come in the form of \(\left\lt min/max, p \right\gt\), where p is a DL-related performance metric. For instance, \(\left\lt max, mIoU\right\gt\) means that the mean Intersection-over-Union (mIoU) accuracy metric should be maximised for an image segmentation task. For CARIn, this objective translates to the maximisation of the objective function \(f(x) = A(x) = mIoU(x)\).
Narrow SLOs: These objectives define the problem’s constraints and come in the form of \(\left\lt min/max/avg/std/n{\text{th}}, p, v \right\gt\), which means that the minimum, maximum, average, standard deviation or nth percentile value of p is bounded by a target value v. For instance, \(\left\lt avg, L, 15\right\gt\) means that the average latency needs to be less than 15 ms, which translates to the constraint \(g(x) \le 0\), where \(g(x) = g(h(x)) = g(L(x)) = \overline{L}(x) - 15\).
Given that both types of objectives concern the same set of performance metrics, it follows that both the objective and inner functions, \(f_i(x)\) and \(h_j(x)\), share a common function space, denoted by \(\mathcal {F}\), which encompasses the entirety of available functions associated with various DL performance metrics. For this reason, in cases where the application defines constraints without explicitly specifying objective functions, CARIn can duly regard all specified inner functions \(h_j(x)\) as objective functions as well.
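To illustrate how the two SLO categories map onto \(f_i\) and \(g_j\), the short sketch below encodes the examples above (maximise mIoU; average latency below 15 ms) as Python functions over a configuration’s profiled metrics; the dictionary-based metric lookup is an assumption made purely for this example.

```python
import statistics

# Hypothetical profiled metrics for one execution configuration x:
# per-inference latency samples (ms) and the measured mIoU.
x = {"latency_ms": [12.1, 13.4, 14.0, 12.8], "mIoU": 0.71}

# Broad SLO <max, mIoU>  ->  objective function f(x) = A(x) = mIoU(x).
def f_accuracy(cfg):
    return cfg["mIoU"]

# Narrow SLO <avg, L, 15>  ->  constraint g(x) = mean(L(x)) - 15 <= 0.
def g_latency(cfg):
    return statistics.mean(cfg["latency_ms"]) - 15.0

print(f_accuracy(x))        # 0.71, to be maximised
print(g_latency(x) <= 0)    # True: x satisfies the latency constraint
```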

4.1.1 Single-DNN Setting.

When there is only one DL task to optimise, the decision variable x is a single execution configuration e, as defined in Equation (1). Therefore, the execution configuration space \(\mathcal {E}\) effectively transforms into the decision space \(\mathcal {X}\):
\[\begin{gather*} x_{\text{single}} = e = \left\lt m, hw\right\gt \in \mathcal {X}_{\text{single}} = \mathcal {E}. \end{gather*}\]
For the objective functions, CARIn leverages the following DNN-specific performance metrics:
Size (S): Size is conventionally represented by either the total count of parameters within the neural network or the physical file size of the model stored in memory.
Workload (W): This metric is typically measured in terms of numerical operations, such as floating-point operations (FLOPs) or multiply-accumulate operations.
Accuracy (A): Accuracy is contingent upon the specific DL task in question, e.g., top-1 accuracy for classification tasks or exact match for question answering tasks.
Latency (L): Latency delineates the temporal lag between the transmission of input data to the DNN model and the reception of the corresponding output. It is quantified in units of milliseconds or seconds.
Throughput (TP): Throughput provides an indication of the model’s real-time processing capabilities and is computed as the total number of input samples (batch size), divided by the total inference latency. This metric is denominated in samples per second (e.g., images per second when images constitute the inputs).
Energy Consumption (E): This metric is of paramount importance for the evaluation of the energy efficiency of DNN applications in resource-constrained environments and is measured in energy units, such as watt-hours or joules.
Memory Footprint (MF): Memory footprint encapsulates the extent of random-access memory (RAM) required for the loading and execution of a DNN. It is traditionally assessed in terms of memory size units, such as megabytes (MB) or gigabytes (GB).
Overall, the set of potential objective functions in single-DNN cases is denoted as follows:
\[\begin{gather*} \mathcal {F}_{\text{single}} = \lbrace S, W, A, L, TP, E, MF \rbrace , \end{gather*}\]
collectively empowering a multifaceted assessment of DNN models and providing a holistic understanding of their performance across diverse dimensions.
It is important to recognise that the latency and energy consumption metrics are subject to inherent fluctuations when executing DNNs on mobile devices. These fluctuations can arise due to various factors, including device load, temperature, input values, and other environmental variables (see Section 4.3.2). As a result, relying on a single, instantaneous value may not provide a robust and representative assessment of system performance. To account for these fluctuations, CARIn considers statistical measures, such as the average or maximum energy consumption or the variance of the latency, as objective functions.
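As a minimal illustration of such statistical objectives, the following snippet aggregates hypothetical per-inference measurements into the average, maximum, and variance values that would serve as objective functions.

```python
import statistics

latency_ms = [11.8, 12.3, 15.1, 12.0, 13.7]   # hypothetical per-inference latencies
energy_mj  = [41.0, 43.5, 48.2, 40.9, 44.1]   # hypothetical per-inference energy (mJ)

avg_latency = statistics.mean(latency_ms)      # average-latency objective
var_latency = statistics.variance(latency_ms)  # latency-variance objective
max_energy  = max(energy_mj)                   # worst-case energy objective
```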

4.1.2 Multi-DNN Setting.

When there are M independent DNNs to optimise jointly, the decision variable x comprises M distinct execution configurations \(e_i\), \(1 \le i \le M\). Hence, the decision space \(\mathcal {X}\) is an M-dimensional space, where each component of the decision variable can separately take values in the corresponding execution configuration space \(\mathcal {E}_i\):
\[\begin{gather*} x_{\text{multi}} = \lbrace e_1, \ldots , e_M \rbrace = \lbrace \left\lt m, hw\right\gt _1, \ldots , \left\lt m, hw\right\gt _M \rbrace \in \mathcal {X}_{\text{multi}} = \mathcal {E}_1 \times \cdots \times \mathcal {E}_M. \end{gather*}\]
The array of potential objective functions is expansively broadened to encompass an additional triad of performance metrics pertaining to parallel execution [12, 60]:
Normalised Turnaround Time (NTT): NTT serves as a quantifier of the perceived execution slowdown during multi-DNN execution. The NTT for the ith DNN is computed as follows:
\[\begin{gather*} NTT_i = \frac{L_i^{\text{M}}}{L_i^{\text{S}}}, \end{gather*}\]
where \(L_i^{\text{S}}\) and \(L_i^{\text{M}}\) are the average latencies of the ith DNN under the single- and multi-DNN modes. \(NTT_i\) is a value greater than or equal to 1, with lower values indicating superior performance. For the sake of standardisation across models, it is common practice to calculate the average or maximum NTT.
System Throughput (STP): STP quantifies the accumulated single-DNN progress under multi-DNN execution and is computed as follows:
\[\begin{gather*} STP = \sum _{i=1}^{M} NP_i = \sum _{i=1}^{M} \frac{1}{NTT_i} = \sum _{i=1}^{M} \frac{L_i^{\text{S}}}{L_i^{\text{M}}}, \end{gather*}\]
where \(NP_i\) is the normalised progress of the ith DNN. Its maximum magnitude is M, with higher values signifying enhanced performance.
Fairness (F): The concept of fairness in a multi-DNN execution environment is contingent upon the equitable relative progress experienced by co-executing DNNs, in comparison to their single-DNN execution counterparts. Fairness, as denoted herein, is quantified as the minimum ratio of normalised progress rates observed among any two DNNs concurrently operating within the system:
\[\begin{gather*} F = \min _{i, j} \frac{NP_i}{NP_j}. \end{gather*}\]
This metric adheres to a higher-is-better paradigm with values within the range \([0, 1]\), where 0 signifies an absence of fairness and 1 perfect fairness.
As a consequence, we augment CARIn’s objective function set to encompass both single-DNN metrics, which pertain to individual tasks or DNNs, and multi-DNN metrics, which characterise the collective performance of the entire system during concurrent execution:
\[\begin{gather*} \mathcal {F}_{\text{multi}} = \lbrace S_i, W_i, A_i, L_i, TP_i, E_i, MF_i \rbrace \cup \lbrace STP, NTT, F \rbrace , 1 \le i \le M. \end{gather*}\]
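The sketch below computes NTT, STP, and F directly from the formulas above, using hypothetical single- and multi-DNN average latencies for a two-model workload.

```python
# Average latencies (ms) in isolation (single-DNN) and under co-execution (multi-DNN).
L_single = [20.0, 50.0]     # hypothetical values for DNN 1 and DNN 2
L_multi  = [28.0, 90.0]

ntt = [lm / ls for lm, ls in zip(L_multi, L_single)]   # NTT_i = L_i^M / L_i^S
np_ = [1.0 / t for t in ntt]                           # normalised progress NP_i = 1 / NTT_i
stp = sum(np_)                                         # STP, upper-bounded by M
fairness = min(np_[i] / np_[j]                         # F = min_{i,j} NP_i / NP_j
               for i in range(len(np_)) for j in range(len(np_)))

print(ntt)                                 # [1.4, 1.8]
print(round(stp, 2), round(fairness, 2))   # 1.27 0.78
```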

4.2 Objective Function Evaluation

Upon the formulation of the device-specific MOO problem, it becomes necessary to assess each objective function across the entire set of decision variables \(x \in \mathcal {X}\). Assessing these functions is straightforward for certain objectives; however, it presents challenges for device-dependent functions like E and MF, and those relying on latency, including L, TP, STP, NTT, and F. The approach adopted by CARIn for this evaluation involves the profiling of functions on individual target devices. In practical terms, this entails the deployment of all candidate models on each target device, followed by the measurement of each device-reliant objective function for all feasible model-processor combinations. We acknowledge that this procedure, albeit comprehensive, is inherently time-consuming and, in many instances, such as in multi-DNN cases, infeasible for seamless integration into real-world scenarios and practical applications. However, the optimisation of the evaluation process itself does not constitute a primary objective of this work. We extensively discuss potential enhancements of this aspect in Section 8.

4.3 MOO Problem Solver

Following the formulation of the problem and the evaluation of objective functions, the conclusive stage involves resolving the optimisation problem. The initial step of the optimisation process is to apply the problem’s constraints. Consequently, the decision variables are bound to the constrained decision space \(\mathcal {X}^{\prime }\), defined as:
\[\begin{gather*} \mathcal {X}^{\prime } = \lbrace x\,|\, g_j(x) \le 0, \forall j \rbrace . \end{gather*}\]
MOO problems are frequently addressed using evolutionary algorithms, such as NSGA-II, SMS-EMOA, and MOEAD, or swarm-based algorithms, such as Ant Colony Optimisation and Particle Swarm Optimisation [52]. These algorithms systematically explore the decision variable space to discover the Pareto frontier, which represents the optimal tradeoffs among conflicting objectives. While these algorithms excel in identifying the optimal execution configuration, we acknowledge that potential runtime issues may either alter the solution space, consequently affecting the Pareto frontier of a MOO problem, or introduce new constraints that were not considered during the problem’s formulation. Consequently, to address the potential decline in performance, it becomes imperative to rerun these algorithms whenever a runtime issue arises; however, such repetitive executions are impractical for real-life applications and systems.
To address this challenge, we introduce a runtime-aware sorting and search algorithm, denoted as RASS, whose primary goal is to solve a device-specific MOO problem once, while concurrently addressing potential future runtime challenges. To achieve this, RASS considers both non-dominated and dominated solutions in a predictive manner, estimating the impact of possible runtime issues. In addition to providing the initial solution \(d_0\), RASS also yields a set of supplementary runtime designs \(d_i\), which serve as a proactive measure for runtime adaptation, i.e., in instances where the currently employed design encounters performance issues. This approach alleviates the need for repetitive executions of optimisation algorithms.
The operation of RASS involves a sorting stage followed by a search stage. To accommodate both non-dominated and dominated solutions, our solving algorithm initially sorts candidate solutions according to their optimality (Section 4.3.1), a metric quantifying the distance from the problem’s utopia point. Subsequently, based on this sorting, RASS identifies a set of solutions (Section 4.3.4) representing the various execution plans of the application that correspond to possible runtime issues (Section 4.3.2), along with a switching policy facilitating prompt transitioning between them (Section 4.3.3) for the RM module.

4.3.1 Optimality.

To quantify optimality for a given candidate solution \(x \in \mathcal {X}^{\prime }\), we first calculate the weighted Mahalanobis distance between the solution’s objective vector, which is defined as \(f(x) = [ f_{1}(x), f_{2}(x), \ldots, f_{n}(x) ]\), and the utopia point, represented as \(up = [ up_{1}, up_{2}, \ldots, up_{n} ]\):
\[\begin{gather*} d(x) = \sqrt {\sum _{i=1}^{n} w^2_i \frac{\left[ f_{i}(x) - up_i \right]^2}{s^2_i} }, \end{gather*}\]
where \(w_i\) is the user-supplied weight for the ith objective, \(s^2_i\) is the calculated variance of the ith objective, and each component of the utopia point depends on the corresponding objective function:
\[\begin{gather*} up_{i} = {\left\lbrace \begin{array}{l@{\quad}l} \max f_{i}, & \text{if } f_i \in \lbrace A, TP, STP, F\rbrace \\ \min f_{i}, & \text{if } f_i \in \lbrace S, W, L, E, MF, NTT\rbrace . \end{array}\right.} \end{gather*}\]
By utilising the Mahalanobis distance, we effectively accommodate the disparate scales of the diverse objectives. Consequently, optimality could also be regarded as a metric of fairness for the problem’s objective functions. However, it is important to acknowledge that these functions may carry distinct significance for the problem, and, hence, we afford users the opportunity to define weights, thereby introducing a formal mechanism for enabling tailored optimisation strategies. Notably, the calculated distances range within the interval \([0, d_{\text{max}}]\), where the maximum distance is as follows:
\[\begin{gather*} d_{\text{max}} = \sqrt {\sum _{i=1}^{n} w^2_i \frac{\left(\max f_{i} - \min f_i \right)^2}{s^2_i} }. \end{gather*}\]
This factor necessitates the use of normalisation, which results in the distance being confined to the \([0, 1]\) range:
\[\begin{gather*} d_{\text{s}}(x) = \frac{d(x)}{d_{\text{max}}}. \end{gather*}\]
The optimality metric for each \(x \in \mathcal {X}^{\prime }\) can then formally be defined as the reciprocal of the scaled weighted Mahalanobis distance; thus, its range is \([1, +\infty)\):
\[\begin{gather*} opt(x) = \frac{1}{d_{\text{s}}(x)}. \end{gather*}\]
Utilising these values, the candidate solutions are sorted in descending order, resulting in the creation of the sorted decision space \(\mathcal {X}_{\text{s}}\).
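The following sketch reproduces this sorting step for a toy two-objective problem (maximise accuracy, minimise average latency); the candidate values and the equal weights are illustrative, and the variance-scaled distance follows the formulas above.

```python
import math
import statistics

# Hypothetical candidates remaining after the constraints: (accuracy %, avg latency ms).
candidates = {"cfg_a": (75.2, 12.0), "cfg_b": (78.3, 30.0), "cfg_c": (71.9, 8.0)}
weights = (1.0, 1.0)                                  # user-supplied objective weights

acc = [v[0] for v in candidates.values()]
lat = [v[1] for v in candidates.values()]
var = (statistics.variance(acc), statistics.variance(lat))   # s_i^2 per objective
utopia = (max(acc), min(lat))                         # max for accuracy, min for latency

def distance(obj):   # weighted, variance-scaled distance to the utopia point
    return math.sqrt(sum(w * w * (o - u) ** 2 / s2
                         for w, o, u, s2 in zip(weights, obj, utopia, var)))

d_max = math.sqrt(sum(w * w * (max(vals) - min(vals)) ** 2 / s2
                      for w, vals, s2 in zip(weights, (acc, lat), var)))

optimality = {name: d_max / distance(obj)             # opt(x) = 1 / d_s(x) = d_max / d(x)
              for name, obj in candidates.items()}
X_sorted = sorted(optimality, key=optimality.get, reverse=True)
print(X_sorted)
```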

4.3.2 Runtime Challenges.

During the runtime of the application, a multitude of dynamic alterations in the device’s resource availability may occur. These fluctuations impact our problem formulation in different ways, thus necessitating targeted approaches for their management. CARIn focuses on addressing two main challenges, regarding the processors and memory of the target device:
Processor Overload or Overheating: Processor-related concerns manifest when the processor in use is continuously subjected to sustained processing demands exceeding its peak processing capacity, primarily due to resource-intensive computational tasks. The protracted imposition of such an overload condition may subsequently lead to overheating, i.e., the escalation of the System-on-Chip (SoC) temperature to a critical and potentially harmful level. Overheating may also result from insufficient cooling mechanisms or other impediments hindering the effective dissipation of heat by the SoC. As a protective measure against potential harm, mobile SoCs are equipped with thermal throttling capabilities, which are activated when temperatures exceed predefined thresholds. Thermal throttling encompasses the deliberate reduction of the processor’s clock speed and performance to mitigate heat generation and maintain a safe temperature range. The intricate interplay between processor overload and overheating significantly impacts performance and power consumption, underscoring the significance of diligent management and effective mitigation strategies.
Variability in RAM Utilisation: Owing to the multifaceted nature of mobile devices, the utilisation of RAM is also characterised by dynamic fluctuations. Within the execution scope of an application, numerous ancillary applications, processes, or services continually initiate and terminate in the background, potentially culminating in an unforeseen saturation of RAM capacity. Consequently, this phenomenon may precipitate performance-related challenges, encompassing lag, application crashes, and an overall deceleration of device functionality. Furthermore, the perpetual management of excessive RAM consumption may also entail elevated power consumption, thereby engendering consequential ramifications.

4.3.3 Model/Processor Switching.

In response to runtime fluctuations, CARIn’s RM adopts a strategic approach that involves altering either the model (change model, CM), the processor (change processor, CP), or both (change both, CB) within the current execution plan. These three fundamental adjustments serve as effective measures for mitigating the challenges encountered during runtime. To this end, we introduce a prioritisation scheme. In the case of processor-related phenomena, CARIn prioritises transferring DNN execution from the currently used processors to inactive ones (CP or CB). This transition allows the overloaded or overheated processor to dissipate excess heat and gradually restore its performance. In cases where migration is not a viable option, such as in devices limited solely to CPU usage or multi-DNN scenarios where all processors are occupied, CARIn employs an alternative approach that involves replacing the current model with one of reduced computational workload (CM). Conversely, addressing the memory-related issue involves transitioning to a more compact model either on the same (CM) or a different processor (CB).

4.3.4 Design Selection and Switching Policy.

A primary principle guiding the design of RASS is to ensure low complexity to facilitate rapid switching. This objective manifests in the generation of a relatively small number of designs, which in turn offers two additional distinct benefits: First, minimise storage requirements for the models and, second, maintain a concise switching policy comprising only a limited number of transition rules. For RM to determine the appropriate timing to transition to a new execution plan, several system parameters, related to (a) the workload distribution across processors and (b) the aggregate memory utilisation, need to be continuously monitored. These parameters are represented by the Boolean variables \(c_{ce}\) and \(c_{\text{m}}\), indicating the presence of issues pertaining to a processor ce and the memory, respectively.
The first step toward identifying the solutions to the problem is to determine the sets of different model-to-processor mappings viable for processor switching, i.e., for reallocating DL execution to idle processors. Symbolising the number of these sets as T, in consideration of RASS’s need for simplicity, if \(T\gt 3\), then we retain only the top three sets, corresponding to the highest attained optimality scores. Next, we partition our sorted decision space \(\mathcal {X}_s\) into T distinct subspaces \(\mathcal {X}_i\), each corresponding to specific model-to-processor mappings and arranged in descending order of observed optimality. Regarding processor-related phenomena, we select designs associated with the highest optimality score within each set:
\[\begin{gather*} d_i = \mathcal {X}_i[0], i=0,\ldots ,T-1. \end{gather*}\]
For the memory-related issue, we extract the solution with the smallest memory footprint:
\[\begin{gather*} d_{\text{m}} = \operatorname*{argmin}_{\substack{x}} MF(x), \, \, x \in \mathcal {X}_i, \, \, i=0,\ldots ,T-1. \end{gather*}\]
Last, we extract complementary designs for two extreme (highly improbable) scenarios. The first one arises when all related processors present an issue, while the memory does not, prompting the extraction of the solution with the lightest workload:
\[\begin{gather*} d_{\text{w}} = \operatorname*{argmin}_{\substack{x}} W(x), \, \, x \in \mathcal {X}_i, \, \, i=0,\ldots ,T-1, \end{gather*}\]
and the second scenario surfaces when both the processors and memory encounter issues simultaneously, necessitating the identification of the solution that strikes the optimal balance between memory usage and workload among \(d_m\) and \(d_w\):
\[\begin{gather*} d_{\text{wm}} = {\left\lbrace \begin{array}{l@{\quad}l} d_{\text{w}}, & \text{if } C\left(MF(d_{\text{w}}), W(d_{\text{w}})\right) \lt C\left(MF(d_{\text{m}}), W(d_{\text{m}})\right) \\ d_{\text{m}}, & \text{else}, \end{array}\right.} \end{gather*}\]
where we use the normalised sum to compute the cost function C. Collectively, the set of designs is denoted as:
\[\begin{gather*} \mathcal {D} = \lbrace d_i, d_{\text{m}}, d_{\text{w}} \rbrace , i=0, \ldots , T-1, \end{gather*}\]
and therefore RASS can generate a maximum of five designs for a MOO problem, since \(T\le 3\).
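Under simplified assumptions (candidates already sorted by optimality, each tagged with its model-to-processor mapping and its profiled memory footprint and workload), the sketch below mirrors this selection procedure, using a min-max-normalised sum as the cost function C; the candidate values are made up.

```python
# Hypothetical candidates in descending order of optimality, each tagged with its
# model-to-processor mapping and its profiled memory footprint (MB) and workload (GFLOPs).
X_sorted = [
    {"name": "x1", "mapping": "GPU", "mf": 160.0, "w": 4.1},
    {"name": "x2", "mapping": "CPU", "mf":  95.0, "w": 1.2},
    {"name": "x3", "mapping": "GPU", "mf":  55.0, "w": 0.8},
    {"name": "x4", "mapping": "CPU", "mf":  60.0, "w": 0.6},
]

mappings = list(dict.fromkeys(x["mapping"] for x in X_sorted))[:3]   # keep the top T <= 3 sets
pool = [x for x in X_sorted if x["mapping"] in mappings]

# d_i: the highest-optimality candidate within each mapping set.
designs = [next(x for x in X_sorted if x["mapping"] == m) for m in mappings]

d_m = min(pool, key=lambda x: x["mf"])     # smallest memory footprint
d_w = min(pool, key=lambda x: x["w"])      # lightest workload

def norm(v, lo, hi):                       # min-max normalisation for the cost function C
    return (v - lo) / (hi - lo) if hi > lo else 0.0

mf_lo, mf_hi = min(x["mf"] for x in pool), max(x["mf"] for x in pool)
w_lo, w_hi = min(x["w"] for x in pool), max(x["w"] for x in pool)
cost = lambda x: norm(x["mf"], mf_lo, mf_hi) + norm(x["w"], w_lo, w_hi)
d_wm = d_w if cost(d_w) < cost(d_m) else d_m

print([d["name"] for d in designs], d_m["name"], d_w["name"], d_wm["name"])
# ['x1', 'x2'] x3 x4 x4
```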
After establishing the set of designs, RASS’s final step involves crafting the rule-based switching policy, which serves as a reference for the RM module, guiding its decision-making process each time the Boolean variables \(c_{ce}\) or \(c_{\text{m}}\) undergo a change in value. With the aim of ensuring simplicity and conciseness in the rule set, we ensure that the selection of a new design is contingent solely upon the state of the environmental variables and independent of the presently employed design. The rationale behind the construction of the rules is deliberately straightforward, as demonstrated in Section 7.2, where two representative use cases are presented and analysed.
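For illustration only (the actual rules for the evaluated use cases are presented in Section 7.2), the sketch below shows the kind of flag-to-design lookup the RM performs: the new design depends solely on the current values of the Boolean issue variables, never on the design currently in use. The flag combinations and design identifiers are hypothetical.

```python
# Illustrative switching policy for a device exposing a CPU and a GPU.
# Keys are (c_cpu, c_gpu, c_mem) issue flags; values name the design to switch to,
# following the d_i / d_m / d_w / d_wm notation of Section 4.3.4.
SWITCHING_POLICY = {
    (False, False, False): "d0",    # no issue: initial, highest-optimality design
    (True,  False, False): "d0",    # CPU issue: stay on the (e.g., GPU-mapped) design d0
    (False, True,  False): "d1",    # GPU issue: switch to the alternative mapping d1
    (False, False, True):  "d_m",   # memory pressure: smallest-footprint design
    (True,  True,  False): "d_w",   # all processors affected: lightest-workload design
    (True,  True,  True):  "d_wm",  # processors and memory: memory/workload compromise
}

def runtime_manager_step(c_cpu: bool, c_gpu: bool, c_mem: bool) -> str:
    """Return the design the RM should switch to for the current flag values."""
    return SWITCHING_POLICY.get((c_cpu, c_gpu, c_mem), "d_wm")

print(runtime_manager_step(c_cpu=False, c_gpu=True, c_mem=False))   # -> d1
```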

5 Implementation

CARIn is implemented in Java for the Android operating system. Its primary integration leverages the TensorFlow Lite (TFLite) package in its nightly build to facilitate on-device DNN execution, as well as its delegates to access mobile accelerators. The concurrent execution of multiple DNNs is achieved through the utilisation of the java.util.concurrent Java package.
To create the model suite used for our framework’s evaluation, i.e., for model retrieval, training, and preparation, TensorFlow (v2.12.0) was employed in Python. Our framework seamlessly interfaces with TensorFlow Hub and Hugging Face model repositories, which allows researchers to easily access and experiment with a wide range of pre-trained models on publicly available datasets for their specific use cases. Additionally, the TFLite Converter’s optimisation module was utilised to apply post-training quantisation to the models to enhance inference speed, efficiency and accelerator compatibility.
Regarding the objective function evaluation process, a diverse set of tools and libraries is employed. For accuracy assessments, we use custom evaluation scripts, as well as TFLite’s image classification evaluation tool for the ImageNet ILSVRC 2012 task. Furthermore, to comprehensively capture the model’s computational complexity and resource requirements, we use the tflite Python package to count model parameters and FLOPs. Last, to assess the on-device performance of the models, we employ the C++-based TFLite benchmark tool. This tool offers a comprehensive suite of measurements, encompassing execution time, memory utilisation, and other pertinent metrics, thereby providing a robust evaluation of the models’ real-world performance characteristics.

6 Experimental Methodology

In this section, we present the experimental methodology that underpins our research, offering insight into the comprehensive approach we have undertaken to investigate our study’s core objectives. We have structured this methodology into several key subsections, each addressing a crucial aspect of our experimental design. Figure 2 depicts the toolflow used to conduct our experiments.
Fig. 2. Toolflow for the evaluation of CARIn.

6.1 Quantisation

CARIn embraces post-training quantisation as one of the simplest and most mobile-friendly compression methods presently available, with benefits not only in model size but also in latency and memory requirements. Additionally, quantisation becomes indispensable for the execution of DNNs within Digital Signal Processors (DSPs) or NPUs designed to primarily support integer models [22], thus unlocking complete compatibility with mobile accelerators. Notably, additional methods that also introduce tradeoffs between accuracy and complexity, such as weight pruning or clustering, are orthogonal to our framework and amenable to integration. The potential synergy resulting from the combined application of various compression techniques merits further investigation.
Driven by the capabilities of the TFLite Converter, CARIn currently incorporates four distinct quantisation techniques, namely half-precision floating-point (FP16), 8-bit dynamic range (DR8), 8-bit fixed-point with float fallback (FX8) and full 8-bit fixed-point (FFX8). Table 1 enumerates the numerical types associated with inputs, outputs, weights, and activations for both the 32-bit floating-point (FP32) and the quantised models. It is important to note that the data type of the weights defines the storage requirements of the model. Specifically, FP16 quantisation leads to a 2× reduction in model size, while the remaining schemes (DR8, FX8, and FFX8) yield a 4× reduction in size. The operational procedures of these quantisation schemes are elucidated as follows:
Table 1. Quantisation Schemes

Scheme | Inputs & Outputs  | Weights | Activations
FP32   | fp32/int32/int64  | fp32    | fp32
FP16   | fp32/int32/int64  | fp16    | fp16/fp32
DR8    | fp32/int32/int64  | int8    | fp32
FX8    | fp32/int32/int64  | int8    | int8/fp32
FFX8   | int8/int32        | int8    | int8
FP16, by default, employs 16-bit floating-point computations, yet it possesses the flexibility to revert (fall back) to FP32 calculations when the hardware lacks support for 16-bit arithmetic. In such instances, the weights undergo a dequantisation process to 32-bit before the first inference. Concurrently, the activations are stored in 32-bit format. The most common processors with native support for FP16 operations are mobile GPUs.
In the case of DR8, weights are represented with 8 bits, while activations persistently remain in FP32. Nevertheless, certain activations may undergo dynamic quantisation during inference, utilising quantised kernels for faster execution. The utilisation of fixed-point arithmetic, whenever feasible, may result in reduced computation times compared to relying solely on floating-point arithmetic, contingent on the specific model’s characteristics.
FX8, analogous to FP16, represents an 8-bit equivalent and operates with integer kernels as the default mode of execution. However, it retains the ability to utilise 32-bit operators when integer implementations are unavailable on the given hardware (floating-point fallback). Importantly, in this scheme, the converted model maintains inputs and outputs in floating-point format, allowing the model itself to determine the quantisation parameters to minimise accuracy loss.
FFX8 enforces full integer quantization for all components of the model, encompassing weights, activations, operations, inputs, and outputs. This stringent quantisation scheme guarantees compatibility with integer-only devices and accelerators, such as microcontrollers, DSPs, and NPUs.
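As a sketch of how the four schemes can be produced with the TFLite Converter (mirroring the descriptions above), the helper below applies the corresponding converter settings; the saved-model path and the representative dataset are placeholders.

```python
import tensorflow as tf

def convert(saved_model_dir, scheme, representative_dataset=None):
    """Post-training quantisation with the TFLite Converter (illustrative helper)."""
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    if scheme == "FP16":        # fp16 weights, fp32 fallback where unsupported
        converter.target_spec.supported_types = [tf.float16]
    elif scheme == "DR8":       # 8-bit weights, dynamically quantised activations
        pass                    # Optimize.DEFAULT alone yields dynamic-range quantisation
    elif scheme == "FX8":       # int8 kernels with float fallback and float I/O
        converter.representative_dataset = representative_dataset
    elif scheme == "FFX8":      # full int8, integer-only operations and I/O
        converter.representative_dataset = representative_dataset
        converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
        converter.inference_input_type = tf.int8
        converter.inference_output_type = tf.int8
    return converter.convert()

# Hypothetical calibration generator yielding batches of preprocessed inputs.
def rep_data():
    for _ in range(100):
        yield [tf.random.uniform((1, 224, 224, 3))]

tflite_model = convert("saved_model/", "FFX8", rep_data)
```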

6.2 Application Scenarios, Models, and Tasks

In the following part, we outline four discrete application use cases, which form the bedrock of our experimental evaluation. These application scenarios represent diverse real-world settings in which our research findings will be tested and validated, providing valuable insights into the effectiveness of our proposed formulation and methodology.
Notably, the first two scenarios pertain to the execution of a single DNN, while the latter two involve the execution of multiple DNNs in parallel, affording us the opportunity to assess performance outcomes in instances where dependencies among multiple DNNs exist. This dichotomy is instrumental in affording a comprehensive evaluation of our methodology’s versatility, applicability, and scalability. We define specific SLOs for each use case and showcase the list of models to be considered during evaluation, along with their device-independent evaluation, which includes (a) the accuracy of both the original and quantised variants, (b) the computational workload in FLOPs, and (c) the model size in terms of parameters.

6.2.1 Use Case #1 (UC1).

In our first single-DNN scenario, we examine the practical application of real-time image classification. In this setting, the camera of a mobile device continuously captures frames that require prompt and accurate recognition. The term “real time” is qualified by a temporal restriction mandating that the maximum permissible latency is 41.67 ms, underscoring the necessity to uphold a recognition rate of no less than 24 frames per second. The principal objectives of this use case encompass the joint maximisation of accuracy and throughput. Mathematically, this MOO problem comprises two objective functions and a single constraint as follows:
\[\begin{gather*} \begin{aligned}&& \max &&& A(x), TP(x) \\ && \text{subject to} &&& \text{max} \, L(x) \le 41.67\ \text{ms.} \end{aligned} \end{gather*}\]
For UC1 we used the ImageNet-1k dataset [47]. Table 2 lists the eight models under consideration, which are drawn from four distinct families: MobileNets [48], EfficientNets [57], RegNets [46], and MobileViTs [39]. The rationale behind this extensive model selection is to ensure a well-rounded exploration of compact and mobile-friendly architectures that span a broad spectrum, encompassing both conventional CNNs and emerging Transformer-based models. Each of these architectural paradigms exhibits unique characteristics and design principles.
Table 2. UC1 Models (Image Classification on ImageNet-1k)

Architecture       | Image Size | FLOPs  | #Params | Top-1 Accuracy (%): FP32 / FP16 / DR8 / FX8 / FFX8
MobileNet V2 1.0   | 224 × 224  | 0.60 G | 3.49 M  | 71.92 / 71.96 / 71.65 / 71.28 / 71.26
RegNetY 008        | 224 × 224  | 1.60 G | 6.25 M  | 74.28 / 74.28 / 74.18 / 74.45 / 74.47
MobileViT XS       | 256 × 256  | 2.10 G | 2.31 M  | 74.61 / 74.61 / – / – / –
EfficientNet Lite0 | 224 × 224  | 0.77 G | 4.63 M  | 75.19 / 75.23 / 75.14 / 75.09 / 75.11
MobileNet V2 1.4   | 224 × 224  | 1.16 G | 6.09 M  | 75.66 / 75.68 / 75.47 / 75.41 / 75.45
RegNetY 016        | 224 × 224  | 3.23 G | 11.18 M | 76.76 / 76.76 / 76.62 / 76.92 / 76.84
MobileViT S        | 256 × 256  | 4.06 G | 5.57 M  | 78.31 / 78.30 / – / – / –
EfficientNet Lite4 | 300 × 300  | 5.11 G | 12.95 M | 80.81 / 80.80 / 80.78 / 80.69 / 80.71
It is worth noting that we also contemplated the inclusion of higher-accuracy models, such as NASNet and ConvNeXt, in our analysis. However, our assessment revealed that these models failed to meet the stipulated latency constraint. Consequently, they were excluded from our study to maintain adherence to the predefined performance criteria.

6.2.2 Use Case #2 (UC2).

In our second single-DNN scenario, we study the task of text classification, with a particular emphasis on the memory requirements of the models. To this end, we impose a memory constraint, stipulating that the executing DNN’s maximum memory footprint must not exceed 90 MB. The objectives of this use case revolve around three critical factors: minimising the average latency, reducing the model size, and maximising accuracy. Mathematically, this MOO problem encompasses three objective functions and a singular constraint:
\[\begin{gather*} \begin{aligned}&& \min &&& \overline{L}(x), S(x) \\ && \max &&& A(x) \\ && \text{subject to} &&& MF(x) \le 90\ \text{MB}. \end{aligned} \end{gather*}\]
For UC2, we obtained three Transformer models pre-trained on various large datasets, including Reddit comments and S2ORC citation pairs, and subsequently fine-tuned them on Emotions [50], a dataset comprising English Twitter messages that is employed for the task of classifying input sequences into six distinct emotions. We adopted the dataset’s split configuration, which allocated 16k samples for training, 2k for validation, and 2k for testing. The reported top-1 accuracy corresponds to the dataset’s test set. The selected models, detailed in Table 3, encompass the traditional BERT architecture in a lightweight version, alongside two mobile-grade models: XtremeDistil [41] and MobileBERT [55]. The letter “L” in each model’s name stands for the number of Transformer layers and “H” stands for the hidden dimension. In preparation for training, we further optimised BERT and XtremeDistil to enhance mobile-friendliness by replacing the GELU activation function with ReLU and substituting Layer Normalisation with Batch Normalisation [42].

6.2.3 Use Case #3 (UC3).

In our first multi-DNN scenario, we employ two DNNs for the purpose of scene recognition. One DNN processes and classifies images, while the other processes audio data to identify sounds from the device’s surroundings. The two models run concurrently, and their outputs are jointly used to determine the specific scene within which the mobile device is situated.
In this scenario, we seek to minimise both the average latency and its standard deviation, while simultaneously maximising the attained accuracy. We impose two latency constraints on each task, mandating that (a) the average latency remains consistently below 100 ms to ensure near-real-time responsiveness and (b) the standard deviation of latency stays below 10 ms for minimal fluctuations. The inclusion of the latency’s standard deviation aims to minimise performance variability, which remains an open challenge for on-device inference [66]. Mathematically, this MOO problem is formulated as follows:
\[\begin{gather*} \begin{aligned}&& \min &&& \overline{L}_i(x), \sigma _{L_{i}}(x) \\ && \max &&& A_i(x), & i=1,2 \\ && \text{subject to} &&& \overline{L}_i(x) \le 100\ \text{ms}, \sigma _{L_{i}}(x) \le 10\ \text{ms.} \end{aligned} \end{gather*}\]
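In the multi-DNN setting the constraints apply to every model of the joint design. The sketch below illustrates how a joint design could be screened for feasibility before its objectives are compared; the latency samples and helper names are illustrative assumptions rather than CARIn's implementation.

```python
import statistics

def model_stats(latencies_ms):
    """Summarise one model's profiled latencies (mean and standard deviation)."""
    return statistics.mean(latencies_ms), statistics.stdev(latencies_ms)

def feasible_joint_design(latencies_per_model, avg_budget_ms=100.0, std_budget_ms=10.0):
    """A joint design is feasible only if *every* model meets both latency constraints."""
    for lat in latencies_per_model:
        avg, std = model_stats(lat)
        if avg > avg_budget_ms or std > std_budget_ms:
            return False
    return True

# Illustrative profiled latencies (ms) for the vision and audio models of one joint design.
vision_lat = [62.0, 64.1, 61.5, 63.8, 62.7]
audio_lat  = [88.3, 90.2, 87.9, 89.5, 91.0]
print(feasible_joint_design([vision_lat, audio_lat]))  # True under the assumed numbers
```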
Table 4 presents the models for each task. For the vision task, we fine-tuned three EfficientNet Lite models on the MIT Indoor Scenes dataset [45], which includes 67 classes and 100 images per class (80 for training and 20 for testing). We report the top-1 accuracy on the test set. For the audio task, we use YAMNet, which is trained on the AudioSet dataset [14] for multi-label classification. The dataset consists of 521 sound events (classes) and 18k samples. We report the mean average precision on the validation set. YAMNet’s input waveform can vary in length. In our experiments, we use the model’s minimum possible length of 975 ms, which corresponds to 15,600 input samples and a total workload of 0.14 GFLOPs.
DL Task: Text Classification on Emotions. The accuracy columns report Top-1 Accuracy (%) per model variant.
| Architecture         | Sequence Length | FLOPs  | #Params | FP32  | FP16  | DR8   | FX8   | FFX8  |
| BERT-L2-H128         | 64              | 0.05 G | 4.31 M  | 92.10 | 92.10 | 91.90 | 91.75 | 91.75 |
| XtremeDistil-L6-H256 | 64              | 0.63 G | 12.57 M | 93.30 | 93.30 | 93.20 | 93.15 | 93.20 |
| MobileBERT-L24-H512  | 64              | 2.66 G | 24.33 M | 93.80 | 93.80 | 93.80 | 93.65 | 94.10 |
Table 3. UC2 Models
Accuracy per model variant: Top-1 Accuracy (%) for scene classification, mean average precision for audio classification.
Scene Classification on MIT Indoor Scenes:
| Architecture       | Input Size | FLOPs  | #Params | FP32   | FP16   | DR8    | FX8   | FFX8  |
| EfficientNet Lite0 | 224 × 224  | 0.59 G | 3.44 M  | 69.78  | 69.70  | 68.96  | 69.18 | 69.18 |
| EfficientNet Lite2 | 260 × 260  | 1.51 G | 4.87 M  | 76.72  | 76.72  | 77.16  | 77.69 | 77.54 |
| EfficientNet Lite4 | 300 × 300  | 4.57 G | 11.76 M | 79.33  | 79.33  | 79.18  | 79.78 | 79.48 |
Audio Classification on AudioSet:
| YAMNet             | 15,600     | 0.14 G | 3.75 M  | 0.3756 | 0.3757 | 0.3620 | –     | –     |
Table 4. UC3 Models

6.2.4 Use Case #4 (UC4).

In our second multi-DNN scenario, we deploy three distinct models designed for facial attribute prediction tasks, namely gender, age, and ethnicity estimation. These models are conceptualised as the second stage of a face detection and attribute prediction pipeline, wherein they operate concurrently on the same set of input images. As such, it is imperative for these models to adhere to stringent latency constraints to ensure minimal impact on the overall pipeline. UC4’s objectives revolve around the collective optimisation of five key metrics for each model, specifically average latency, standard deviation of latency, size, memory footprint, and accuracy, all while adhering to a maximum latency threshold of 10 ms. Formally:
\[\begin{gather*} \begin{aligned}&& \min &&& \overline{L}_i(x), \sigma _{L_{i}}(x), S_i(x), MF_i(x) \\ && \max &&& A_i(x), & i=1,2,3 \\ && \text{subject to} &&& \text{max} \, {L}_i(x) \le 10\ \text{ms.} \end{aligned} \end{gather*}\]
In UC4, the training data are sourced from the UTKFace dataset [81]. To ensure relevance to real-time applications, the dataset is filtered to retain samples corresponding to the age range of 18–75. Consequently, the utilised dataset comprises 18.6k facial images, partitioned into training, validation, and testing sets with a ratio of 72/8/20, respectively. The employed models leverage MobileNetV2 as the backbone architecture, extracting 576 features of size 4×4, which are used for predicting the outcomes across the three distinct facial attribute prediction tasks. Notably, UC4 stands as the sole task within our study that incorporates batching during inference. Specifically, the models are configured with a batch size of 4, a choice motivated by the common case where the preceding face detection component identifies multiple faces within a single image. Table 5 details the attained accuracy metrics for each task on the filtered dataset’s test set: binary accuracy for gender recognition, mean absolute error for age recognition, and top-1 accuracy for ethnicity recognition across 5 output classes.
DL Task: Facial Attribute Prediction on UTKFace. Accuracy per model variant: binary accuracy (%) for gender, mean absolute error for age, Top-1 accuracy (%) for ethnicity.
| Architecture   | Image Size | FLOPs  | #Params | FP32  | FP16  | DR8   | FX8   | FFX8  |
| GenderNet-MNV2 | 62 × 62    | 0.04 G | 0.66 M  | 95.12 | 94.95 | 94.90 | 94.79 | 94.90 |
| AgeNet-MNV2    | 62 × 62    | 0.04 G | 0.66 M  | 5.976 | 5.974 | 5.964 | 5.947 | 5.923 |
| EthniNet-MNV2  | 62 × 62    | 0.04 G | 0.66 M  | 78.17 | 78.04 | 78.55 | 79.30 | 79.14 |
Table 5. UC4 Models

6.3 Mobile Devices

In our study, we have selected three smartphones for our evaluation: Google Pixel 7, Samsung Galaxy S20 FE, and Samsung Galaxy A71. These devices have been deliberately chosen to represent distinct categories within the modern mobile phone landscape. A71 serves as an archetype of a mid-tier device, while S20 and P7 exemplify the high-end category, showcasing state-of-the-art features and cutting-edge technology. A detailed overview of the specifications and processing capabilities of these smartphones is shown in Table 6.
| Device | Google Pixel 7           | Samsung Galaxy S20 FE    | Samsung Galaxy A71           |
| Launch | 2022, October            | 2020, October            | 2020, January                |
| SoC    | Tensor G2                | Exynos 990               | Snapdragon 730               |
| CPU    | 2 × 2.85 GHz Cortex-X1,  | 2 × 2.73 GHz Exynos M5,  | 2 × 2.20 GHz Kryo 470 Gold,  |
|        | 2 × 2.35 GHz Cortex-A76, | 2 × 2.50 GHz Cortex-A76, | 6 × 1.80 GHz Kryo 470 Silver |
|        | 4 × 1.80 GHz Cortex-A55  | 4 × 2.00 GHz Cortex-A55  |                              |
| GPU    | Mali-G710 MP7 @850 MHz   | Mali-G77 MP11 @800 MHz   | Adreno 618 @700 MHz          |
| NPU    | Tensor Processing Unit   | \(\checkmark\)           | Hexagon Tensor Accelerator   |
| RAM    | 8 GB @3200 MHz           | 6 GB @2750 MHz           | 6 GB @1866 MHz               |
| TDP    | 7 W                      | 9 W                      | 5 W                          |
Table 6. Target Devices
Each of the three devices is equipped with its own NPU. Concretely, P7 incorporates a custom mobile-oriented TPU; S20 features the EDEN API, which grants access to the Exynos NPU for fixed-point models and specialised GPU kernels for floating-point models; and, last, A71 hosts the Hexagon Tensor Accelerator, a dedicated compute engine for fixed-point CNNs. Additionally, among these three devices, only A71 offers access to the device’s DSP for DNN inference. This results in the following compute engine sets for each device:
\begin{equation*} \mathcal {CE}_{\text{P7}} = \mathcal {CE}_{\text{S20}} = \lbrace \text{CPU}, \text{GPU}, \text{NPU}\rbrace \end{equation*}
\begin{equation*} \mathcal {CE}_{\text{A71}} = \lbrace \text{CPU}, \text{GPU}, \text{NPU}, \text{DSP} \rbrace . \end{equation*}

6.4 Profiling Details

In this section, we present the available configuration options for each compute engine, \(op(ce)\), within the context of an execution plan’s tunable hardware parameters, which are employed by CARIn during the profiling phase of the device-specific objective functions. In the case of CPUs, we have the capability to tune the number of threads employed for multithreading and utilise the XNNPACK delegate, which serves as a back-end for the CPU, leveraging the XNNPACK library to provide highly optimised implementations for 32- and 16-bit floating-point computations, as well as symmetrically quantised DNN operations. Since all the devices under consideration are equipped with eight CPU cores, the set of tunable options can be defined as follows:
\begin{equation*} op(\text{CPU}) = \lbrace N_{\text{threads}}, \text{XNNPACK} \rbrace , \end{equation*}
where \(N_{\text{threads}} = \lbrace 1, 2, 4, 8\rbrace\) and \(\text{XNNPACK} = \lbrace \text{TRUE}, \text{FALSE}\rbrace\), resulting in eight distinct CPU execution combinations. However, for GPUs and NPUs, CARIn exclusively employs fp16 arithmetic when feasible, as it offers reduced latency without compromising accuracy:
\begin{equation*} op(\text{GPU}) = op(\text{NPU}) = \lbrace \text{precision = fp16} \rbrace . \end{equation*}
Last, it should be noted that the DSP does not expose any configurable parameters, and thus its set of options can be defined as an empty set:
\begin{equation*} op(\text{DSP}) = \lbrace \rbrace . \end{equation*}
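Because the tunable space is small, it can be enumerated exhaustively. The following sketch expands the per-engine options above into the concrete execution configurations that would be profiled on each device; the dictionary encoding is an assumed representation for illustration, not CARIn's API.

```python
from itertools import product

# Tunable options per compute engine, mirroring the sets defined above.
OPTIONS = {
    "CPU": [{"threads": n, "xnnpack": x} for n, x in product([1, 2, 4, 8], [True, False])],
    "GPU": [{"precision": "fp16"}],
    "NPU": [{"precision": "fp16"}],
    "DSP": [{}],  # no configurable parameters
}

# Compute engines available on each target device, as listed above.
ENGINES = {
    "P7":  ["CPU", "GPU", "NPU"],
    "S20": ["CPU", "GPU", "NPU"],
    "A71": ["CPU", "GPU", "NPU", "DSP"],
}

def execution_configs(device):
    """Enumerate all (engine, options) pairs that would be profiled on a device."""
    for ce in ENGINES[device]:
        for opt in OPTIONS[ce]:
            yield ce, opt

print(sum(1 for _ in execution_configs("A71")))  # 8 CPU + 1 GPU + 1 NPU + 1 DSP = 11
```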
In terms of the profiling process, we initiate each execution configuration with five warm-up runs to stabilise the target processor’s performance and reduce variability. Subsequently, to gather statistically significant latency and energy consumption values, we execute each experiment 100 times. Last, to maintain consistent device temperatures and mitigate the risk of overheating, we incorporate a device idle period of 2 minutes prior to commencing the next set of runs.
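Expressed as code, the measurement protocol could take the following form, where run_inference is a placeholder callable that executes one inference under the configuration being profiled; energy measurement is omitted for brevity, and the function is a sketch of the protocol rather than CARIn's profiler.

```python
import time

def profile_latency(run_inference, warmup=5, runs=100, cooldown_s=120):
    """Measure per-inference latency following the protocol described above."""
    for _ in range(warmup):          # stabilise the target processor
        run_inference()
    latencies_ms = []
    for _ in range(runs):            # gather a statistically significant sample
        t0 = time.perf_counter()
        run_inference()
        latencies_ms.append((time.perf_counter() - t0) * 1e3)
    time.sleep(cooldown_s)           # idle period before the next configuration
    return latencies_ms
```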

7 Results

This section presents the outcomes of our comprehensive evaluation of CARIn. Our findings provide valuable insights into the effectiveness of our framework in mitigating the challenges stemming from device heterogeneity and runtime fluctuations across both single- and multi-DNN scenarios, while concurrently meeting predetermined SLOs.

7.1 Designs

Our initial assessment focuses on evaluating the performance of CARIn’s designs within each available state, i.e., a single processor for single-DNN applications or a combination of processors for multi-DNN applications.

7.1.1 Comparison Methods.

To comprehensively evaluate CARIn’s performance against existing methodologies, we employ three simple empirical baselines and additionally compare against our earlier work, OODIn. The baselines reflect common deployment practices and serve to establish a minimum performance expectation for real-world applications.
Single-architecture baseline: The effectiveness of CARIn is contrasted with the traditional approach of adopting a single model architecture, possibly accompanied by its quantised variants. This paradigm typically selects the model that excels under a single criterion, such as highest accuracy, lowest memory footprint, or smallest size.
Transferred baseline: To assess the extent to which CARIn addresses device heterogeneity, we utilise the transferred baseline, where the MOO problem is solved on a specific device, and the resultant designs are then applied to different devices. This baseline, being device-agnostic, overlooks the inherent characteristics and limitations of individual devices.
Multi-DNN-unaware baseline: The third baseline assesses the efficacy of our framework in handling concurrent model executions, particularly its capability to generate optimal model-to-processor mappings for multi-DNN workloads. The multi-DNN-unaware baseline decomposes a multi-DNN MOO problem into M independent single-DNN problems, solves each one separately, and then combines the resulting solutions.
OODIn [61]: In our prior research, we utilised the weighted sum method as a means to address MOO problems. More precisely, OODIn aims to maximise the weighted sum obtained from the normalised objective functions. This approach fails to account for the inherent scale discrepancies among the diverse objective functions, particularly evident in DL metrics. While the utilisation of assigned weights may potentially mitigate this limitation, it necessitates prior knowledge of the statistical characteristics of the functions involved. When dealing with multi-DNN configurations, OODIn would operate as the multi-DNN-unaware baseline presented above, differing only in its utilisation of the weighted sum method instead of computing optimalities.
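To illustrate the scale-discrepancy issue in isolation, the toy sketch below scalarises two objectives with equal weights and deliberately omits any normalisation; it is not OODIn's actual formulation, which normalises the objectives first, but it shows how a metric with a larger numeric range can dominate the ranking unless the weights encode prior knowledge of each metric's statistics.

```python
def weighted_sum(objectives, weights):
    """Scalarise a design: score = sum_i w_i * f_i(x)."""
    return sum(w * f for w, f in zip(weights, objectives))

# Illustrative values: accuracy (%) to maximise, latency (ms) to minimise (negated).
design_a = (75.1, -30.0)   # more accurate but slower
design_b = (71.9, -9.5)    # less accurate but faster
equal_w = (0.5, 0.5)
print(weighted_sum(design_a, equal_w), weighted_sum(design_b, equal_w))
# 22.55 vs 31.2: latency's larger numeric range decides the ranking
# despite the 3.2-point accuracy gap.
```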

7.1.2 Single-DNN Execution.

Figures 3 and 4 delineate the benefits of CARIn in relation to the optimality metric for the two single-DNN use cases. We compare against two single-architecture baselines, specifically using the model with the highest accuracy (best accuracy, B-A) and the model with the smallest size (best size, B-S), the transferred baselines from the other two devices, collectively designated as \(T_{A71}\), \(T_{S20}\), and \(T_{P7}\), and OODIn. The initial designs \(d_0\) for each device are prominently indicated, affirming the presence of device heterogeneity. Patterned bars in the figures highlight instances where certain baselines fail to yield a solution due to non-compliance with the problem’s constraints (denoted by !) or inapplicability to different devices (denoted by N/A).
Fig. 3. UC1 evaluation.
Fig. 4. UC2 evaluation.
Takeaways: Our framework achieves a substantial improvement, with an average gain of 1.19× and 1.57× (up to 1.46× and 1.92×) over the B-A and B-S baselines, respectively. It is noteworthy that these baselines, primarily designed for SOO problems, prove inadequate in capturing the multi-objective nature inherent in DL applications. Regarding the transferred baselines, CARIn achieves an average improvement of 1.17× in optimality (up to 1.84×). Importantly, it not only enhances overall optimality but also exhibits improvements across all considered objective functions. Specifically, for UC1, we observe an average increase of 0.156 units in accuracy and a 32.7% boost in throughput, and for UC2, observable improvements include an average reduction of 2.8 MB in model size and a notable 19.9% latency speedup at the same accuracy level. Compared to OODIn, an optimality increase of 1.5× is achieved on average (up to 1.99×).

7.1.3 Multi-DNN Execution.

Figures 5 and 6 show the benefits of the CARIn framework concerning the optimality metric in the context of the two multi-DNN use cases. In these scenarios, we compare against the multi-DNN-unaware baseline, the transferred baselines from other devices, and OODIn. The horizontal axis illustrates combinations of processors for each device. In the case of UC3, all possible combinations are presented, while for UC4, due to the considerable number of combinations, we organise and display them based on optimality, showcasing the top 5 for each device.
Fig. 5. UC3 evaluation.
Fig. 6. UC4 evaluation.
Takeaways: In the context of UC3, CARIn delivers a significant average optimality improvement of 1.47× across devices (up to 3.24×) over the multi-DNN-unaware baseline and an even more substantial gain of 1.87× (up to 4.06×) over the transferred baselines. Notably, these enhancements extend across all specified objectives. Compared to OODIn, we observe a 2.83× improvement in optimality (up to 10.69×). Meanwhile, UC4 poses a distinctive challenge, where the majority of baselines struggle to produce a viable solution, primarily due to their inability to satisfy the stringent latency constraints inherent in this use case, underscoring its intricacy. Since only a single model is employed per task in UC4, the baselines that do not fail reach performance parity with our framework, which emphasises the importance of accommodating a diverse pool of models for each task.

7.2 Runtime Adaptation

In this section, we assess the responsiveness of the RM and its adept utilisation of designs generated by RASS to dynamically adapt to a series of runtime fluctuations. For our evaluation, we target the UC1 single-DNN scenario on S20 and the UC3 multi-DNN scenario on A71. Through this experiment, we aim to demonstrate the efficacy of the RM module to seamlessly respond to dynamic runtime conditions, thereby validating its role in enhancing the adaptability and performance of CARIn across diverse use cases and devices.

7.2.1 Single-DNN Execution.

Table 7 presents the selected designs and switching policy, while Figure 7 depicts the behaviour of RM in the single-DNN scenario. The initial design for UC1 on S20, \(d_0\), involves the utilisation of EfficientNet Lite0 FFX8 on the CPU with four threads and the enabled XNNPACK library, resulting in 75.11% accuracy and a 16-MB memory footprint. As the CPU gradually becomes overloaded, the throughput experiences a decline until RM identifies an alternative design as the current highest-performing solution. The new configuration, \(d_1\), entails the use of EfficientNet Lite0 FP16 on the GPU. Following further inferences, RM triggers another switch due to an impending memory issue. In this instance, RASS has identified the memory-efficient design, \(d_{\text{m}}\), to involve the device’s CPU.
| \(\boldsymbol {c_{\text{CPU}}}\) | \(\boldsymbol {c_{\text{GPU}}}\) | \(\boldsymbol {c_{\text{NPU}}}\) | \(\boldsymbol {c_{\text{m}}}\) | \(\boldsymbol {d_{\text{new}}}\) |
| F | – | – | F | \(d_0 = \left\lt \text{EfficientNet Lite0 FFX8}, \text{CPU}_{4,\text{T}} \right\gt\) |
| T | F | – | F | \(d_1 = \left\lt \text{EfficientNet Lite0 FP16}, \text{GPU} \right\gt\) |
| T | T | F | F | \(d_2 = \left\lt \text{MobileNet V2 1.4 FP16}, \text{NPU} \right\gt\) |
| T | T | T | F | \(d_{\text{w}} = \left\lt \text{MobileNet V2 1.0 FX8}, \text{CPU}_{4,\text{T}} \right\gt\) |
| T | T | T | T | \(d_{\text{wm}} \equiv d_{\text{w}}\) |
| – | – | – | T | \(d_{\text{m}} = \left\lt \text{EfficientNet Lite0 FX8}, \text{CPU}_{8,\text{F}} \right\gt\) |
Table 7. Selected Designs and Switching Policy for the Single-DNN UC1 Scenario on S20 (“–” denotes a condition that does not participate in the corresponding rule)
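Because the designs and the switching policy are precomputed, the Runtime Manager's decision reduces to a lookup over the condition flags of Table 7. The sketch below is one possible encoding of that lookup, assuming the ordering of checks shown and reading CPU_{4,T} as “4 threads, XNNPACK enabled”; it is illustrative rather than CARIn's exact implementation.

```python
# Predetermined designs produced by RASS for UC1 on S20 (names only; see Table 7).
DESIGNS = {
    "d0": "EfficientNet Lite0 FFX8 on CPU (4 threads, XNNPACK on)",
    "d1": "EfficientNet Lite0 FP16 on GPU",
    "d2": "MobileNet V2 1.4 FP16 on NPU",
    "dw": "MobileNet V2 1.0 FX8 on CPU (4 threads, XNNPACK on)",
    "dm": "EfficientNet Lite0 FX8 on CPU (8 threads, XNNPACK off)",
}

def select_design(cpu_loaded, gpu_loaded, npu_loaded, mem_issue):
    """Runtime switch as a pure lookup over the condition flags (assumed check order)."""
    if mem_issue and not (cpu_loaded and gpu_loaded and npu_loaded):
        return DESIGNS["dm"]      # memory-efficient design
    if not cpu_loaded:
        return DESIGNS["d0"]      # highest-performing design
    if not gpu_loaded:
        return DESIGNS["d1"]
    if not npu_loaded:
        return DESIGNS["d2"]
    return DESIGNS["dw"]          # worst-case design (d_wm coincides with d_w)

print(select_design(cpu_loaded=True, gpu_loaded=False, npu_loaded=False, mem_issue=False))
```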
Fig. 7. CARIn’s runtime behaviour targeting the single-DNN UC1 scenario on S20.
Takeaways: It is worth highlighting that despite modifications in the execution plan, our framework consistently upholds accuracy levels, even when employing the memory-efficient design. This steadfast commitment to preserving user Quality of Experience (QoE) underscores CARIn’s resilience in the face of dynamic alterations.

7.2.2 Multi-DNN Execution.

Table 8 and Figure 8 correspond to the multi-DNN scenario. In the context of UC3, where two models with distinct workloads are employed, CARIn recognises the heavier workload associated with the second task and acknowledges that this specific task is primarily responsible for triggering the switching mechanisms. The figure illustrates the average latency, standard deviation of latency, and accuracy for the second task, as well as the combined memory footprint of both models. UC3 involves the processing of audio data, introducing the potential use of the device’s DSP for data capture and processing. Given the likelihood of DSP overload during DNN inference, suppose that the highest-performing GPU-based design, \(d_1\), is currently employed with EfficientNet Lite2 FX8. However, due to the impending threat of a memory issue arising from this design’s memory footprint, RM opts to switch to the memory-efficient design, \(d_{\text{m}}\), resulting in a saving of 92 MB of RAM. Subsequently, as RM observes a reduction in DSP overload, it triggers a switch to the highest-performing design, \(d_0\), characterised by lower latency and reduced memory requirements. In the event of a potential DSP overload resurgence, RM strategically avoids reverting to the GPU-based design to mitigate previous concerns of excessive memory usage. Instead, it selects the next design in line, transferring the second model to the CPU while maintaining accuracy levels.
| \(\boldsymbol {c_{\text{DSP}}}\) | \(\boldsymbol {c_{\text{GPU}}}\) | \(\boldsymbol {c_{\text{CPU}}}\) | \(\boldsymbol {c_{\text{m}}}\) | \(\boldsymbol {d_{\text{new}}}\) |
| F | – | – | F | \(d_0 = \lbrace \left\lt \text{YAMNet FP16}, \text{CPU}_{2,\text{F}} \right\gt , \left\lt \text{EfficientNet Lite2 FFX8}, \text{DSP} \right\gt \rbrace\) |
| T | F | – | F | \(d_1 = \lbrace \left\lt \text{YAMNet FP16}, \text{CPU}_{2,\text{F}} \right\gt , \left\lt \text{EfficientNet Lite2 FX8}, \text{GPU} \right\gt \rbrace\) |
| T | T | F | F | \(d_2 = \lbrace \left\lt \text{YAMNet FP16}, \text{CPU}_{4,\text{F}} \right\gt , \left\lt \text{EfficientNet Lite2 FFX8}, \text{CPU}_{1,\text{T}} \right\gt \rbrace\) |
| T | T | T | F | \(d_{\text{w}} = \lbrace \left\lt \text{YAMNet DR8}, \text{CPU}_{2,\text{F}} \right\gt , \left\lt \text{EfficientNet Lite0 FFX8}, \text{CPU}_{4,\text{F}} \right\gt \rbrace\) |
| T | T | T | T | \(d_{\text{wm}} \equiv d_{\text{w}}\) |
| – | – | – | T | \(d_{\text{m}} \equiv d_{\text{w}}\) |
Table 8. Selected Designs and Switching Policy for the Multi-DNN UC3 Scenario on A71 (“–” denotes a condition that does not participate in the corresponding rule)
Fig. 8. CARIn’s runtime behaviour targeting the multi-DNN UC3 scenario on A71.
Takeaways: It is important to acknowledge that CARIn may not always maintain predefined metric levels. As demonstrated in this instance, transitioning to the memory-efficient design resulted in an 8.5% decrease in accuracy and an increase in jitter. However, such occurrences are considered temporary states of urgency, with a firm expectation that they will be swiftly rectified, thereby minimising impact on user QoE. Notably, the rise in average latency or the standard deviation of latency does not significantly affect user QoE, as these metrics already meet the specified latency constraints, which precede the optimisation of the objectives.

7.2.3 Comparison with OODIn.

In our previous work, we introduced the model/processor switching technique to mitigate runtime fluctuations. However, OODIn lacks the ability to anticipate forthcoming changes in resource availability: upon detecting such an event, the MOO problem must be readjusted to the new conditions and solved again to determine the new highest-performing solution. CARIn instead solves the specified MOO problem once, prior to application initialisation, so switching to a new execution plan at runtime is instantaneous and relies on the predetermined designs and switching policy. Table 9 presents the average and maximum observed solution times of OODIn across diverse applications and devices. The solution time primarily hinges on the number of objectives and the dimensionality of the decision space \(\mathcal {X}\), which in turn depends on the number of DL tasks, the models utilised per task, the compression techniques, and the adjustable system parameters. Given that the time required for the TFLite interpreter to load on the CPU is typically around 3–4 ms, it becomes evident that re-solving the MOO problem can become a bottleneck for the application, impacting the user’s QoE.
| Decision Space Dimension | A71 Average | A71 Maximum | S20 Average | S20 Maximum | P7 Average | P7 Maximum |
| 500   | 1.45  | 2.12  | 0.55  | 1.55  | 3.64  | 7.99  |
| 2000  | 2.80  | 5.94  | 1.70  | 3.04  | 4.94  | 9.38  |
| 5000  | 6.56  | 10.46 | 4.98  | 15.97 | 7.06  | 10.09 |
| 10000 | 12.14 | 16.07 | 11.09 | 34.25 | 10.41 | 13.38 |
Table 9. OODIn’s Solving Time in Milliseconds
Since this solving time is incurred whenever a runtime issue occurs, it can become a recurring bottleneck that impedes the seamless execution of a DL application.
Aside from the time overhead incurred by repeated problem solving, OODIn also requires constant access to the entire array of considered models, necessitating their storage on the user’s device, which can impose limitations on the assortment of models and compression techniques initially considered. Our framework obviates the necessity to store all model variants, requiring only those selected by RASS. Table 10 elucidates this contrast in terms of model storage requirements for every examined use case.
| Use Case | A71 CARIn | A71 OODIn | A71 Reduction | S20 CARIn | S20 OODIn | S20 Reduction | P7 CARIn | P7 OODIn | P7 Reduction |
| UC1 | 13.83 | 276.36 | 19.98× | 34.37 | 443.10 | 12.89× | 34.19 | 443.10 | 12.96× |
| UC2 | 48.64 | 311.45 | 6.40×  | 40.98 | 311.45 | 7.60×  | 52.96 | 311.45 | 5.88×  |
| UC3 | 25.74 | 205.22 | 7.97×  | 58.70 | 205.22 | 3.50×  | 52.81 | 205.22 | 3.89×  |
| UC4 | 2.65  | 6.56   | 2.48×  | 3.95  | 6.56   | 1.66×  | 3.95  | 6.56   | 1.66×  |
Table 10. Storage Requirements of CARIn and OODIn in MB

8 Limitations and Future Directions

In spite of the challenges mitigated by CARIn, our system exhibits limitations that impede its performance when deployed in practical scenarios. First, as mentioned in Section 4.2, the computation of device-dependent metrics associated with objective functions or constraints across all candidate solutions is unsuitable for realistic mobile applications due to its substantial time requirements and the necessity of deploying entire models onto target devices, particularly within expansive decision spaces. Within the broader landscape of related studies, numerous works have harnessed performance prediction methodologies to estimate such metrics when executing DNNs on specific hardware platforms, without resorting to direct measurements [9, 19, 25, 79]. These models consider a range of inputs, encompassing (a) architectural characteristics of the DNN model such as network topology, layer configurations, and overall complexity; (b) hardware specifications including compute architecture, memory hierarchy, interconnectivity, and support for parallelism; and (c) environmental parameters like batch size, input data characteristics, runtime conditions, and temperature/power conditions. Such approaches are orthogonal to our framework and can be integrated within CARIn to provide a more expedient alternative to exhaustive profiling. In the future, the exploration of such methods is envisioned to furnish a comprehensive assessment of our framework’s performance and suitability for real-world scenarios.
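As an illustration of how such a predictor could slot into CARIn in place of exhaustive profiling, the sketch below exposes a minimal interface that maps model and hardware descriptors to an estimated latency. The linear form and all descriptor names and coefficients are assumptions for illustration only; real predictors, such as kernel-level latency models, are considerably richer.

```python
from typing import Mapping

def predict_latency_ms(model_features: Mapping[str, float],
                       hw_features: Mapping[str, float],
                       coef: Mapping[str, float],
                       bias: float = 0.0) -> float:
    """Deliberately simple stand-in for a learned latency predictor: descriptors in,
    estimated latency out, so it can replace direct measurement during design search."""
    feats = {**model_features, **hw_features}
    return bias + sum(coef.get(k, 0.0) * v for k, v in feats.items())

# Illustrative descriptors and coefficients (not fitted to any real device).
est = predict_latency_ms(
    model_features={"gflops": 0.77, "params_m": 4.63},
    hw_features={"cpu_ghz": 2.85, "threads": 4},
    coef={"gflops": 30.0, "params_m": 0.5, "cpu_ghz": -2.0, "threads": -1.0},
    bias=20.0,
)
print(f"{est:.1f} ms")
```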
An additional limitation arises from the selection of models for evaluation. In the contemporary landscape of generative AI [5, 59], the inclusion of generative models, such as autoregressive language models, becomes paramount. These models, characterised by their ability to generate outputs sequentially based on previously generated tokens, impose heightened demands [32, 76], particularly within the context of mobile environments [33]. Therefore, it is imperative to account for such intricacies when assessing the efficacy of AI frameworks intended for deployment in resource-constrained settings.

9 Conclusion

This research underscores the paramount significance of optimising the on-device execution of DNNs to meet the evolving demands of artificial intelligence applications. Building upon the foundational work of Reference [61], the presented framework, CARIn, aims to spearhead progress in this direction. While the challenges of device heterogeneity, runtime adaptation, and multi-DNN execution persist, CARIn provides a novel and comprehensive solution toward alleviating them. The integration of an expressive MOO framework and the introduction of RASS as a runtime-aware MOO solver manage to enable efficient adaptation to dynamic conditions while adhering to user-specified SLOs. RASS stands out for its ability to foresee upcoming runtime issues and generate a set of configurations that enable rapid, low-overhead adjustments in response to environmental fluctuations.

Footnotes

1. Processor here refers also to the exact configuration of a given processor, e.g., threads in a CPU or precision in a GPU.

References

[1]
Mario Almeida, Stefanos Laskaridis, Ilias Leontiadis, Stylianos I. Venieris, and Nicholas D. Lane. 2019. EmBench: Quantifying performance variations of deep neural networks across modern commodity devices. In Proceedings of the 3rd International Workshop on Deep Learning for Mobile Systems and Applications (EMDL’19). ACM, 1--6.
[2]
Mario Almeida, Stefanos Laskaridis, Abhinav Mehrotra, Lukasz Dudziak, Ilias Leontiadis, and Nicholas D. Lane. 2021. Smart at what cost? Characterising mobile deep neural networks in the wild. In Proceedings of the 21st Internet Measurement Conference (IMC). ACM, 658–672.
[3]
Maxim Berman, Leonid Pishchulin, Ning Xu, Matthew B. Blaschko, and Gérard G. Medioni. 2020. AOWS: Adaptive and optimal network width search with latency constraints. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). Computer Vision Foundation/IEEE, 11214–11223.
[4]
Halima Bouzidi, Mohanad Odema, Hamza Ouarnoughi, Mohammad Abdullah Al Faruque, and Smaïl Niar. 2023. HADAS: Hardware-aware dynamic neural architecture search for edge performance scaling. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE’23). IEEE, 1–6.
[5]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020 (NeurIPS'20). Curran Associates, Inc., 1877–1901.
[6]
Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. 2020. Once-for-all: Train one network and specialize it for efficient deployment. In Proceedings of the 8th International Conference on Learning Representations (ICLR’20). OpenReview.net.
[7]
Bohong Chen, Mingbao Lin, Rongrong Ji, and Liujuan Cao. 2023. Prioritized subnet sampling for resource-adaptive supernet training. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 9 (2023), 11108–11119.
[8]
Bart Cox, Jeroen Galjaard, Amirmasoud Ghiassi, Robert Birke, and Lydia Y Chen. 2021. Masa: Responsive multi-DNN inference on the edge. In Proceedings of the IEEE International Conference on Pervasive Computing and Communications (PerCom’21). IEEE, 1--10.
[9]
Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan, Yunqing Hu, Yiming Wu, Yangqing Jia, Peter Vajda, Matt Uyttendaele, and Niraj K. Jha. 2019. ChamNet: Towards efficient network design through platform-aware model adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’19). Computer Vision Foundation/IEEE, 11398–11407.
[10]
Piotr Dollár, Mannat Singh, and Ross B. Girshick. 2021. Fast and accurate model scaling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’21). Computer Vision Foundation/IEEE, 924–932.
[11]
Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2019. Efficient multi-objective neural architecture search via lamarckian evolution. In Proceedings of the 7th International Conference on Learning Representations (ICLR’19). OpenReview.net.
[12]
Stijn Eyerman and Lieven Eeckhout. 2008. System-level performance metrics for multiprogram workloads. IEEE Micro 28, 3 (2008), 42–53.
[13]
Hongxiang Fan, Stylianos I. Venieris, Alexandros Kouris, and Nicholas D. Lane. 2023. Sparse-DySta: Sparsity-aware dynamic and static scheduling for sparse multi-DNN workloads. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’23). ACM, 353--366.
[14]
Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio set: An ontology and human-labeled dataset for audio events. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’17). IEEE, 776–780.
[15]
Nyoman Gunantara. 2018. A review of multi-objective optimization: Methods and its applications. Cogent Engineering 5, 1 (2018), 1502242.
[16]
Junpeng Guo, Shengqing Xia, and Chunyi Peng. 2023. OPA: One-predict-all for efficient deployment. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM’23). IEEE, 1–10.
[17]
Peizhen Guo, Bo Hu, and Wenjun Hu. 2021. Mistify: Automating DNN model porting for on-device inference at the edge. In Proceedings of the 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI’21), James Mickens and Renata Teixeira (Eds.). USENIX Association, 705–719.
[18]
Hai Victor Habi, Roy H. Jennings, and Arnon Netzer. 2020. HMQ: Hardware friendly mixed precision quantization block for CNNs. In Proceedings of the 16th European Conference on Computer Vision (ECCV’20), Part XXVI, Lecture Notes in Computer Science, Vol. 12371, Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer, 448–463.
[19]
Myeonggyun Han and Woongki Baek. 2021. HERTI: A reinforcement learning-augmented system for efficient real-time inference on heterogeneous embedded systems. In Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques (PACT’21), Jaejin Lee and Albert Cohen (Eds.). IEEE, 90–102.
[20]
Rui Han, Qinglong Zhang, Chi Harold Liu, Guoren Wang, Jian Tang, and Lydia Y. Chen. 2021. LegoDNN: Block-grained scaling of deep neural networks for mobile vision. In Proceedings of the 27th Annual International Conference on Mobile Computing and Networking (MobiCom’21). ACM, 406–419.
[21]
Xiaoxi He, Xu Wang, Zimu Zhou, Jiahang Wu, Zheng Yang, and Lothar Thiele. 2023. On-device deep multi-task inference via multi-task zipping. IEEE Transactions on Mobile Computing 22, 5 (2023), 2878–2891.
[22]
Andrey Ignatov et al. 2019. AI benchmark: All about deep learning on smartphones in 2019. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW’19). IEEE, 3617--3635.
[23]
Md Shahriar Iqbal, Jianhai Su, Lars Kotthoff, and Pooyan Jamshidi. 2023. FlexiBO: A decoupled cost-aware multi-objective optimization approach for deep neural networks. Journal of Artificial Intelligence Research 77 (2023), 645–682.
[24]
Joo Seong Jeong, Jingyu Lee, Donghyun Kim, Changmin Jeon, Changjin Jeong, Youngki Lee, and Byung-Gon Chun. 2022. Band: Coordinated multi-DNN inference on heterogeneous mobile processors. In Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services (MobiSys’22), Nirupama Bulusu, Ehsan Aryafar, Aruna Balasubramanian, and Junehwa Song (Eds.). ACM, 235–247.
[25]
Fucheng Jia, Deyu Zhang, Ting Cao, Shiqi Jiang, Yunxin Liu, Ju Ren, and Yaoxue Zhang. 2022. CoDL: Efficient CPU-GPU co-execution for deep learning inference on mobile devices. In Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services (MobiSys’22), Nirupama Bulusu, Ehsan Aryafar, Aruna Balasubramanian, and Junehwa Song (Eds.). ACM, 209–221.
[26]
Andreas Karatzas and Iraklis Anagnostopoulos. 2023. OmniBoost: Boosting throughput of heterogeneous embedded devices under multi-DNN workload. In Proceedings of the 60th ACM/IEEE Design Automation Conference (DAC'23). IEEE, 1--6.
[27]
Jangryul Kim and Soonhoi Ha. 2023. Energy-aware scenario-based mapping of deep learning applications onto heterogeneous processors under real-time constraints. IEEE Transactions on Computers 72, 6 (2023), 1666–1680.
[28]
Youngsok Kim, Joonsung Kim, Dongju Chae, Daehyun Kim, and Jangwoo Kim. 2019. \(\mu\)layer: Low latency on-device inference using cooperative single-layer acceleration and processor-friendly quantization. In Proceedings of the 14th EuroSys Conference (EuroSys’19). 45:1--45:15.
[29]
Alexandros Kouris, Stylianos I. Venieris, Stefanos Laskaridis, and Nicholas D. Lane. 2023. Fluid batching: Exit-aware preemptive serving of early-exit neural networks on edge NPUs. In Proceedings of the International Conference on Computer-Aided Design (ICCAD’23).
[30]
Achintya Kundu, Laura Wynter, Rhui Dih Lee, and Luis Angel D. Bathen. 2023. Transfer-once-for-all: AI model optimization for edge. In Proceedings of the IEEE International Conference on Edge Computing and Communications (EDGE’23), Claudio A. Ardagna, Feras M. Awaysheh, Hongyi Bian, Carl K. Chang, Rong N. Chang, Flávia Coimbra Delicato, Nirmit Desai, Jing Fan, Geoffrey C. Fox, Andrzej Goscinski, Zhi Jin, Anna Kobusinska, and Omer F. Rana (Eds.). IEEE, 26–35.
[31]
Basar Kütükçü, Sabur Baidya, Anand Raghunathan, and Sujit Dey. 2022. Contention grading and adaptive model selection for machine vision in embedded systems. ACM Transactions on Embedded Computing Systems 21, 5 (2022), 55:1–55:29.
[32]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP’23). ACM, 611–626.
[33]
Stefanos Laskaridis, Kleomenis Kateveas, Lorenzo Minto, and Hamed Haddadi. 2024. MELTing point: Mobile evaluation of language transformers. Retrieved from https://arxiv.org/abs/2403.12844
[34]
Stefanos Laskaridis, Stylianos I. Venieris, Hyeji Kim, and Nicholas D. Lane. 2020. HAPI: Hardware-aware progressive inference. In Proceedings of the International Conference on Computer-Aided Design (ICCAD’20). ACM, 91:1--91:9.
[35]
Jaeseong Lee, Jungsub Rhim, Duseok Kang, and Soonhoi Ha. 2022. SNAS: Fast hardware-aware neural architecture search methodology. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, 11 (2022), 4826–4836.
[36]
Seulki Lee and Shahriar Nirjon. 2020. Fast and scalable in-memory deep multitask learning via neural weight virtualization. In Proceedings of the 18th International Conference on Mobile Systems, Applications, and Services (MobiSys’20). ACM, 175--190.
[37]
Neiwen Ling, Kai Wang, Yuze He, Guoliang Xing, and Daqi Xie. 2021. RT-mDL: Supporting real-time mixed deep learning tasks on edge platforms. In Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems (SenSys’21), Jorge Sá Silva, Fernando Boavida, André Rodrigues, Andrew Markham, and Rong Zheng (Eds.). ACM, 1–14.
[38]
Sicong Liu, Bin Guo, Ke Ma, Zhiwen Yu, and Junzhao Du. 2021. AdaSpring: Context-adaptive and runtime-evolutionary deep model compression for mobile applications. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 5, 1 (2021), 24:1–24:22.
[39]
Sachin Mehta and Mohammad Rastegari. 2022. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. In Proceedings of the 10th International Conference on Learning Representations (ICLR’22). OpenReview.net.
[40]
Santiago Miret, Vui Seng Chua, Mattias Marder, Mariano Phiellip, Nilesh Jain, and Somdeb Majumdar. 2022. Neuroevolution-enhanced multi-objective optimization for mixed-precision quantization. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO’22), Jonathan E. Fieldsend and Markus Wagner (Eds.). ACM, 1057–1065.
[41]
Subhabrata Mukherjee, Ahmed Hassan Awadallah, and Jianfeng Gao. 2021. XtremeDistilTransformers: Task transfer for task-agnostic distillation. arXiv:2106.04563 [cs.CL]. Retrieved from https://arxiv.org/abs/2106.04563
[42]
Ioannis Panopoulos, Sokratis Nikolaidis, Stylianos I. Venieris, and Iakovos S. Venieris. 2023. Exploring the performance and efficiency of transformer models for NLP on mobile devices. In Proceedings of the IEEE Symposium on Computers and Communications (ISCC’23). IEEE, 1--4.
[43]
Hishan Parry, Lei Xun, Amin Sabet, Jia Bi, Jonathon S. Hare, and Geoff V. Merrett. 2021. Dynamic transformer for efficient machine translation on embedded devices. In Proceedings of the 3rd ACM/IEEE Workshop on Machine Learning for CAD (MLCAD’21). IEEE, 1–6.
[44]
João Luiz Junho Pereira, Guilherme Antônio Oliver, Matheus Brendon Francisco, Sebastião Simões Cunha, and Guilherme Ferreira Gomes. 2022. A review of multi-objective optimization: Methods and algorithms in mechanical engineering problems. Archives of Computational Methods in Engineering 29, 4 (2022), 2285–2308.
[45]
Ariadna Quattoni and Antonio Torralba. 2009. Recognizing indoor scenes. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’09). IEEE Computer Society, 413–420.
[46]
Ilija Radosavovic, Raj Prateek Kosaraju, Ross B. Girshick, Kaiming He, and Piotr Dollár. 2020. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). Computer Vision Foundation / IEEE, 10425–10433.
[47]
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
[48]
Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18). Computer Vision Foundation/IEEE Computer Society, 4510–4520.
[49]
Victor Sanh, Thomas Wolf, and Sebastian Ruder. 2019. A hierarchical multi-task approach for learning embeddings from semantic tasks. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI’19), the 31st Innovative Applications of Artificial Intelligence Conference (IAAI’19), the 9th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI’19). AAAI Press, 6949–6956.
[50]
Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. 2018. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 3687–3697.
[51]
Wonik Seo, Sanghoon Cha, Yeonjae Kim, Jaehyuk Huh, and Jongse Park. 2021. SLO-aware inference scheduler for heterogeneous processors in edge platforms. ACM Transactions on Architecture and Code Optimization 18, 4 (2021), 43:1–43:26.
[52]
Shubhkirti Sharma and Vijay Kumar. 2022. A comprehensive review on multi-objective optimization techniques: Past, present and future. Archives of Computational Methods in Engineering 29, 7 (2022), 5605–5633.
[53]
Yechao She, Minming Li, Yang Jin, Meng Xu, Jianping Wang, and Bin Liu. 2023. On-demand edge inference scheduling with accuracy and deadline guarantee. In Proceedings of the 31st IEEE/ACM International Symposium on Quality of Service (IWQoS’23). IEEE, 1–10.
[54]
Amit Kumar Singh, Somdip Dey, Klaus McDonald-Maier, Karunakar Reddy Basireddy, Geoff V. Merrett, and Bashir M. Al-Hashimi. 2020. Dynamic energy and thermal management of multi-core mobile platforms: A survey. IEEE Design & Test 37, 5 (2020), 25–33.
[55]
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: A compact task-agnostic BERT for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20), Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (Eds.). Association for Computational Linguistics, 2158–2170.
[56]
Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. 2019. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’19). Computer Vision Foundation/IEEE, 2820–2828.
[57]
Mingxing Tan and Quoc V. Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML’19), Proceedings of Machine Learning Research, Vol. 97, Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, 6105–6114. http://proceedings.mlr.press/v97/tan19a.html
[58]
Mingxing Tan and Quoc V. Le. 2021. EfficientNetV2: Smaller models and faster training. In Proceedings of the 38th International Conference on Machine Learning (ICML'21), Proceedings Machine Learning Research, Marina Meila and Tong Zhang (Eds.). Vol. 139, PMLR, 10096–10106. http://proceedings.mlr.press/v139/tan21a.html
[59]
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv:2302.13971. Retrieved from https://arxiv.org/abs/2302.13971
[60]
Stylianos I. Venieris, Christos-Savvas Bouganis, and Nicholas D. Lane. 2023. Multiple-deep neural network accelerators for next-generation artificial intelligence systems. Computer 56, 3 (2023), 70–79.
[61]
Stylianos I. Venieris, Ioannis Panopoulos, and Iakovos S. Venieris. 2021. OODIn: An optimised on-device inference framework for heterogeneous mobile devices. In Proceedings of the IEEE International Conference on Smart Computing (SMARTCOMP’21). IEEE, 1--8.
[62]
Siqi Wang, Gayathri Ananthanarayanan, Yifan Zeng, Neeraj Goel, Anuj Pathania, and Tulika Mitra. 2019. High-throughput CNN inference on embedded ARM big.LITTLE multi-core processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39, 10 (2019), 2254–2267.
[63]
Zhehui Wang, Tao Luo, Miqing Li, Joey Tianyi Zhou, Rick Siow Mong Goh, and Liangli Zhen. 2021. Evolutionary multi-objective model compression for deep neural networks. IEEE Computational Intelligence Magazine 16, 3 (2021), 10–21.
[64]
Hao Wen, Yuanchun Li, Zunshuai Zhang, Shiqi Jiang, Xiaozhou Ye, Ye Ouyang, Ya-Qin Zhang, and Yunxin Liu. 2023. AdaptiveNet: Post-deployment neural architecture adaptation for diverse edge environments. In Proceedings of the 29th Annual International Conference on Mobile Computing and Networking (MobiCom'23). ACM, 28:1--28:17.
[65]
Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. 2019. FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’19). Computer Vision Foundation/IEEE, 10734–10742.
[66]
Carole-Jean Wu et al. 2019. Machine learning at facebook: Understanding inference at the edge. In Proceedings of the 25th IEEE International Symposium on High Performance Computer Architecture (HPCA’19). IEEE, 331–344.
[67]
Xiaofeng Wu, Jia Rao, Wei Chen, Hang Huang, Chris H. Q. Ding, and Heng Huang. 2021. SwitchFlow: Preemptive multitasking for deep learning. In Proceedings of the 22nd International Middleware Conference (Middleware’21), Kaiwen Zhang, Abdelouahed Gherbi, Nalini Venkatasubramanian, and Luís Veiga (Eds.). ACM, 146–158.
[68]
Jiyang Xie, Xiu Su, Shan You, Zhanyu Ma, Fei Wang, and Chen Qian. 2022. ScaleNet: Searching for the model to scale. In Proceedings of the 17th European Conference on Computer Vision (ECCV’22), Part XXI, Lecture Notes in Computer Science, Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (Eds.). Vol. 13681, Springer, 104–120.
[69]
Mengwei Xu et al. 2019. A first look at deep learning apps on smartphones. In Proceedings of the Conference on the World Wide Web (WWW’19). IEEE, 2125--2136.
[70]
Zhiyuan Xu, Dejun Yang, Chengxiang Yin, Jian Tang, Yanzhi Wang, and Guoliang Xue. 2023. A co-scheduling framework for DNN models on mobile and edge devices with heterogeneous hardware. IEEE Transactions on Mobile Computing 22, 3 (2023), 1275–1288.
[71]
Qizheng Yang, Tianyi Yang, Mingcan Xiang, Lijun Zhang, Haoliang Wang, Marco Serafini, and Hui Guan. 2024. GMorph: Accelerating multi-DNN inference via model fusion. In Proceedings of the 19th Conference on Computer Systems (EuroSys’24). ACM, 505--523.
[72]
Taojiannan Yang, Sijie Zhu, Chen Chen, Shen Yan, Mi Zhang, and Andrew R. Willis. 2020. MutualNet: Adaptive ConvNet via mutual learning from network width and resolution. In Proceedings of the 16th European Conference on Computer Vision (ECCV’20), Part I, Lecture Notes in Computer Science, Vol. 12346, Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer, 299–315.
[73]
Juheon Yi and Youngki Lee. 2020. Heimdall: Mobile GPU coordination platform for augmented reality applications. In Proceedings of the Annual International Conference on Mobile Computing and Networking (MobiCom’20). ACM, 35:1--35:14.
[74]
Fuxun Yu, Shawn Bray, Di Wang, Longfei Shangguan, Xulong Tang, Chenchen Liu, and Xiang Chen. 2021. Automated runtime-aware scheduling for multi-tenant DNN inference on GPU. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design (ICCAD’21). IEEE, 1--9.
[75]
Fuxun Yu, Di Wang, Longfei Shangguan, Minjia Zhang, Chenchen Liu, and Xiang Chen. 2022. A survey of multi-tenant deep learning inference on GPU. MLSys'22 Workshop on Cloud Intelligence/AIOps. arXiv:2203.09040. Retrieved from https://arxiv.org/abs/2203.09040
[76]
Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. ORCA: A distributed serving system for transformer-based generative models. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI’22). USENIX Association, 521–538.
[77]
Jiahui Yu and Thomas S. Huang. 2019. Universally slimmable networks and improved training techniques. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’19). IEEE, 1803–1811.
[78]
Mu Yuan, Lan Zhang, Zimu Zheng, Yi-Nan Zhang, and Xiang-Yang Li. 2023. MLink: Linking black-box models from multiple domains for collaborative inference. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 10 (2023), 12085–12097.
[79]
Li Lyna Zhang, Shihao Han, Jianyu Wei, Ningxin Zheng, Ting Cao, Yuqing Yang, and Yunxin Liu. 2021. nn-Meter: Towards accurate latency prediction of deep-learning model inference on diverse edge devices. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys’21), Suman Banerjee, Luca Mottola, and Xia Zhou (Eds.). ACM, 81–93.
[80]
Li Lyna Zhang, Yuqing Yang, Yuhang Jiang, Wenwu Zhu, and Yunxin Liu. 2020. Fast hardware-aware neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR Workshops’20). Computer Vision Foundation/IEEE, 2959–2967.
[81]
Zhifei Zhang, Yang Song, and Hairong Qi. 2017. Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). IEEE, 4352--4360.
[82]
Yu Zhang and Qiang Yang. 2022. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering 34, 12 (2022), 5586–5609.
[83]
Ziyang Zhang, Huan Li, Yang Zhao, Changyao Lin, and Jie Liu. 2023. BCEdge: SLO-aware DNN inference services with adaptive batching on edge platforms. arXiv:2305.01519. Retrieved from https://arxiv.org/abs/2305.01519
[84]
Ziyang Zhang, Yang Zhao, and Jie Liu. 2023. Octopus: SLO-aware progressive inference serving via deep reinforcement learning in multi-tenant edge cluster. In Proceedings of the 21st International Conference on Service-Oriented Computing (ICSOC’23), Flavia Monti, Stefanie Rinderle-Ma, Antonio Ruiz Cortés, Zibin Zheng, and Massimo Mecella (Eds.). Lecture Notes in Computer Science, Vol. 14420, Springer, 242--258.

Published In

ACM Transactions on Embedded Computing Systems, Volume 23, Issue 4, July 2024, 333 pages
EISSN: 1558-3465
DOI: 10.1145/3613607
Editor: Tulika Mitra
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 29 June 2024
Online AM: 23 May 2024
Accepted: 07 May 2024
Revised: 09 April 2024
Received: 14 November 2023
Published in TECS Volume 23, Issue 4


Author Tags

  1. Deep neural networks
  2. on-device inference
  3. service-level objectives
  4. heterogeneity
  5. runtime adaptation
  6. multi-DNN execution


Funding Sources

  • Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI PhD Fellowships
