1 Introduction

The advancement of edge computing is critically threatened by the mismatch between the growing computational needs of applications and the physical limitations that, owing to the breakdown of Dennard scaling (Zahran, 2021), prevent continual exponential growth of hardware capabilities. Spearheading this trend is the expansion of deep learning (DL) models – with their massive computational, memory, and energy burden – to resource-limited devices. Furthermore, the burden shows no signs of waning – state-of-the-art DL models for language processing, for instance, tend to double in size every three to five months (Lauriola et al., 2022).

Approximate computing (AC) deals with devising techniques that simplify and speed up computation, lowering energy usage in exchange for loosened accuracy guarantees. The AC philosophy is frequently applied to DL, where network parameter pruning, quantisation, and other techniques have been harnessed (He et al., 2017; Jiang, 2021; Yang & Liu, 2024). Such approximate DL models can then enable state-of-the-art inference on resource-constrained devices. The most promising DL approximation techniques are those that enable a dynamic trade-off between inference accuracy and resource usage by switching among different variants of the same model at runtime.

Close integration of context sensing and on-device processing renders embedded computing attractive for numerous domains, such as wearable and mobile computing, unmanned aerial vehicle-based processing, and robotics, to name a few. Yet, the inherent variability of the context means that a DL model residing on a device has to cope with ever-changing data, often captured by the device’s sensors. From the DL standpoint, the data may be characterised by varying difficulty, impacting the classification success rate when different neural network approximations are applied. For example, a user’s physical activity might be successfully classified from wearable accelerometer data by a quantised neural network as long as the user is performing a “calmer” activity, such as sitting or lying, whereas the same network would need to work in full-precision mode to recognise that the user is walking up the stairs (Machidon et al., 2021).

Dynamic approximation techniques, mentioned earlier in this section, represent a promising means to judiciously reduce the resource appetites of DL on resource-constrained platforms. However, to the best of our knowledge, a general method that would enable context-aware adaptation of DL approximation, so that a given higher-level goal is achieved, has not been presented before. The key reason is that deciding how to adapt the approximation at runtime remains challenging. First, at the time of DL model usage we have no information about the ground truth, thus we do not know whether the approximation indeed leads to correct inference or not. Second, the most appropriate approximation level in one situation might not be appropriate once the situation changes, and uncovering the most suitable approximation in a certain context is difficult. For example, an approximate model enabling spoken keyword detection on a smartphone placed in a quiet environment might become unusable once the phone is taken out on a noisy street. Finally, for different users and different circumstances the goals of the adaptation can be diverse – from achieving maximal accuracy to ensuring that the battery lasts until the usual charging time.

In this paper we develop a method for context-aware adaptation of approximated DL for resource-constrained applications. Our solution is based on control theory, more specifically model-predictive control (MPC), an advanced process-control method that optimises operation over a finite horizon. In our work, MPC adjusts the approximation of a DL model by implicitly learning, from historical data, the accuracy achieved by different approximation levels in different contextual situations. Unlike the existing approaches, our method is not limited to a particular approximation technique or domain and remains widely applicable. The method is guided by the entropy of the DL model’s softmax layer confidence, which we identify as a reliable proxy of the actual production-time inference accuracy. Finally, we expose knobs that allow setting different end-goals of the control, such as maintaining high accuracy or ensuring long battery life.

To summarise, the key contributions of our work include:

  • We formally cast the problem of controlling approximate DL to the problem of linear input-output system modelling.

  • We design a model-predictive control-based method for adaptation that learns from the existing data on model performance in the light of different contextual changes.

  • We identify the entropy of the softmax layer’s confidence as a viable proxy for the model’s accuracy when it comes to adaptation control.

  • We design a custom cost function to steer the control towards a particular end-goal. The cost function balances the needs for high-accuracy and resource savings according to user demands.

  • We extensively test the proposed method over data collected from three real-world use cases, namely, human activity recognition, spoken keyword recognition, and computer vision, and implement a fully-functioning approximate mobile DL adaptation on a ubiquitous computing device.

The evaluation of the method demonstrates that it can lead to up to 50% energy savings without any loss in accuracy in the case of on-device activity recognition and spoken keyword detection. In addition, the runtime overhead of the controller is minimal, accounting for less than 0.2 seconds on a low-end Raspberry Pi device. Our method outperforms state-of-the-art lightweight approaches and achieves competitive accuracy compared to more complex solutions. Moreover, the method enables fine-tuning of the relationship between the inference accuracy and resource usage – its Pareto front of operating points achieves a higher energy savings \(\times \) accuracy factor than any operating point achieved with a fixed approximation level. With this, we believe that our method makes DL on the edge future-proof, while also expanding the support for on-device learning to a wider range of less-capable devices.

2 Related work

In this work we aim to develop a generally-applicable system for controlling the approximation of a neural network (NN) deployed on a resource-constrained device, so as to increase the efficiency of on-device DL. Consequently, our work is grounded in multiple domains, ranging from DL compression methods to control systems theory. In the rest of the section we discuss the most relevant work from each of these domains.

Deep learning compression techniques

Among the recent efforts towards enabling efficient DL on resource-constrained devices, model compression techniques (weight pruning (He et al., 2017), weight sharing (He et al., 2017), quantisation (Jiang, 2021)) are particularly relevant, as these methods reduce the size and the computational burden of NNs for deployment in constrained environments, often only modestly sacrificing accuracy. Yet, these techniques initially allowed only one-off compression that permanently reduces the inference power of the model. The succeeding wave of research introduced adaptable compression, such as Slimmable Neural Networks (SNNs), which allow dynamic acceleration of NN inference through occasional use of a reduced number of NN parameters (Yu et al., 2018), or early-exit networks that may produce a result before reaching the final layer of the model (Scardapane et al., 2020).

Context adaptivity

Several frameworks have been designed to optimise runtime-adaptive DL compression according to the dynamic context. Among them, ApproxNet (Xu et al., 2021) and ApproxDet (Xu et al., 2020) propose the use of dynamic approximation techniques for mobile/edge video object classification and detection, respectively. These two frameworks target the case where the operation should meet accuracy-latency requirements under changing content and resource availability conditions. The context variability is narrowly defined as the changing workload and the characteristics of video content, which are particular to video streaming. In this paper, however, we develop an approach that is applicable to a range of applications and contextual factors. Mobiprox (Fabjančič et al., 2023) enables dynamic approximation on mobile devices but is limited to Android and lacks consideration of future system behaviour, unlike our predictive approach. AdaSpring (Liu et al., 2021) defines the context as time-varying hardware capabilities (e.g. storage, battery, processor) and is oriented towards satisfying predefined accuracy demands. Our approach, on the other hand, focuses on maximising the accuracy with minimum energy consumption while taking into account the changing demands of the time-varying context. The most notable difference between AdaSpring and our work lies in the selection of the most suitable approximation: AdaSpring (Liu et al., 2021) uses an evolutionary network combined with a runtime search strategy, while our work involves a more computationally efficient and deterministic control theory approach. Reinforcement learning has also shown promise for adaptive decision-making: for decentralised task offloading in edge networks (Wu & Guan, 2024), or for automating decision tasks (Ghanem et al., 2023). The methodology in these cases is fundamentally different from ours, as it relies on trial-and-error interactions with the environment, without a mathematical model.

Control theory for software system adaptation

Control theory was originally designed for the control of physical systems and engineering processes, but is nowadays pervasive, being used in a variety of domains ranging from computer engineering to sociology. Several research efforts have focused on providing software adaptivity using control-theoretical adaptation strategies. An on-line automatic resource allocation system that combines control theory and machine learning for cloud-based software services is proposed in Chen et al. (2020). Another control-theory-inspired approach, for server load balancing in a Software Defined Networking (SDN) traffic model, is proposed in Malavika and Valarmathi (2022). Model Predictive Control (MPC) has also been exploited for dynamic video streaming bitrate adaptation, in order to maximise the quality of experience while maintaining low latency (Sun et al., 2019; Kan et al., 2021).

A control theory inspired strategy for guiding the adaptation of DL approximations has not been attempted yet, although the need for it has been acknowledged (Filieri et al., 2015). One of the key obstacles is the lack of a tangible signal that can be measured to assess the runtime performance of DL. Most of the existing control system-based approaches to software adaptation rely on well-assessed performance metrics such as system throughput, response time, power utilisation, and others. However, there is a lack of formalism regarding monitoring the accuracy of DL inference during runtime. This is critical, as the operating conditions may fluctuate significantly between the training and the testing phase.

In this work, we demonstrate that control theory represents a viable approach for managing the adaptivity of approximate mobile DL. For the concept to be fully realised, however, the following challenging steps need to be taken: i) analysing the DL runtime behaviour and modelling it as a dynamic system, ii) mapping the accuracy-efficiency-context relationship into an appropriate control system problem, and iii) designing a controller cost function for dynamic DL context-driven adaptation. Solving these challenges enables the mathematical basis of control theory to support the optimal accuracy-efficiency trade-off in dynamic adaptive mobile DL systems. In this paper, we embrace state-of-the-art control systems theory and show how it can be used to express adaptive approximate DL through a linear model. We further propose a cost function balancing the accuracy, the resource consumption, and the impact of the context on the selection of an approximate DL model configuration during runtime. Finally, we present comprehensive experimental results for several different DL tasks and datasets, and we deploy a fully-functioning implementation of a control system for dynamic approximate DL management on a ubiquitous computing device.

3 Control system for approximate edge computing

In this section we develop a control system for runtime adaptation of the approximation of a deep learning model executed on a resource-constrained device. The aim of the system is to guide the execution of an application so that the optimal trade-off between resource usage and inference accuracy is struck. Characteristic of edge/mobile computing are a potentially significant contextual variability over time and the requirement that the control system optimise the adaptation according to a possibly long-term goal.

3.1 Model predictive control

To achieve the above, we start from the concept of a control system. The output of the system is measured and fed back into the controller, which decides on the most appropriate input to steer the system’s behaviour towards a desired performance (quantified by a cost function), despite external disturbances and taking into account certain constraints.

Among the available control algorithms, Model Predictive Control (MPC) is one of the most widely used and is ubiquitous in modern industrial applications. MPC is based on solving an iterative optimisation problem over a finite prediction horizon. Model predictive controllers rely on dynamic models of the system that needs to be controlled, and these models can usually be obtained through system identification. MPC determines the next control move by sampling the current state of the system and solving an optimisation problem (minimizing the cost function) over a prediction time horizon. Through the model of the system, MPC explores several potential state trajectories that emerge from the current state and finds a cost-minimizing control strategy. Then, just the first move of this control strategy is implemented and the procedure is repeated starting from the new current state (Fig. 1).

Fig. 1: Overview of the model predictive control. At each time step k, MPC solves an iterative optimisation problem over a finite prediction horizon, based on a predicted trajectory. From the identified cost-minimizing control strategy, only the first step is taken, and then the process is repeated

The standard MPC cost function \(J(u,x)\) generally balances two objectives: finding the solution that drives the system as close as possible to the desired reference (first term in (1)) while requiring the minimum control effort (second term in (1)). These two goals are achieved by minimizing the squared differences between the predicted state \(x_k\) and the reference \(ref_k\), as well as the squared differences between two consecutive inputs \(\Delta u_k\), over the entire prediction horizon N:

$$\begin{aligned} J(u,x) = \sum _{k=1}^{N} Q_k \cdot (ref_k-x_k)^2 + \sum _{k=1}^{N} R_k \cdot (\Delta u_k)^2 \end{aligned}$$
(1)

where \(Q_k,R_k \in \mathbb {R^{+}}\) are termed weighting coefficients, since they quantify the importance of the two objectives, N is the prediction horizon, k indexes the time (sampling intervals), x are the states, ref are the references that need to be tracked, u generically denotes the MPC decision vector, and \(\Delta u_k = u_{k}-u_{k-1}\) is the input slew rate.
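For concreteness, the following Python sketch evaluates (1) for one candidate input sequence over the horizon; the function and variable names are illustrative, not part of our controller implementation:

```python
import numpy as np

def mpc_standard_cost(ref, x_pred, u_seq, u_prev, Q, R):
    """Evaluate the standard MPC cost (1) for one candidate input sequence.

    ref, x_pred : length-N arrays of references and predicted states
    u_seq       : length-N array of candidate inputs u_1..u_N
    u_prev      : the input applied in the previous interval (u_0)
    Q, R        : length-N arrays of weighting coefficients
    """
    tracking_error = np.sum(Q * (ref - x_pred) ** 2)       # first term of (1)
    slew = np.diff(np.concatenate(([u_prev], u_seq)))      # Delta u_k = u_k - u_{k-1}
    control_effort = np.sum(R * slew ** 2)                 # second term of (1)
    return tracking_error + control_effort
```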

MPC is particularly suitable for controlling deep learning approximations due to the following properties. First, we discover that the dynamics of runtime DL approximations, from the point of view of inference accuracy and resource usage, can be mathematically modelled, and thus predicted, through a formal dynamic system. The context in which mobile computing is executed is highly dynamic, making it difficult to predict long-term optimal solutions. However, we observe that short-term future states can be anticipated, since there is a relatively stable relationship between the softmax layer entropy of consecutive inferences that a DL model makes (discussed in Section 3.2.1). Thus, we can apply the MPC optimisation on these shorter time horizons. Finally, we opt for MPC instead of other possible solutions (e.g. reinforcement learning) for its intuitiveness, which allows the exploitation of the relationship between the network’s performance (entropy), resource usage, and the context in a transparent and easy-to-interpret manner. Note, however, that our approach is MPC-inspired (as opposed to MPC-based), as we address scenarios where the tracked reference values for the accuracy and resource usage of a DL model need not be reachable. For example, a mobile DL model for human activity recognition (HAR) may never reach the setpoint accuracy of 90%, as some activities may be classified with more than 90% accuracy with all NN configurations, while others cannot be classified with more than 80% accuracy even with the best-performing NN configuration. Consequently, we adapt the MPC cost function to fit these types of problems more naturally.

Fig. 2: Block diagram of a system for controlling DL approximations

3.2 Mathematical modelling of approximate operation

We now concretise the abstract control system to drive approximate edge DL according to the overall accuracy and resource usage achieved during a finite horizon. The system, depicted in Fig. 2, consists of a module that describes the Context-dependent impact of approximation on the performance, the proxy metric for Assessing the performance at runtime, the cost function within the MPC controller for driving the adaptation, and the external Context sensing module that provides information on system disturbances. In the rest of the section we discuss different aspects of the proposed model in detail.

3.2.1 Measuring and monitoring operational performance

The inference accuracy and the resource usage represent the targets our control system aims for. For each of these aspects, a metric is needed that formalises it in a mathematically meaningful, yet practical way.

Evaluating the accuracy of the model during inference time is not possible without having the ground truth available. Therefore, instead of accuracy, we gauge a related metric: the confidence of the prediction. One way to assess a neural network’s confidence is by interpreting the softmax layer’s scores as probabilities. The entropy of these probabilities offers a good estimate of the performance of a neural network (Park & Simoff, 2015; Feng et al., 2018), particularly when calibrated, so that they reflect the true accuracy likelihood (A.Balanya et al., 2023). Low entropy values are usually associated with correct predictions while high entropy scores most likely indicate incorrect predictions.
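As an illustration, a minimal sketch of this entropy computation, assuming the raw logits are available:

```python
import numpy as np

def softmax_entropy(logits):
    """Entropy of the softmax scores, used as a confidence proxy.

    Low entropy: probability mass concentrated on one class (typically a
    confident, correct prediction); high entropy: the model is uncertain.
    """
    z = logits - np.max(logits)                  # shift for numerical stability
    probs = np.exp(z) / np.sum(np.exp(z))        # softmax scores as probabilities
    return -np.sum(probs * np.log(probs + 1e-12))
```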

Probabilistic multi-class classification produces a categorical distribution over predictions, representing the model’s confidence. Research shows that lower entropy typically indicates higher prediction accuracy, especially in well-calibrated models trained on homogeneous datasets (Park & Simoff, 2015; Feng et al., 2018). However, in poorly calibrated or overfitted models, low entropy may merely reflect overconfidence and accompany incorrect predictions. Metrics like the Expected Calibration Error (ECE) (A.Balanya et al., 2023) can help assess calibration, and if miscalibration is detected, methods like temperature scaling can correct it. We applied temperature scaling to our models to ensure that the entropy reliably indicates prediction accuracy and to prevent overconfidence-related misinterpretations.
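A minimal sketch of how such a temperature can be fitted on held-out validation logits; the scipy-based optimisation and all names here are illustrative choices rather than our exact calibration procedure:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(val_logits, val_labels):
    """Fit a scalar temperature T by minimising the negative log-likelihood
    of softmax(logits / T) on a held-out set (temperature scaling).

    val_logits: (N, C) array of pre-softmax scores, val_labels: (N,) ints.
    The fitted T then rescales production-time logits before computing
    the confidence entropy.
    """
    def nll(T):
        z = val_logits / T
        z = z - z.max(axis=1, keepdims=True)                   # stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(val_labels)), val_labels].mean()

    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x
```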

Previous studies, however, have shown that the entropy of the prediction increases with the deviation of the inferred sample from the training data distribution. In other words, an input sample that is very different from the samples that the network was shown during the training phase will most likely result in a high entropy of the prediction confidence. For this reason, monitoring the instantaneous confidence entropy could lead to an underestimation of the accuracy.

Fig. 3: Entropy zones reflecting inference accuracy variations for the purpose of adapting mobile DL approximation at runtime. The entropy values are collected from the softmax scores of a MobileNet V1 model trained on the UCI Human Activity Recognition (HAR) dataset for recognizing 6 different physical activities, run over a trace of a single test user performing the activities in succession. Inference accuracy differs across activities (black numbers). Instantaneous entropy values (red line) vary widely even within a single zone (representing a single ground-truth physical activity). However, averaged over \(p=4\) subsequent values (blue line), the entropy shows more uniform behaviour, allowing clearer separation of the zones (i.e. the actual activities) and thus a good reflection of the actual inference accuracy, without knowing the ground truth

A decrease or increase of accuracy over a certain time period is more relevant than the instantaneous accuracy when it comes to DL model approximation adaptation. Such a persistent change in accuracy can signal the need for a less approximated network or an opportunity to use more aggressive approximation. Thus, we track the entropy of the confidence of p successive predictions. In this manner, we are able to estimate the confidence trend, assess how challenging a given situation is for a certain approximation level, and adapt to it (Fig. 3). The value of the parameter p impacts the system’s sensitivity to changes in the context: higher values of p lead to a delayed but more precise response to contextual variations, while lower values lead to a more rapid response to context changes but are more susceptible to false alarms. Through our experiments with p, we found that the optimal range lies between 3 and 5, depending on the specific application. This balance ensures an effective trade-off between responsiveness and accuracy in adapting to context changes.
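The averaging itself is straightforward; a sketch (the class name is illustrative):

```python
from collections import deque

class EntropyTrend:
    """Average softmax entropy over the last p predictions (cf. Fig. 3).

    Higher p smooths instantaneous spikes (fewer false alarms) at the
    cost of a slower reaction to genuine context changes; in our
    experiments p between 3 and 5 worked best.
    """
    def __init__(self, p=4):
        self.window = deque(maxlen=p)

    def update(self, entropy):
        """Record one inference's entropy and return the current average."""
        self.window.append(entropy)
        return sum(self.window) / len(self.window)
```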

For the second aspect of the performance, the resource usage, we turn either to the energy consumption of each of the DL model approximation levels measured on a target device, or to the inference running time, in case direct energy measurements are not possible. In most cases, the two metrics can be considered equivalent, as the speedup achieved with a certain NN configuration translates to energy savings (Asadikouhanjani et al., 2021; Ron et al., 2022).

3.2.2 Context-aware predictive model equations

We now embrace MPC (Section 3.1) and adapt it to the problem of adjusting the mobile DL approximation level in order to achieve a certain trade-off between the inference accuracy and resource usage. Our mathematical model explicitly includes the knobs (i.e. approximate configurations of the model) and the target cost function (i.e. the balance between the accuracy and the resource usage). It also explicitly includes (possibly time-varying) context, as the context of usage can impact the accuracy of the inference of different approximation configurations.

We model the system’s dynamics with linear difference equations, where the key variables include:

  • \(p+2\) system state variables x, representing the current confidence entropy, p previous entropy values, and the current battery/energy status;

  • two inputs u, representing the approximate DL model configuration and the context’s requirements;

  • two outputs y, representing the average confidence entropy and the battery level.

The equations in the system are:

$$\begin{aligned} x[k+1] = A \cdot x[k] + B \cdot u[k] \end{aligned}$$
(2)
$$\begin{aligned} y[k] = C \cdot x[k] + D \cdot u[k] \end{aligned}$$
(3)

where k indexes time, x is the vector form of the state variables, y represents the vector form of the output and u are the dynamic system’s inputs. A is the system matrix, B is the control matrix, C is the output matrix, and D is the feed-forward matrix.

Considering the explicit definition of each of the system’s variables:

$$\begin{aligned} x[k]= \begin{bmatrix} x_1[k]\\ x_2[k]\\ x_3[k]\\ \vdots \\ x_{p+1}[k]\\ x_{p+2}[k] \end{bmatrix} = \begin{bmatrix} entropy[k]\\ entropy[k-1]\\ entropy[k-2]\\ \vdots \\ entropy[k-p]\\ energy[k] \end{bmatrix} \end{aligned}$$
(4)
$$\begin{aligned} u[k]= \begin{bmatrix} u_1[k]\\ u_2[k]\\ \end{bmatrix} = \begin{bmatrix} model[k]\\ context[k]\\ \end{bmatrix} \end{aligned}$$
(5)
$$\begin{aligned} y[k]= \begin{bmatrix} y_1[k]\\ y_2[k]\\ \end{bmatrix} = \begin{bmatrix} avg\_{entropy}[k]\\ battery[k]\\ \end{bmatrix} \end{aligned}$$
(6)

Equation 2 becomes:

$$\begin{aligned} x[k\!+\!1] \!=\! \begin{bmatrix} x_1[k+1]\\ x_2[k+1]\\ x_3[k+1]\\ \vdots \\ x_{p+1}[k+1]\\ x_{p+2}[k+1] \end{bmatrix} \!=\! \begin{bmatrix} a_{11} & \dots & a_{1 p+2}\\ a_{21} & \dots & a_{2 p+2}\\ a_{31} & \dots & a_{3 p+2}\\ \vdots & \ddots & \vdots \\ a_{p+1 1} & \dots & a_{p+1 p+2}\\ a_{p+2 1} & \dots & a_{p+2 p+2} \end{bmatrix} \cdot \begin{bmatrix} x_1[k]\\ x_2[k]\\ x_3[k]\\ \vdots \\ x_{p+1}[k]\\ x_{p+2}[k] \end{bmatrix} \!+\! \begin{bmatrix} b_{11} & b_{12}\\ b_{21} & b_{22}\\ b_{31} & b_{32}\\ \vdots & \vdots \\ b_{p+2 1} & b_{p+2 2} \end{bmatrix} \cdot \begin{bmatrix} u_1[k] \\ u_2[k] \end{bmatrix} \end{aligned}$$
(7)

Similarly, (3) can be written as:

$$\begin{aligned} y[k]= \begin{bmatrix} y_1[k]\\ y_2[k]\\ \end{bmatrix} = \begin{bmatrix} c_{11} & \dots & c_{1 p+2}\\ c_{21} & \dots & c_{2 p+2}\\ \end{bmatrix} \cdot \begin{bmatrix} x_1[k]\\ x_2[k]\\ \vdots \\ x_{p+1}[k]\\ x_{p+2}[k] \end{bmatrix} + \begin{bmatrix} d_{11} & d_{12}\\ d_{21} & d_{22}\\ \end{bmatrix} \cdot \begin{bmatrix} u_1[k] \\ u_2[k] \end{bmatrix} \end{aligned}$$
(8)

This explicit form, using matrices instead of scalars, allows us to better model the combined effect of adjusting different parameters; in addition, we can easily incorporate the known dynamics of the system and the topology of the states, inputs, and outputs, together with their known evolution in time. We opted for the matrix formulation as this approach effectively models the combined effects of multiple parameter adjustments, adheres to standard practice in state-space system equations for improved accessibility for readers familiar with control theory, and simplifies computer simulations by employing first-order difference equations. We know, for example, that the second state variable at time \(k+1\), \(x_2[k+1] = entropy[k]\), is equal to the first state variable at time k, \(x_1[k] = entropy[k]\), and likewise that the third state variable at time \(k+1\), \(x_3[k+1] = entropy[k-1]\), is equal to the second state variable at time k, \(x_2[k] = entropy[k-1]\), and so on. Similarly, we know that the inputs at time k directly affect only the entropy and the battery at time \(k+1\), and not the previous entropy values. Also, since we capture the impact of the model and the context on the entropy and battery states through the B matrix, there is no need for the D matrix, which can be omitted. The explicit way in which we compute the average entropy at the output must also be incorporated through the parameters of the matrix C. These known dynamics of the system allow us to identify some of the unknown parameters in advance:

$$\begin{aligned} x[k+1] = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1 p+2}\\ 1 & 0 & \dots & 0\\ 0 & 1 & \dots & 0\\ \vdots & \vdots & \ddots & \vdots \\ a_{p+2 1} & a_{p+2 2} & \dots & a_{p+2 p+2} \end{bmatrix} \cdot \begin{bmatrix} x_1[k]\\ x_2[k]\\ x_3[k]\\ \vdots \\ x_{p+2}[k] \end{bmatrix} + \begin{bmatrix} b_{11} & b_{12}\\ 0 & 0\\ 0 & 0\\ \vdots & \vdots \\ b_{p+2 1} & b_{p+2 2} \end{bmatrix} \cdot \begin{bmatrix} u_1[k] \\ u_2[k] \end{bmatrix} \end{aligned}$$
(9)
$$\begin{aligned} y[k]= \begin{bmatrix} 1/p & \dots & 1/p & 0 & 0\\ c_{21} & c_{22} & \dots & c_{2 p+1} & c_{2 p+2}\\ \end{bmatrix} \cdot \begin{bmatrix} x_1[k]\\ x_2[k]\\ \vdots \\ x_{p+1}[k]\\ x_{p+2}[k] \end{bmatrix} \end{aligned}$$
(10)

The remaining unknown coefficients of the A, B, and C matrices can be determined by system identification from empirical measurements. These measurements include the softmax confidence entropy and the energy consumption (or the running time) for all the approximate configurations of the DL model in each of the contexts. Using (9) and (10) and these input-output data, the unknown parameters can be identified, for example, with Matlab’s greyest function. Further details on the exact parameters used in the experiments are provided in Appendix A.
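As an illustration of the underlying grey-box idea (our experiments use Matlab’s greyest rather than this code), a single unknown row of (9) can be fitted by ordinary least squares:

```python
import numpy as np

def identify_row(states, inputs, next_values):
    """Least-squares fit of one unknown row of [A | B] in (9).

    states      : (T, p+2) recorded state vectors x[k]
    inputs      : (T, 2) recorded inputs u[k] (model configuration, context)
    next_values : (T,) next-step values of the state whose row is fitted,
                  e.g. entropy[k+1] for the first row of (9) or
                  energy[k+1] for the last row
    """
    regressors = np.hstack([states, inputs])          # [x[k], u[k]] per sample
    coef, *_ = np.linalg.lstsq(regressors, next_values, rcond=None)
    a_row, b_row = coef[:states.shape[1]], coef[states.shape[1]:]
    return a_row, b_row

# The shift rows of A (x_2[k+1] = x_1[k], ...) and the zero rows of B are
# fixed by the known structure of (9) and need not be identified.
```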

3.2.3 Custom cost function

Unlike the conventional MPC, our problem does not require reference tracking, so we replace the standard cost function (1) with our custom function (11), defined by the key goals of mobile DL: balancing the inference accuracy (reflected through the softmax confidence entropy) with the resource consumption (e.g. the battery energy expenditure) under a time-varying context.

We define our cost function as a sum of terms, each focusing on one of the particular aspects of our controller performance:

$$\begin{aligned} J(y,u,k) = Q \cdot y_1[k]^2 - R \cdot y_2[k]^2 + S \cdot (u_2[k] - u_1[k])^2 \end{aligned}$$
(11)

where \(Q,R,S \in \mathbb {R^{+}}\), k is the current control interval, and y is the vector of predicted outputs, computed from the predicted states using (10); both the predicted states and the predicted outputs are obtained through the previously discussed dynamical model of the controlled system. \(u_1\) is the MPC decision and \(u_2\) is the context, which calls for a certain approximation level of the DL model.

This cost function balances the next value of the first system output, i.e. the entropy of the prediction (first term in (11)), with the next value of the second output, i.e. the energy/battery level (second term in (11)), and with the optimal model configuration required by the current context (third term in (11)). Unlike the standard MPC cost function (1), our cost function uses outputs instead of states, since in a real-world environment the actual output is the one that we want to track and drive the adaptation on. The battery level, for example, may be influenced by many factors, such as other applications running at the same time or the external temperature. We also eliminate the second part of the standard MPC cost function (1), since the cost of excessive changes is minimal in the case of NN adaptation.

A balance between competing objectives, such as minimizing the energy consumption and maximizing the accuracy, can be achieved by tuning the weighting coefficients Q, R, and S. Before tuning the cost function weights, all the terms should be scaled to the same range. For the traditional reference-tracking cost function in MPC, there exist rough guidelines on how to set the weights of the cost function terms to prioritize various goals (Garriga & Soroush, 2010). Starting from these guidelines, we experiment with different values for the weights of all three terms of our cost function and adjust them according to different objectives, ranging from minimizing the resource usage to maximizing the accuracy.

We assume that the control policy can be adjusted at every sampling period, which is often the case with systems based on deep learning; therefore we set the control interval to be equal to the duration of the sensor sampling period. In most cases, we are interested in finding the optimal policy over the next couple of intervals only, so the prediction horizon is small as well. Setting these parameters to small values also reduces the amount of computation needed for control policy optimisation.

The MPC decision \(u_1[k]\) is determined at each discrete time step k, by solving the following constrained minimization problem:

$$\begin{aligned} \begin{aligned}&\underset{u_1[k]}{\text {minimize}}&J(y,u,k) \\&\text {subject to}&x[k+1]=A \cdot x[k] + B \cdot u[k];\\ & y[k]=C\cdot x[k];\\ & x[0]=x_0;\\ & u_1[k] \in \{1, \dots , n\};\\ \end{aligned} \end{aligned}$$

where x are the states, y are the outputs, \(u_1\) is the controlled input, A is the system matrix, B is the control matrix, C is the output matrix, n is the number of available NN configurations, and \(x_0\) is the initial state of the system.
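Because \(u_1[k]\) ranges over a handful of discrete NN configurations and the prediction horizon is short, this minimisation can be solved by direct enumeration. The following sketch assumes the identified matrices and cost terms already scaled to a common range; the function name and signature are illustrative:

```python
import numpy as np
from itertools import product

def mpc_step(A, B, C, x0, context, n, N, Q, R, S):
    """One controller iteration: enumerate all configuration sequences of
    length N, simulate the linear model (2)-(3), accumulate the custom
    cost (11), and return the first move of the cheapest sequence.

    With n discrete configurations and a short horizon N, exhaustive
    search over the n**N sequences is inexpensive.
    """
    best_cost, best_move = np.inf, 1
    for seq in product(range(1, n + 1), repeat=N):
        x, cost = x0.copy(), 0.0
        for u1 in seq:
            u = np.array([u1, context])
            x = A @ x + B @ u                      # state update, (2)
            y = C @ x                              # [avg_entropy, battery], (10)
            cost += Q * y[0]**2 - R * y[1]**2 + S * (context - u1)**2  # (11)
        if cost < best_cost:
            best_cost, best_move = cost, seq[0]
    return best_move   # apply only the first move, then re-plan at k+1
```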

4 Evaluation

In this section we assess the benefits brought by the predictive DL approximation control algorithm developed in the previous section. Our method is designed to automatically and dynamically balance performance and resource utilisation, and we show how different accuracy-resource usage trade-offs can be achieved by tuning the knobs of the predictive control algorithm according to the varying demands, thus enhancing the system’s adaptability to various deployment contexts and application requirements. Furthermore, we assess the sensitivity of our approach to the impact of the context and evaluate the generalisability of our solution in various application fields and using different DL models.

To summarise, we aim to answer the following research questions:

  • RQ1: Can our predictive control algorithm enable varying accuracy-resource usage trade-offs?

  • RQ2: What is the benefit of taking the context into account in our predictive control algorithm?

  • RQ3: Does our approach generalise across application areas and DL models?

  • RQ4: How does our approach compare to the state of the art?

4.1 Experimental setup: applications, datasets, architectures, and approximation techniques

Applications

We conduct experiments in three widely different domains: human activity recognition (HAR), spoken keyword recognition (SKR), and object classification (OC) in computer vision. In each of these scenarios we define the relevant dimensions of the context and periodically assess the context’s variability. For HAR we consider the type of activity a user is performing as the context that demands more or less precise DL. For SKR we assume the context is represented by the varying level of background noise in the recording, which, in turn, impacts the difficulty of keyword recognition. For OC, we assume a scenario where the brightness of each image defines the context, so that DL approximation performs differently under different levels of brightness.

Datasets

For HAR we perform experiments on the UCI HAR dataset (Anguita et al., 2013). This dataset consists of records of smartphone inertial sensor data from 30 users, labeled according to the class of activity performed (walking, upstairs, downstairs, sitting, standing, lying). The UCI HAR sensor signals (accelerometer and gyroscope) were pre-processed in fixed-width sliding windows of 2.56 seconds with a 50% overlap. For SKR we use the mini Speech Commands dataset (Warden, 2018), an audio dataset consisting of audio clips of 1 second or less sampled at 16 kHz and containing one of the following spoken words: “up”, “down”, “left”, “right”, “go”, and “stop”. We converted the waveforms in the audio files of the dataset to spectrograms and used these spectrograms for training the neural network. For OC we use Matlab’s Vehicle dataset, a collection of 295 images, each containing one or two labeled instances of a vehicle. Most of the images in this dataset are from the Caltech Cars 1999 and 2001 datasets (Weber et al., 2000; Philip et al., 2022).

DL architectures

We evaluate our approach on two DL architectures: MobileNet V1 (Howard et al., 2017), a lightweight DL model commonly used for mobile and embedded vision applications, and YOLO v3 (Redmon & Farhadi, 2018), a larger object detection network. For the UCI HAR dataset, we transform the 3-axis acceleration and rotation data into an \(8 \times 128\) matrix, stack it into 32-element wide blocks to fit the input size, and feed it into MobileNet V1, which takes \(32 \times 32 \times 3\) RGB images as input.

Approximation techniques

We experiment with slimmable neural networks (SNNs) (Yu & Huang, 2019) and pruning (Molchanov et al., 2019). For SNNs we train a shared network with switchable batch normalization layers (Yu & Huang, 2019) corresponding to four different widths of the MobileNet V1 neural network architecture: 100% width (full network), 75% width, 50% width and 25% width. With the filter pruning compression technique, we identify the least important convolution filters in the network according to the Taylor-based importance scores and remove them. In addition to the original structure, we select two more configurations: a pruned network configuration that has 70% fewer parameters than the original one, and another one with 80% fewer parameters. Irrespective of the approximation technique, we calibrate each model’s predictions using temperature scaling to ensure that the output probabilities can be interpreted as confidence scores that more closely match the network’s actual accuracy.

Fig. 4: Energy measurements setup: UDOO Neo board and Monsoon power meter

4.2 Accuracy-energy consumption trade-off

To answer RQ1 we initially focus on the MobileNet SNN and measure the energy consumption during DL inference with differently slimmed versions of the SNN. The measurements were performed using Monsoon, a high-frequency power meter (Schuler & Anderst-Kotsis, 2019), and a UDOO Neo, an IoT development board running Linux (Fig. 4). The system-wide energy consumption for performing inference using each of the fixed slimming widths of the MobileNet SNN is depicted in Table 1. The results indicate that slimming a network indeed results in substantially reduced energy consumption for DL inference, thus representing a viable basis for improving the energy efficiency of DL inference.

Our predictive control system allows different goals to be set according to the varying scenarios that a mobile application might encounter. This flexibility is achieved by adjusting the weights of the cost function (11). For example, assigning a higher weight to the entropy term (\(Q=0.4, R=0.1\)) will prioritise maximising accuracy, while assigning a higher weight to the energy-related term (\(Q=0.4, R=0.8\)) will prioritise energy savings. Importantly, these adjustable parameters (knobs) do not merely enable what-if analysis but make the adaptivity of the dynamic control possible. This allows the system to adapt to various deployment contexts and application requirements, ensuring optimal performance and resource utilisation.

Figure 5 illustrates how different accuracy-resource usage goals can be achieved: the MPC points on the plots correspond to various combinations of these factors within the cost function. For comparison, we also plot the accuracy–energy consumption achieved by fixed approximation levels (i.e. full-width SNN, 75%-width SNN, 50%-width SNN, and 25%-width SNN).

Table 1 MobileNet V1 SNN energy consumption for a single input inference on a UDOO Neo board

We also compare our approach with “Best model for context”. This oracle-based solution knows the actual context and, for each datapoint, uses the approximation level that yields the highest inference accuracy on the training set for the current datapoint’s context (i.e. its activity class in the case of HAR and its noise level in the case of SKR). Thus, in Fig. 5a we have a point termed “Best model for class” and in Fig. 5b we have “Best model for noise”.

Figure 5 shows that by adjusting the cost function’s weights, we can achieve a variety of accuracy-energy trade-off points, each corresponding to a different scenario, ranging from the most parsimonious one – the bottom-left MPC points – to the most accurate one – the top-right MPC points. Importantly, the MPC points score better than any individual network approximation, either from the accuracy point of view (with the same energy consumption), from the energy consumption perspective (with the same accuracy), or both. For HAR (Fig. 5a), the predictive control approach achieves 2% higher accuracy than the most accurate neural network configuration with the same energy consumption (in the accuracy-focused scenario), or a little less than 2% higher accuracy than the most energy-efficient neural network at almost the same energy consumption (in the efficiency-focused scenario). In the case of SKR (Fig. 5b), we observe that by adjusting the factors of the cost function we can achieve accuracy on par with the most accurate NN approximation (SNN100%) at almost 40% lower energy consumption. Alternatively, we can achieve the same energy consumption as the most energy-efficient approximation (SNN25%) and still gain a 0.5% accuracy improvement. In both cases – HAR and SKR – increasing the weight of the third factor in our cost function, the one that enforces the selection of the best approximation level required by the current context, leads to our solution (MPC5) achieving almost the same result as the oracle “Best model for context” solution.

Fig. 5: Predictive control accuracy vs. energy consumption results for MobileNet V1 Slimmable NN on two datasets. MPC datapoints describe the trade-off curve obtained using various cost function parameter values

Table 2 Comparison of our approach with the SoTA on UCI HAR dataset

4.3 Overall performance comparison with state of the art

To further validate our approach and answer RQ4, we compare our solution with other state-of-the-art techniques, evaluating both the accuracy and the resource consumption of each approach. Table 2 summarizes the results. For all implementations except SNN MobileNet-V1 (Baseline) and MPC+MobileNet-V1 (ours), the accuracy scores are those reported in Zhongkai et al. (2022). The reported time and energy measurements were performed on a Raspberry Pi 4 board.

Our approach (MPC+MobileNet-V1) achieves 92.9% accuracy with the same parameter count (6.0M) as the baseline SNN MobileNet-V1, demonstrating improved performance without added complexity. It outperforms lightweight models like MobileNet-V2 (81.62%) and MobileNet-V3Small (91.25%) while approaching the accuracy of more complex models like Inception-V3 (94.23%) and EfficientNet B0 (93.53%) at lower energy and time costs. Compared to PyramidNet18 (0.4M, 92.56%), our method strikes a better balance of accuracy and efficiency, making it well suited to resource-constrained scenarios. Furthermore, as shown in the Energy vs. Accuracy plot (Fig. 6), our method achieves a high accuracy (92.90%) with low energy consumption (25.30 \(\mu \)Ah), lying favourably on the efficiency frontier and thus demonstrating the suitability of our approach for energy-constrained environments.

Fig. 6: Energy vs. accuracy comparison of our approach vs. the state-of-the-art models described in Table 2 on the UCI HAR dataset

4.4 Benefits of context-awareness

In this section we answer RQ2 and assess to what extent knowledge of the context improves our approach’s performance. In our approximation control approach, the task-relevant context (e.g. whether a user is walking or running, the level of noise, image luminosity, etc.) is provided through measured disturbances in the model learning phase. During this phase, the impact of the context on the system’s state and output (along with the impact of using a certain NN approximation) is learned through simulations, in which data reflecting different contextual scenarios (e.g. various noise levels) is used to set the values of the parameters of the mathematical model guiding the adaptation (5). Trained in such a manner, the model is then able to adapt to context variations at runtime. Thus, in the test phase, we are able to estimate the context and use context variations when deciding on the next approximation level. Furthermore, our approach can tie a particular operational mode to a particular contextual state, e.g. use the 25%-width SNN if the user is “lying”, by harnessing the context-related term in the cost function (11).

To assess the impact of the knowledge of the context on the quality of the adaptation, we perform experiments using both context-based adaptation and context-agnostic adaptation (where the cost function (11) and the mathematical model do not feature the context-related term), leaving all other factors unchanged. Figure 7 shows the accuracy vs. energy consumption achieved with the predictive control approach for the HAR (Fig. 7a) and SKR (Fig. 7b) use cases with an explicit context-based adaptation (orange) and a context-agnostic adaptation (green), respectively. The differences between the two approaches are clearly visible for the HAR use case, where the trajectories of the two curves strongly diverge from the first adaptation trade-off point to the last one. In the case of SKR, the gap between the two curves (obtained by connecting the trade-off points achieved with the context-based and context-agnostic adaptation, respectively) widens only towards the right-most, accuracy-centered points. In the resource-centered scenarios (left-most points on the graph), the focus is on minimizing the resource usage, and the strategy for achieving this goal is to interchange the smallest, most compressed approximations when the entropy increases or decreases. This strategy allows for some accuracy gain (compared to using only the smallest NN approximation, for example), but there is not much room for involving the larger approximations, since the focus is on resource saving.

The accuracy-centered scenarios, on the other hand, have more freedom in selecting the optimal NN approximation among all available approximations; hence here we can better see the impact of context-related knowledge: the context-based adaptation converges towards an “oracle” solution (based on using the best model for each context scenario), while the context-agnostic adaptation converges towards the most accurate model. Without the contextual dimension integrated into the dynamic system model and into the cost function, the context-agnostic adaptation strategy is focused solely on balancing the two contrasting goals – maximizing the accuracy with minimum resources – which translates into changing the approximations based only on the entropy variations (and not on the context). The context-based adaptation achieves superior results by exploiting the knowledge about the context and the relationship between various contextual factors and the NN approximations.

Fig. 7: Comparison between the predictive control’s accuracy vs. energy consumption results for MobileNet V1 Slimmable NN on the UCI HAR dataset (a) and Speech Commands dataset (b). The orange datapoints describe the trade-off curve obtained using various weights on the cost function of the predictive control algorithm with context-based adaptation, while the green datapoints describe the trade-off curve obtained using the predictive control algorithm with context-agnostic adaptation

5 Real world implementation

To evaluate our approach in a real-world scenario, we design a portable computer vision-based vehicle detection system for driver assistance. Vehicle detection inference is often impacted by dynamic changes of illumination due to weather or road conditions, especially if NN approximations are used in order to achieve real-time responsiveness. We implement such a system for embedded devices and use it to evaluate the approximate NN control strategy proposed in this paper. The aim is to achieve inference that is as accurate as possible within the shortest amount of time, i.e. with the fastest model configuration, all in an environment where the scene illumination varies.

To perform vehicle detection, we trained the Yolo v3 architecture on a subset of the Vehicles dataset and applied filter pruning to create two approximate configurations: one with 70% fewer parameters and another with 80% fewer parameters. The 70% pruned network achieved a precision of 91% with an average inference time of 3.08 s, while the 80% pruned network achieved a precision of 89% with a faster inference time of 2.62 s. In comparison, the full Yolo v3 model achieved the highest precision of 94% but required the longest inference time of 4.42 s.

Using the methodology described in Section 3, we construct a mathematical model that reflects the inference precision and running time variation of the three flavours of our Yolo v3 network. For that, we alter the training subset of the Vehicles dataset with different degrees of brightness. We then design the controller that adjusts the configuration in a predictive manner during runtime to fit the demands of the context’s variability. The context is measured using a brightness estimation function.
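As an illustrative sketch (the estimator below is one common choice, not necessarily the exact function we deploy), brightness can be estimated as the mean luma of the frame:

```python
import numpy as np

def estimate_brightness(rgb_frame):
    """Estimate scene brightness as the mean luma of the frame,
    normalised to [0, 1], using the ITU-R BT.601 weights.

    rgb_frame: (H, W, 3) uint8 array.
    """
    luma = (0.299 * rgb_frame[..., 0]
            + 0.587 * rgb_frame[..., 1]
            + 0.114 * rgb_frame[..., 2])
    return float(luma.mean() / 255.0)
```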

We evaluate the precision and energy consumption of NN adaptation based on our controller on a Raspberry Pi 4 Model B embedded device equipped with a Raspberry Pi camera module V2.1, shown in Fig. 8. Using Matlab Coder, we generate the Raspberry Pi C code and run the predictive control framework on the subset of images pre-loaded onto the device, and assess the performance in terms of both inference precision and time. We chose to perform the system-wide evaluation in a controlled environment (i.e. on a pre-loaded trace of images instead of using real-time acquisition from the camera) in order to isolate the performance of NN adaptation from system artefacts. We do, however, independently confirm that the image acquisition, vehicle inference, and NN adaptation pipeline works as a whole.

Fig. 8: Raspberry Pi 4 computing system running MPC inference on the Vehicle dataset

Figure 9 depicts the trade-off points achieved with our predictive control approach driving the Yolo v3 network among its three possible approximate configurations. The detection is performed on the training subset of the Vehicles dataset where the image brightness varies. Our approach allows a fine-grain trade-off, for instance, achieving a 20% speedup with the same precision as the full (unpruned) network, and up to a 35% speedup with only a 5% drop in precision compared to the full network. In addition, compared to using the fastest, most compressed network for all the images, our approach yields 6% higher precision while maintaining a similar running time. This improvement stems from the MPC’s ability to select and exploit a different approximate configuration of the NN for different images depending on their properties, such as the inference entropy achieved over previous images in the sequence, the brightness of the image, and the available energy.

Fig. 9: Accuracy vs. time results for vehicle detection using Yolo v3 on the Raspberry Pi platform. MPC datapoints describe the trade-off curve obtained using various weights on the cost function of the predictive control algorithm

5.1 System identification and runtime overhead

The calculation of the mathematical model driving the control, i.e. performing system identification, is a task performed just once, offline, and thus incurs a one-off cost of less than half an hour. The learning time is dominated by the cost of generating simulation data for learning the coefficients of the state-space model. In comparison to the training time of a typical DNN, this overhead is negligible.

The controller runtime overhead is mainly influenced by the computational complexity of solving the control cost function minimisation. This complexity, in turn, depends on the complexity of the objective function and on the number and type of constraints (linear/non-linear) involved and is reflected in the number of iterations a solver has to perform to find a solution. For example, for a simple quadratic cost function (\(x^2\)) and no constraints, the number of iterations needed is usually 1, thus the complexity of the function is O(1). In general, solving the minimisation of a quadratic constrained MPC function can be done in polynomial time (Alamo et al., 2005). For our cost function, with all the constraints applied, in practice, it takes between 2 and 4 iterations to find the solution.

In terms of the running time on a Raspberry Pi device, the overhead of the controller is minimal, as depicted in Fig. 10. Out of a total average execution time of less than 1 second to classify an image, the controller accounts for between 1% and 20% (depending on the size of the prediction horizon used). The low computational cost of our control strategy shows that our approach meets the requirements for practical real-time deployment on low-end devices.

Fig. 10: Time measurements on Raspberry Pi 4B

6 Discussion and limitations

In this paper we developed and prototyped a generally-applicable solution for approximate DL adaptation on resource constrained devices. Being, to the best of our knowledge, the first control system for dynamic approximation adaptation on resource constrained devices, our work faces the following limitations that we plan to address in our future work.

The first issue is the potential divergence between the training and the test set distribution. Further experimentation is needed to quantify the robustness of our approach to such divergences. Intuitively, however, if there is a very large imbalance in the training data vs. the test data, the trained DL model would likely be less confident in its predictions, which would in turn lead to high entropy values. Consequently, our control system would react to this high entropy by favouring inference accuracy and approximating DL cautiously.

Second, deploying our solution in real-world engineering applications requires the identification of a relevant contextual dimension that interplays with the inference accuracy and DL model approximation. In this paper, since our goal is to demonstrate the concept, we focused on use cases where identifying such contextual dimensions is not challenging. However, in a less-researched setting it might not be straightforward to identify the dimension with the highest impact on the inference.

Finally, in our approach we assume and implement a linear model of the dynamic system. This linearity has its own limitations and might not be optimal in all cases; hence other types of models should also be investigated. Potentially, a model-agnostic approach, such as reinforcement learning, could be considered for guiding the adaptation among the various approximate network configurations. However, unlike MPC, reinforcement learning-based approaches to control suffer from stability and robustness issues, as well as difficulties in handling constraints, and all these aspects should be carefully considered when designing a controller for real-world applications.

7 Conclusion

In this work we develop a system for controlling the approximation of deep learning models running on edge/mobile devices. In our data-driven approach we harness control system theory and construct equations describing the relationship among resource usage, approximation levels, and the delivered result quality. We start from the intuition that the inference accuracy and resource usage of adaptable approximate DL can be mathematically modeled and thus predicted. Although in many ubiquitous computing environments long-term prediction of accuracy and resource usage is not feasible due to the fast-changing nature of the context, short-term future states can be anticipated by exploiting the relationships between the entropy of the confidence of consecutive DL inferences. We show that the entropy represents a good proxy for the performance of the network approximations over a short time horizon, and thus we can exploit it to construct equations that describe the dynamics of DL approximation in time. We then design a model-predictive control-based method for adaptation that steers the control towards a particular end-goal under different contextual changes. We provide extensive experimentation in three domains – human activity recognition, spoken keyword recognition, and computer vision – with two different neural network architectures and two different approximation techniques, and demonstrate that our system can lead to up to 50% energy savings without sacrificing the quality of DL inference. In addition, the runtime overhead of the controller is minimal. Finally, the comparison with other state-of-the-art approaches shows that our method achieves a strong balance between accuracy and model complexity. With this, we believe that our method makes embedded DL future-proof in light of ever-increasing DL model complexity, while also expanding the support for on-device learning to a wider range of less-capable devices.