Article

Virtual Machine Placement in Edge Computing Based on Multi-Objective Reinforcement Learning

by Shanwen Yi 1, Shengyi Hong 2, Yao Qin 2,*, Hua Wang 3 and Naili Liu 4
1 School of Computer Science and Technology, Shandong University, Jinan 250100, China
2 Department of Investigation, Shanghai Police College, Shanghai 200137, China
3 School of Software, Shandong University, Jinan 250100, China
4 School of Information Science and Engineering, Linyi University, Linyi 276000, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(3), 633; https://doi.org/10.3390/electronics14030633
Submission received: 27 December 2024 / Revised: 23 January 2025 / Accepted: 5 February 2025 / Published: 6 February 2025

Abstract

With the popularization of the Internet of Things (IoT), the energy consumption of mobile edge computing (MEC) servers is also on the rise. Some important IoT applications, such as autonomous driving, smart manufacturing, and smart wearables, have strict real-time requirements, making it imperative for edge computing to reduce task response latency. Virtual machine (VM) placement can effectively reduce the response latency of VM requests and the energy consumption of MEC servers. However, existing work does not consider the selection of weighting coefficients for the optimization objectives or the feasibility of the solution. In addition, these algorithms scalarize the objective functions without considering the order-of-magnitude difference between objectives. To overcome these problems, this article proposes EVMPRL, an algorithm for VM placement in edge computing based on reinforcement learning (RL). Our aim is to find a Pareto approximate solution set that achieves a trade-off between the response latency of VM requests and the energy consumption of MEC servers. EVMPRL is based on the Chebyshev scalarization function, which efficiently solves the problem of selecting weighting coefficients for the objectives. EVMPRL always searches for solutions in the feasible domain, which is guaranteed by selecting only servers that can satisfy the current VM request as the next action. Furthermore, EVMPRL scalarizes the Q-values instead of the objective functions, thus avoiding the problem in previous work where the order-of-magnitude difference between the optimization objectives makes the impact of one objective function on the final result too small. Finally, we conduct experiments to show that EVMPRL is superior to the state-of-the-art algorithm in terms of both objectives and solution set quality.

1. Introduction

The market size of the Internet of Things (IoT) has grown significantly [1]. The increase in IoT services has resulted in the explosive growth of smart devices such as smartphones, wearables, smart cars, and smart factories [2]. The quantity of IoT equipment will reach 2.46 billion globally by 2025, with an average annual growth rate of 13% [3]. As IoT applications such as autonomous driving, smart manufacturing, and smart wearables demand low latency, they are gradually shifting from distant cloud data centers to mobile edge computing (MEC) servers that are close to the user equipment (UE) [4]. For these reasons, the energy consumption of MEC servers is also on the rise. This increase in energy consumption not only increases the operation cost but also causes serious carbon emissions and accelerates global warming.
Virtual machine (VM) placement has been shown to be an effective solution to reduce energy consumption in edge computing [5]. VM placement can optimize many important metrics in cloud and edge computing, like energy consumption, resource wastage, and latency [6]. Considering multiple optimization objectives has become a trend in VM placement nowadays. For VM placement problems in edge computing, the computational tasks can be modeled as VM requests. VM requests from the UE side are assigned to the most appropriate MEC server and then passed back to the UE side after processing [7]. Due to the requirements of real-time IoT applications [8], we consider the response latency of the VM requests and the energy consumption of the MEC servers. Nevertheless, we found that discussions of VM placement algorithms in edge computing environments tend not to show how to select appropriate weighting coefficients for the objectives.
As a matter of fact, the weighting coefficients play a crucial role in the solution set quality of multi-objective optimization problems. However, most existing multi-objective VM placement algorithms in edge computing utilize the linear scalarization function, making it challenging to select suitable weighting coefficients for the objectives [9]. Unsuitable weighting coefficients cause the calculated solution set to deviate from the Pareto optimal set and result in a degradation of the solution set quality [10]. Fortunately, multi-objective reinforcement learning (RL) algorithms utilizing the Chebyshev scalarization function can effectively select weighting coefficients [11,12]. This method can find Pareto-optimal solutions for various Pareto frontier shapes [13]. As a result, this method can guarantee the solution set quality under most combinations of weighting coefficients. The existing algorithms also tend to overlook the solution’s feasibility, which leads to extra operations when VMs request more resources than the servers can provide [14,15] and increases unnecessary overhead. In addition, these algorithms sum the objective functions without considering the order-of-magnitude differences between the objective values [14,15,16,17,18]. This causes an optimization objective with a much smaller order of magnitude to have a limited impact on the final result. For example, the energy consumption in our experiments is much greater than the latency. If we scalarize the objective functions and add them up, the latency will have a negligible effect on the final result.
In this paper, we propose a multi-objective RL algorithm called EVMPRL to deal with the VM placement problem in edge computing. The VM placement problem is a discrete decision-making problem, which can be solved using RL [19]. The goal of EVMPRL is to find a Pareto approximate solution set that achieves a trade-off between reducing the response latency of VM requests and reducing the energy consumption of MEC servers. Compared with existing algorithms, EVMPRL uses the Chebyshev scalarization function to select appropriate weighting coefficients and improve the solution set quality. EVMPRL always searches for solutions in the feasible domain, thus avoiding additional operations. In addition, EVMPRL scalarizes the Q-values rather than the objective functions, which effectively avoids the problems associated with excessive order-of-magnitude differences.
Finally, EVMPRL is compared with the state-of-the-art algorithm. The experimental results prove that EVMPRL outperforms the existing algorithm in two objectives and the solution set quality.
The main contributions are as follows:
  • This paper uses the Chebyshev scalarization function to design EVMPRL, which could solve the weighting coefficient selection problem in multi-objective VM placement in edge computing and significantly improve the solution set quality.
  • EVMPRL can always search for solutions in the feasible domain, which is guaranteed through selecting the servers that can satisfy the current VM request as the next action. Thus, the additional operations required in the previous work are avoided.
  • EVMPRL scalarizes the Q-value instead of the objective function, thus avoiding the problem in previous work in which the order-of-magnitude difference between the optimization objectives makes the impact of an objective function on the final result too small. To prevent excessive order-of-magnitude differences from affecting the results, we normalize two reward functions and keep the reward values between [0,1].
  • The experimental results prove that EVMPRL is superior to the state-of-the-art algorithm in terms of objectives and the solution set quality.
The remainder of this paper is organized as follows. In Section 2, we introduce the related work of VM placement. In Section 3, we introduce the relevant background. In Section 4, we model the system and present the mathematical formulation. In Section 5, we introduce our proposed algorithms in detail. In Section 6, we conduct experiments to compare EVMPRL with the state-of-the-art algorithm and evaluate EVMPRL’s convergence and scalability. In Section 7, we summarize this paper.

2. Related Work

VM placement can effectively optimize many important metrics, such as energy consumption, resource wastage, latency, and Service Level Agreement (SLA) violation rate in both edge computing and cloud computing environments, thus reducing the cost of operations and improving the user experience [20]. Table 1 shows a summary and comparison of the representative work on VM placement.

2.1. VM Placement in Cloud Computing

VM placement is critical for cloud computing. Researchers have conducted extensive studies on VM placement.
From Table 1, we can see that many VM placement algorithms are energy-aware. Gao et al. [21] propose an ant colony optimization (ACO) algorithm called VMPACS. The aim of the VMPACS is to achieve a trade-off between resource wastage and energy consumption in datacenters. VMPACS stores ant pheromones on each set of VM–server pairs. When assigning a VM, the pheromone will be updated. Shaw et al. [22] propose a multi-objective VM placement algorithm called PACPA to achieve a trade-off between minimizing the SLA violation rate and energy consumption. PACPA uses artificial neural networks (ANN) to predict the resource usage.
In [23], the authors propose a multi-objective chemical reaction optimization (CRO) algorithm called CVP with the aim of achieving a trade-off between maximizing resource utilization and minimizing energy consumption. CVP is a permutation-based algorithm, and contains three kinds of operators. To achieve a trade-off between the share of energy generated by renewable energy sources and carbon emissions, the authors of [24] study VM placement in data centers using a mixture of renewable and conventional energy supplies. Their proposed EFP algorithm consists of two successive heuristic algorithms.

2.2. VM Placement in Edge Computing

At present, applications are gradually shifting from distant cloud data centers to MEC servers. This has led to a proliferation of research related to the placement of VMs in edge computing. In edge computing, different kinds of latency and energy consumption are important research topics due to the popularization and high real-time requirements of some important IoT applications, such as autonomous driving, smart manufacturing, and smart wearables.
Many VM placement algorithms in edge computing are based on artificial intelligence algorithms or metaheuristics. Jian et al. [14] propose a bat algorithm (BA) called OEMBA. It aims to achieve a trade-off between the server power and placement delay. OEMBA avoids premature convergence using the second-order oscillation factor. Xu et al. [15] consider the response latency and energy consumption in edge computing environments. This study proposed a multi-objective RL algorithm called MRLVMP. MRLVMP uses the method of order exchange and migration (OEM) to relocate the VMs that exceed the current server capacity. Li et al. [18] study the server placement problem in edge computing, and propose a particle swarm optimization (PSO) algorithm to capture the trade-off between the energy consumption and the resource utilization. In order to make PSO applicable to discrete decision problems, the authors modified PSO according to the application scenario.
There are also heuristic-based algorithms for VM placement. Zhang et al. [16] propose a VM placement algorithm for virtualized multi-access edge computing. The authors convert their VM placement problem into a two-sided matching problem. The aim is to achieve a trade-off between energy consumption and communication delay. Jia et al. [17] investigate the cloudlet placement problem in the wireless MANET environment. This work reduces the average waiting time by finding the optimal VM placement policy.
The above algorithms scalarize their objective functions using the linear scalarization function to form a new objective function. The weighting coefficients are set to equal values without considering the order-of-magnitude difference between the objectives or the selection of the weighting coefficients. For instance, the energy consumption is an order of magnitude higher than the response latency in [15], resulting in the response latency having a negligible effect on the results. Similarly, the energy consumption is an order of magnitude higher than the resource utilization in [18], resulting in the resource utilization having a negligible effect on the results. Furthermore, neglecting the selection of weighting coefficients reduces the quality of the resulting solution set. In addition, some existing algorithms do not consider the feasible domain when selecting the next action, resulting in additional operations and unnecessary overhead. To address the above problems, we propose EVMPRL.

3. Background

For illustrative purposes, this section provides the relevant background.

3.1. Pareto Domination

Pareto domination has a wide range of uses in multi-objective optimization. A multi-objective minimization problem with k objectives is formulated as $\min f(x) = \{f_1(x), \ldots, f_k(x)\}$, where x denotes the decision variable and $f = \{f_1, \ldots, f_k\}$ denotes the vector of all objective functions [25].
For the above minimization problem, solution $x_1$ Pareto-dominates solution $x_2$ if, and only if, both of the following conditions are met simultaneously [26]:
  • $\forall f_i \in f,\; f_i(x_1) \le f_i(x_2)$;
  • $\exists f_i \in f,\; f_i(x_1) < f_i(x_2)$.
A solution is called a non-dominated solution if it is not Pareto-dominated by any other solutions. The Pareto optimal solution set consists of all non-dominated solutions [27]. Pareto-based multi-objective algorithms usually measure the solution set quality by hypervolume.
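To make the dominance test concrete, the following Java sketch (our own illustration, not code from the paper) checks whether one objective vector Pareto-dominates another in a minimization setting; the objective vectors are plain double arrays.

```java
// Pareto dominance for a minimization problem: x1 dominates x2 iff x1 is
// no worse in every objective and strictly better in at least one.
public final class Pareto {
    public static boolean dominates(double[] x1, double[] x2) {
        boolean strictlyBetter = false;
        for (int i = 0; i < x1.length; i++) {
            if (x1[i] > x2[i]) {
                return false;           // worse in some objective
            }
            if (x1[i] < x2[i]) {
                strictlyBetter = true;  // strictly better in this objective
            }
        }
        return strictlyBetter;
    }
}
```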

3.2. Reinforcement Learning

The RL learning agent operates interactively with a specific environment and is rewarded or penalized by the environment for its action. RL can be modeled as the Markov decision process [28]. Let $S = \{s_1, \ldots, s_M\}$ denote the state space and $A = \{a_1, \ldots, a_N\}$ denote the feasible action space. RL aims to find the policy with the maximum expected cumulative reward.
Q-value denotes the discounted cumulative reward; Q π ( s , a ) denotes the Q-value when taking action a in the state s under the policy π . Watkins [29] proposes using the Q ^ -value to approximate the optimal value Q * step by step. Let Q ^ ( s , a ) denote the Q ^ -value when taking action a in the state s, which is updated as follows:
$$\hat{Q}(s,a) = (1 - \alpha_t)\,\hat{Q}(s,a) + \alpha_t \left( r + \gamma \max_{a'} \hat{Q}(s',a') \right), \tag{1}$$
where $\alpha_t$ is the learning rate, r is the reward, and $\gamma$ is the discount factor. As long as the learning agent visits the state–action pairs infinitely often with an appropriate learning rate, the $\hat{Q}$-value will converge to the optimal $Q^*$-value [30].
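As a minimal sketch of the tabular update in Equation (1) (written for illustration and not taken from the paper's implementation), the Q-table can be kept as a dense array indexed by state and action:

```java
// Tabular Q-learning update of Equation (1):
// Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a')).
public final class QLearning {
    private final double[][] q;   // q[state][action]
    private final double alpha;   // learning rate
    private final double gamma;   // discount factor

    public QLearning(int numStates, int numActions, double alpha, double gamma) {
        this.q = new double[numStates][numActions];
        this.alpha = alpha;
        this.gamma = gamma;
    }

    public void update(int s, int a, double reward, int sNext) {
        double best = Double.NEGATIVE_INFINITY;
        for (double v : q[sNext]) {          // max over actions in the next state
            best = Math.max(best, v);
        }
        q[s][a] = (1 - alpha) * q[s][a] + alpha * (reward + gamma * best);
    }
}
```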
Similarly, the multi-objective RL is usually described as a multi-objective Markov decision process in which the reward vector replaces the reward [31].

4. Problem Formulation

4.1. Systems Modeling

Figure 1 shows the MEC architecture, which contains the cloud data center layer, MEC server layer, and UE layer [32]. In this paper, each computational task on the UE side is assigned to the most suitable MEC server in the form of a VM request, which occupies a certain amount of MEC server resources. The results are then passed back to the UE side after processing. All “servers” mentioned later refer to “MEC servers”, and “VM requests” and “VMs” will be used interchangeably in the remainder of this paper.
Let V represent the set of all VM requests and P represent the set of all MEC servers. Assume that the problem involves m VMs and n servers. We consider two kinds of server resources, including the memory resources and CPU core number [33]. A VM request i V is expressed as { R i p , R i m , d i } , where R i p denotes the demand of VM i on the number of CPU cores, R i m denotes the demand of VM i on memory resources, and d i denotes the computational demand of VM i, measured in Million Instructions (MI). The resources of server j P are expressed as { T j p , T j m , ρ j } , where T j p denotes the CPU core number of server j, T j m refers to the memory size of server j, and ρ j is the CPU processing speed of server j, measured in million instructions per second (MIPS). The CPU processing speed varies depending on the server type.

4.2. Response Latency Model

As in [15], we ignore the transfer latency of the VM requests between MEC servers and UEs and only consider the execution latency incurred by servers in processing VM requests. When VM i is allocated to server j, the response latency t i j is as follows:
$$t_{ij} = \frac{d_i}{cp_{ij}} = \frac{d_i}{\rho_j R_i^p}, \tag{2}$$
where $cp_{ij}$ denotes the VM processing speed when VM i is allocated to server j. The value of $cp_{ij}$ equals the CPU processing speed of server j multiplied by the number of CPU cores demanded by VM request i.
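As a hypothetical numerical example based on the server types later listed in Table 2: a VM request with $d_i = 1000$ MI and $R_i^p = 2$ cores placed on an HP Proliant ML110 G5 ($\rho_j = 2660$ MIPS) incurs $t_{ij} = 1000 / (2660 \times 2) \approx 0.188$ s, while the same request on an HP Proliant DL360 G7 ($\rho_j = 3067$ MIPS) incurs roughly 0.163 s.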

4.3. Energy Consumption Model

Ref. [34] proved that an active server’s power consumption and CPU utilization are linearly correlated. In this paper, servers in the idle state are shut down. Then, the power consumption of server j is as follows:
$$p_j = \begin{cases} (p_j^{full} - p_j^{idle}) \cdot U_j^p + p_j^{idle}, & U_j^p > 0 \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$
where $p_j^{idle}$ represents the idle power consumption of server j, $p_j^{full}$ represents the full-load power consumption of server j, and $U_j^p$ is the CPU utilization of server j. The idle power consumption refers to the power consumption when the CPU utilization is 0 and the server is not shut down.
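A minimal Java sketch of the linear power model in Equation (3) follows; the idle and full-load power values are illustrative placeholders, since this section does not specify them.

```java
// Linear power model of Equation (3): power grows linearly with CPU
// utilization between the idle and full-load power; idle servers are shut down.
public final class PowerModel {
    private final double pIdle;   // idle power in watts (hypothetical value)
    private final double pFull;   // full-load power in watts (hypothetical value)

    public PowerModel(double pIdle, double pFull) {
        this.pIdle = pIdle;
        this.pFull = pFull;
    }

    // cpuUtilization is U_j^p in [0, 1]; a utilization of 0 means the server is shut down.
    public double power(double cpuUtilization) {
        if (cpuUtilization <= 0) {
            return 0.0;
        }
        return pIdle + (pFull - pIdle) * cpuUtilization;
    }
}
```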

4.4. Mathematical Formulation

We assume that any server can satisfy the VM request with the highest resource demand. This paper uses the decision variable $x_{ij} \in \{0,1\}$ to denote whether VM i is placed on server j. Each VM can only be allocated to a server once. Considering the decision variables $x_{ij}$ and Equation (2), the total response latency $T_j$ of all VM requests assigned to server j can be obtained as follows [16,35]:
$$T_j = \sum_{i=1}^{m} t_{ij} = \sum_{i=1}^{m} \frac{d_i}{\rho_j R_i^p} \cdot x_{ij}, \tag{4}$$
based on the above equation, the total response latency of all VM requests is as follows:
$$T_{total} = \sum_{j=1}^{n} T_j, \tag{5}$$
similarly, when considering decision variables, Equation (3) becomes the following:
$$p_j = \begin{cases} p_j^{idle} + (p_j^{full} - p_j^{idle}) \cdot \dfrac{\sum_{i=1}^{m} R_i^p x_{ij}}{T_j^p}, & \sum_{i=1}^{m} x_{ij} > 0 \\ 0, & \text{otherwise}, \end{cases} \tag{6}$$
where $\sum_{i=1}^{m} R_i^p x_{ij} / T_j^p$ denotes the CPU utilization of server j, and $\sum_{i=1}^{m} x_{ij} > 0$ denotes that at least one VM request is placed on server j. Based on Equations (4) and (6), the total energy consumption of all MEC servers becomes the following [15,16]:
$$E_{total} = \sum_{j=1}^{n} p_j\, T_j, \tag{7}$$
based on Equations (5) and (7), our problem becomes the following:
$$\text{Goals:} \quad \min T_{total} \;\;\text{and}\;\; \min E_{total}$$
$$\text{Constraints:}$$
$$\sum_{j=1}^{n} x_{i,j} = 1, \quad \forall i \in V \tag{8}$$
$$\sum_{i=1}^{m} R_i^p\, x_{i,j} \le T_j^p, \quad \forall j \in P \tag{9}$$
$$\sum_{i=1}^{m} R_i^m\, x_{i,j} \le T_j^m, \quad \forall j \in P \tag{10}$$
$$x_{i,j} \in \{0,1\}, \quad \forall i \in V,\; \forall j \in P \tag{11}$$
Constraint (8) indicates that each VM can only be allocated once. Constraint (9) indicates that the total demand for CPU cores of all VMs placed on a server cannot exceed that server’s number of CPU cores. Constraint (10) indicates that the total demand for memory of all VMs placed on a server cannot exceed that server’s memory size. Constraint (11) defines the decision variables.
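The constraints translate directly into a feasibility test on a placement matrix. The sketch below (with our own, hypothetical parameter names) checks constraints (8)–(10) for a candidate assignment x[i][j]:

```java
// Feasibility check for constraints (8)-(10): every VM placed exactly once,
// and no server's CPU-core or memory capacity exceeded by its assigned VMs.
public final class Feasibility {
    public static boolean isFeasible(int[][] x, int[] reqCpu, double[] reqMem,
                                     int[] capCpu, double[] capMem) {
        int m = x.length;       // number of VM requests
        int n = x[0].length;    // number of MEC servers
        for (int i = 0; i < m; i++) {                 // constraint (8)
            int placed = 0;
            for (int j = 0; j < n; j++) placed += x[i][j];
            if (placed != 1) return false;
        }
        for (int j = 0; j < n; j++) {                 // constraints (9) and (10)
            int usedCpu = 0;
            double usedMem = 0.0;
            for (int i = 0; i < m; i++) {
                usedCpu += reqCpu[i] * x[i][j];
                usedMem += reqMem[i] * x[i][j];
            }
            if (usedCpu > capCpu[j] || usedMem > capMem[j]) return false;
        }
        return true;
    }
}
```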

5. Proposed Algorithms

5.1. The Learning Agent

5.1.1. State Space

The state space $S = \{s_1, \ldots, s_m\}$ in EVMPRL is defined as the m VM requests that need to be assigned. At each time step, EVMPRL completes the placement of one VM. Therefore, the total number of time steps equals the number of VM requests. State $s_t \in S$ can be considered as the VM request to be placed at time step t.
At time step t, the resource utilization of server j is expressed as $\{U_{tj}^p, U_{tj}^m\}$, where $U_{tj}^p$ and $U_{tj}^m$ denote the CPU utilization and memory utilization of server j at time step t, respectively. The equations for $U_{tj}^p$ and $U_{tj}^m$ are as follows:
$$U_{tj}^p = \sum_{i=1}^{m} R_i^p\, x_{ij}^t, \tag{12}$$
$$U_{tj}^m = \sum_{i=1}^{m} R_i^m\, x_{ij}^t, \tag{13}$$
where $x_{ij}^t$ denotes whether VM i is allocated to server j at time step t. Based on the state space and Equations (12) and (13), the learning agent can calculate the CPU core utilization and memory utilization of all MEC servers at each time step.

5.1.2. Action Space

One VM is placed at each time step. EVMPRL defines the action as selecting a server with sufficient resources for the current VM request. At time step t, the corresponding action space A t consists of all servers that can satisfy the current VM request s t at this point, and its expression is as follows:
$$A_t = \left\{ l \in P \,\middle|\, \left( \sum_{u=1}^{m} R_u^p x_{u,l}^t + R_{s_t}^p \le T_l^p \right) \wedge \left( \sum_{u=1}^{m} R_u^m x_{u,l}^t + R_{s_t}^m \le T_l^m \right) \right\} \tag{14}$$
The first inequality shows that all servers in the action space A t can satisfy the demand for the number of CPU cores for VM request s t . The second inequality shows that all servers in the action space A t can satisfy the demand for memory for VM request s t .
EVMPRL can always take action in the feasible domain. Therefore, it avoids the extra operations caused by VM requests requiring more resources than server resources.
At time step t, taking action $a_t \in A_t$ in state $s_t$ means selecting server $a_t$ for VM request $s_t$.
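As a sketch of Equation (14) with assumed variable names, the action space can be built by filtering the servers whose remaining CPU cores and memory can still absorb the current VM request:

```java
import java.util.ArrayList;
import java.util.List;

// Build the action space A_t of Equation (14): keep only the servers whose
// already-used resources plus the current VM request stay within capacity.
public final class ActionSpace {
    public static List<Integer> feasibleActions(int reqCpu, double reqMem,
                                                int[] usedCpu, double[] usedMem,
                                                int[] capCpu, double[] capMem) {
        List<Integer> actions = new ArrayList<>();
        for (int j = 0; j < capCpu.length; j++) {
            boolean cpuOk = usedCpu[j] + reqCpu <= capCpu[j];
            boolean memOk = usedMem[j] + reqMem <= capMem[j];
            if (cpuOk && memOk) {
                actions.add(j);   // server j can satisfy the current VM request
            }
        }
        return actions;
    }
}
```

Because the agent only ever chooses among these servers, every constructed solution respects constraints (9) and (10) by construction.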

5.1.3. Reward Vector

Multi-objective RL replaces the scalar reward with the reward vector $r = \{r_1, r_2\}$, where $r_1$ represents the reward for objective 1 and $r_2$ represents the reward for objective 2. To avoid excessive order-of-magnitude differences between the reward values, we normalize the two reward functions and keep the reward values within $[0, 1]$.
Let r 1 ( s t , a t ) denote the reward corresponding to the first objective. The formula of normalized r 1 ( s t , a t ) is designed as follows:
$$r_1(s_t, a_t) = \frac{T_t + \mu}{T_{t+1}^{s_t, a_t}}, \tag{15}$$
where T t is the total response latency generated by all VMs at time step t, T t + 1 s t , a t is the total response latency generated by all VMs after VM s t is placed on server a t at time step t and μ is a small positive constant with a value of 0.0001 .
From Equation (15), we can see that the smaller the response latency generated by the placement of the current VM, the larger the corresponding reward value.
Similarly, the normalized reward r 2 ( s t , a t ) corresponding to the second objective is calculated as follows:
$$r_2(s_t, a_t) = \frac{E_t + \mu}{E_{t+1}^{s_t, a_t}}, \tag{16}$$
where E t is the energy consumption of n servers at time step t and E t + 1 s t , a t is the energy consumption of n servers after VM s t is placed on server a t at time step t.
From Equation (16), we can see that the smaller the energy consumption generated by the placement of the current VM, the larger the corresponding reward value.
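A minimal sketch of the two normalized rewards in Equations (15) and (16) follows; the before/after totals are assumed to be computed from Equations (5) and (7) for the placement before and after the current VM is assigned.

```java
// Normalized rewards of Equations (15) and (16): the ratio of the totals before
// and after placing the current VM keeps both rewards on the same [0,1] scale,
// so neither objective is drowned out by the other's order of magnitude.
public final class Rewards {
    static final double MU = 0.0001;   // small positive constant used in the paper

    public static double latencyReward(double latencyBefore, double latencyAfter) {
        return (latencyBefore + MU) / latencyAfter;
    }

    public static double energyReward(double energyBefore, double energyAfter) {
        return (energyBefore + MU) / energyAfter;
    }
}
```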

5.2. Scalarization Functions in RL Algorithms

5.2.1. Linear Scalarization Function

Multi-objective optimization typically combines all objective functions using a weighted sum to create a new objective function, denoted by L F , through the linear scalarization function. For the solution x, the corresponding value of the new optimization objective is as follows [13]:
$$LF(x) = \sum_{i=1}^{k} w_i \cdot f_i(x), \tag{17}$$
where k is the number of objectives, f i denotes the i-th objective function, and w i denotes the weighting coefficient for the i-th objective. The sum of all weighting coefficients is 1.
Multi-objective RL replaces the above $f_i$ with the $\hat{Q}$-value $\hat{Q}_i(s,a)$ of the i-th optimization objective. Then, the linear scalarized $\hat{Q}$-value $L\hat{Q}(s,a)$ when taking action a in the state s is as follows:
$$L\hat{Q}(s,a) = \sum_{i=1}^{k} w_i \cdot \hat{Q}_i(s,a), \tag{18}$$
the greedy action $a_{lg}(s)$ taken by the learning agent in state s corresponds to the maximum linear scalarization value and is given by the following equation:
$$a_{lg}(s) = \arg\max_a L\hat{Q}(s,a) \tag{19}$$
In this paper, EVMPRL is compared with a comparison algorithm called L-EVMPRL, designed based on the linear scalarization function, to demonstrate EVMPRL’s superiority in solving the weighting coefficients problem. Like most algorithms using the linear scalarization function, L-EVMPRL sets the w 1 and w 2 of two objectives to 0.5 .

5.2.2. Chebyshev Scalarization Function

Most multi-objective VM placement algorithms for edge computing are based on the linear scalarization function. However, when using the linear scalarization function, determining appropriate weighting coefficients is a recognized challenge [11]. Ignoring this problem degrades the solution set quality. Van et al. [13] found that the Chebyshev scalarization function can find the high-quality solution set under most combinations of weighting coefficients.
The Chebyshev scalarization function is a kind of nonlinear scalarization function. It not only overcomes the problems of the linear scalarization function but also searches for solutions in a larger space, rather than only in convex regions. The method can find Pareto-optimal solutions for various Pareto frontier shapes [13]. Our previous work has shown that multi-objective RL algorithms utilizing the Chebyshev scalarization function can effectively select weighting coefficients [11,12]. Thus, the method can find a high-quality solution set under most combinations of weighting coefficients. We selected the solution set with the best quality from multiple randomly generated combinations of weighting coefficients for the subsequent experiments.
This method is based on Chebyshev metrics. The Chebyshev metric of solution x can be obtained using the following equation [36]:
$$L(x) = \max_{i=1,\ldots,k} w_i \cdot \left| f_i(x) - z_i^* \right|, \tag{20}$$
where k denotes the number of objective functions, $f_i$ denotes objective i, $z^*$ denotes the utopian point used to record the current optimal objective vector, $z_i^*$ denotes the i-th component of the utopian point for the i-th optimization objective, $w_i$ denotes the weighting coefficient corresponding to objective i, and the sum of all weighting coefficients for the k optimization objectives is 1. The updating formula of $z_i^*$ is as follows:
$$z_i^* = f_i^{best}(x) + \varphi, \tag{21}$$
where $f_i^{best}(x)$ is the current optimal value of objective i and $\varphi$ is a very small constant, which is set to 1.
For multi-objective RL, it is usually the $\hat{Q}$-value that is scalarized [13]. $f_i(x)$ in Equation (20) is replaced with $\hat{Q}_i(s,a)$, which is the $\hat{Q}$-value of objective i. Then, the scalarized $\hat{Q}$-value $S\hat{Q}(s,a)$ is updated as follows:
$$S\hat{Q}(s,a) = \max_{i=1,\ldots,k} w_i \cdot \left| \hat{Q}_i(s,a) - z_i^* \right|, \tag{22}$$
in state s, the greedy action is selected according to the following rule:
$$a_{cg}(s) = \arg\min_a S\hat{Q}(s,a) \tag{23}$$
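The following Java sketch shows one way Equations (22) and (23) can be realized; qValues[a][i] holds $\hat{Q}_i(s,a)$ for every feasible action of the current state, and the class and method names are our own.

```java
// Chebyshev scalarization of the Q^-values (Equation (22)) and greedy action
// selection (Equation (23)): the agent prefers the action whose weighted
// Chebyshev distance to the utopian point z* is smallest.
public final class ChebyshevPolicy {
    public static double scalarize(double[] qValues, double[] w, double[] zStar) {
        double sq = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < qValues.length; i++) {
            sq = Math.max(sq, w[i] * Math.abs(qValues[i] - zStar[i]));
        }
        return sq;
    }

    // Returns the index of the greedy action, i.e., the argmin over the feasible actions.
    public static int greedyAction(double[][] qValues, double[] w, double[] zStar) {
        int best = 0;
        double bestSq = Double.POSITIVE_INFINITY;
        for (int a = 0; a < qValues.length; a++) {
            double sq = scalarize(qValues[a], w, zStar);
            if (sq < bestSq) {
                bestSq = sq;
                best = a;
            }
        }
        return best;
    }
}
```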

5.3. Scalarized Policies

5.3.1. The Scalarized Policy of EVMPRL

EVMPRL uses a scalarized ε -greedy policy based on the Chebyshev scalarization function. Algorithm 1 shows the pseudo-code of the scalarized ε -greedy policy, where line 1 initializes the parameters, lines 2 to 6 denote the scalarization of the values, lines 7 to 14 select the next action based on Equation (23) and the ε -greedy policy, and line 15 returns the selected action.
If the above policy selects the optimal action only according to Equation (23), we call it the scalarized greedy policy based on the Chebyshev scalarization function, denoted by sgreedy().
Algorithm 1 ε-sgreedy()
 1: Initialize the SQ^-value list l_sq to ∅
 2: for each action a_t ∈ A_t do
 3:     q_t = {Q^_1(s_t, a_t), ..., Q^_k(s_t, a_t)}
 4:     SQ^(s_t, a_t) = scalarize(q_t, w)   % according to Equation (22)
 5:     l_sq = l_sq ∪ SQ^(s_t, a_t)
 6: end for
 7: for each action a_t ∈ A_t do
 8:     ε_0 = rand(0, 1)
 9:     if ε_0 > ε then
10:         Choose the greedy action a according to Equation (23)
11:     else
12:         Choose a random action a from the action space A_t
13:     end if
14: end for
15: return action a
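As a condensed sketch of Algorithm 1 (simplified to a single draw and reusing ChebyshevPolicy.greedyAction() from the previous snippet), the ε-greedy wrapper can look as follows; the action space holds the indices of the feasible servers.

```java
import java.util.List;
import java.util.Random;

// Scalarized epsilon-greedy selection: with probability 1 - epsilon exploit the
// Chebyshev-greedy action, otherwise explore a random server from A_t.
public final class EpsilonSGreedy {
    public static int select(List<Integer> actionSpace, double[][] qValues,
                             double[] w, double[] zStar, double epsilon, Random rng) {
        if (rng.nextDouble() > epsilon) {
            int idx = ChebyshevPolicy.greedyAction(qValues, w, zStar);
            return actionSpace.get(idx);                          // exploit
        }
        return actionSpace.get(rng.nextInt(actionSpace.size()));  // explore
    }
}
```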

5.3.2. The Scalarized Policy of L-EVMPRL

To prove the superiority of EVMPRL in the selection of weighting coefficients, L-EVMPRL, based on the linear scalarization function, is designed as a comparison algorithm. If Equation (22), used for scalarizing the Q^-value in line 4 of Algorithm 1, is replaced with Equation (18), and Equation (23), used for selecting the greedy action in line 10 of Algorithm 1, is replaced with Equation (19), then Algorithm 1 becomes the scalarized ε-greedy policy based on the linear scalarization function. This paper uses ε-Lsgreedy() to denote the scalarized ε-greedy policy based on the linear scalarization function.
Similarly, if policy ε -Lsgreedy() selects the optimal action only according to Equation (19), we call this the scalarized greedy policy based on the linear scalarization function, denoted by Lsgreedy().

5.4. The Detailed Pseudo-Codes of Two Algorithms

Algorithm 2 shows the detailed pseudo-code of EVMPRL. The complexity of each iteration is O(m · n), where m denotes the number of VM requests and n denotes the number of MEC servers.
Algorithm 2 can be divided into 4 steps:
  • Parameter initialization. Initialize the current approximate optimal solution set P s , the Q ^ -value table, and the set of current unplaced VMs.
  • Solution construction. The learning agent selects the action for the current state based on Algorithm 1, updates the decision variable, calculates the reward vector, and updates the corresponding Q ^ -value.
  • Update the approximate solution set. EVMPRL compares the constructed solution with the current approximate solution set and updates P_s.
  • Termination detection. If the maximum iteration is reached, EVMPRL returns the approximate solution set P s . If not, return to step 2.
To be exact, Step 1 corresponds to lines 1–3 of Algorithm 2, Step 2 to lines 5–19, Step 3 to line 20, and Step 4 to lines 4, 21, and 22.
Algorithm 2 Pseudocode of EVMPRL
 1: Set the current solution set P_s to the empty set
 2: Set Q^-values to 0
 3: Set the currently unplaced VM set V_op to V
 4: repeat
 5:     Initialize the current state s
 6:     for t = 1 to m (number of VMs) do
 7:         a = ε-sgreedy(s)   % according to Algorithm 1
 8:         Place VM s on server a   % update the decision variable
 9:         V_op = V_op \ {s}   % the placed VM is removed from V_op
10:         Find the next state s' in the state space
11:         Introduce a new VM i' ∈ V for the state s'
12:         a' = sgreedy(s')   % select the greedy action according to Equation (23)
13:         Calculate the reward r_1 according to Equation (15)
14:         Update Q^_1(s, a) according to Equation (1)
15:         Calculate the reward r_2 according to Equation (16)
16:         Update Q^_2(s, a) according to Equation (1)
17:         s = s'
18:         i = i'
19:     end for
20:     If the currently constructed solution x is not Pareto-dominated by any solution in P_s, add x to P_s; then move all solutions that are Pareto-dominated by x out of P_s
21: until the maximum iteration is reached
22: return P_s
L-EVMPRL is similar to EVMPRL, except that policy ε-sgreedy() in line 7 of Algorithm 2 is replaced by policy ε-Lsgreedy() based on the linear scalarization function, and policy sgreedy() in line 12 of Algorithm 2 is replaced by policy Lsgreedy() based on Equation (19). L-EVMPRL, based on the linear scalarization function, does not consider the selection of weighting coefficients; instead, it sets the weighting coefficients corresponding to the two objectives to 0.5. Other than that, EVMPRL and L-EVMPRL are almost the same.
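For line 20 of Algorithm 2, a minimal sketch of the archive update follows; it reuses the Pareto.dominates() test sketched in Section 3.1, and Solution is a hypothetical holder of one placement's objective vector {T_total, E_total}.

```java
import java.util.Iterator;
import java.util.List;

// Archive update of line 20 in Algorithm 2: keep the newly constructed solution
// only if no archived solution dominates it, then evict every archived solution
// that the new one dominates.
public final class ParetoArchive {
    public record Solution(double[] objectives) {}   // hypothetical: {T_total, E_total}

    public static void update(List<Solution> archive, Solution candidate) {
        for (Solution s : archive) {
            if (Pareto.dominates(s.objectives(), candidate.objectives())) {
                return;                               // candidate is dominated, discard it
            }
        }
        Iterator<Solution> it = archive.iterator();
        while (it.hasNext()) {
            if (Pareto.dominates(candidate.objectives(), it.next().objectives())) {
                it.remove();                          // remove solutions dominated by the candidate
            }
        }
        archive.add(candidate);
    }
}
```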

6. Experimental Results

6.1. Simulation Setup

We implemented all programs using Java. The experimental environment was built on the EdgeCloudSim platform [37].
The values of the parameters appearing in this paragraph are identical in all experiments. This paper sets ε in Algorithm 1 to 0.1, the discount factor γ to 0.8, the learning rate α t to 0.1, and the maximum iteration to 500. We repeated the experiments 10 times, and the mean values are used as the experimental results.
Three types of MEC servers, HP Proliant ML110 G5, HP Proliant DL360 G7, and HP Proliant DL360 Gen9, are evenly distributed in the experimental environment. Table 2 shows the CPU core number, memory size, and CPU processing speed for the above MEC server types [24,38,39]. Similar to [16,33,35], this paper randomly generated VM requests according to the following rules, which ensure that even the server with the fewest resources can execute the VM request with the largest resource demand: the demand of a VM request for CPU cores is randomly selected from {1, 2}, the demand for memory is randomly selected from [1 GB, 4 GB], and the computational demand is randomly selected from [500 MI, 2000 MI]. In this paper, the CPU resource is reflected only in the number of cores, and the processing speed of a VM varies with the type of MEC server to which it is assigned.

6.2. The Quality of the Solution Set

Pareto-based algorithms generally use hypervolume to evaluate the quality of the solution set [40].
Hypervolume is the volume of the region enclosed by the solutions in the solution set and a reference point [41]. In multi-objective maximization problems, each coordinate of the reference point is the minimum value of the corresponding optimization objective minus a tiny constant. Figure 2 illustrates the hypervolume of a maximization problem with two objective functions. For the minimization problem studied in this paper, each coordinate of the reference point is the maximum value of the corresponding objective plus a tiny constant. A larger hypervolume value represents a better-quality solution set.
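For the two-objective case used in this paper, the hypervolume can be computed with a simple sweep over the non-dominated front; the Java sketch below is our own illustration and assumes the reference point is built as described above (the maximum of each objective plus a small constant).

```java
import java.util.Arrays;

// Two-objective hypervolume for a minimization problem: the area dominated by
// the non-dominated front and bounded by the reference point.
// front[k] = {f1, f2}; ref = {r1, r2} with r1 >= every f1 and r2 >= every f2.
public final class Hypervolume {
    public static double compute2d(double[][] front, double[] ref) {
        double[][] sorted = front.clone();
        Arrays.sort(sorted, (a, b) -> Double.compare(a[0], b[0]));   // ascending f1
        double hv = 0.0;
        for (int k = 0; k < sorted.length; k++) {
            double nextF1 = (k + 1 < sorted.length) ? sorted[k + 1][0] : ref[0];
            hv += (nextF1 - sorted[k][0]) * (ref[1] - sorted[k][1]);  // area of one vertical slab
        }
        return hv;
    }
}
```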

6.3. The Selection of Weighting Coefficients

To solve the problem of selecting the weighting coefficients for each objective, we used the Chebyshev scalarization function to design the multi-objective RL algorithm. When using this function, most combinations of weighting coefficients can guarantee the solution set’s quality [13]. Therefore, this paper randomly generated a few combinations of weighting coefficients and selected the one whose hypervolume was the largest.
For instance, when the MEC server number is 80 and the VM request number is 500, eight tuples of weighting coefficients are randomly generated. As shown in Figure 3, the third tuple of weighting coefficients has the largest hypervolume value. Therefore, this tuple of weighting coefficients will be used to conduct subsequent experiments.

6.4. Algorithm Comparison

This subsection compares EVMPRL with MRLVMP [15] and L-EVMPRL based on the linear scalarization function. It investigates the total response latency of all VM requests and the total energy consumption of MEC servers when the VM request number m grows from 100 to 700 in steps of 100, with the number of MEC servers n being 80 and 160, respectively.
Figure 4a,b depict the connection between the total response latency of VM requests and the number of VM requests for MEC server sizes of 80 and 160, respectively. In general, the total response latency of VM requests increases with the number of VM requests.
Figure 4a depicts a case when the MEC server number is 80. When the VM request number is 100, the total response latency calculated by EVMPRL is 1.48 % and 5.51 % lower than that of L-EVMPRL and MRLVMP, respectively; when the VM request number rises to 700, the total response latency calculated by EVMPRL is 4.54 % and 10.31 % lower than that of L-EVMPRL and MRLVMP, respectively.
Figure 4b depicts a case when the MEC server number is 160. When the VM request number is 100, the total response latency calculated by EVMPRL is 1.44 % and 5.56 % lower than that of L-EVMPRL and MRLVMP, respectively; when the VM request number rises to 700, the total response latency calculated by EVMPRL is 4.69 % and 13.83 % lower than that of L-EVMPRL and MRLVMP, respectively.
From the figures, it can be seen that EVMPRL outperforms L-EVMPRL and MRLVMP in the metric of the total response latency of the VM requests. This is because MRLVMP and L-EVMPRL set the weighting coefficients of the two objectives to equal values without investigating the problem of weighting coefficient selection. Thus, the solution set quality of EVMPRL is better than that of the other two algorithms. L-EVMPRL outperforms MRLVMP because L-EVMPRL computes the reward accurately from the total response latency before and after the placement of the current VM request, whereas MRLVMP restricts the reward to −1, 0, or 1. Therefore, L-EVMPRL can more accurately represent the changes in total response latency before and after placement. In addition, MRLVMP scalarizes the two objective functions without considering the order-of-magnitude difference between the two objectives. In contrast, EVMPRL and L-EVMPRL avoid this problem by scalarizing the $\hat{Q}$-values.
Comparing Figure 4a,b, we can see that when the VM request number is less than or equal to 400, the total response latency of the three algorithms is almost the same when the MEC server number is different. This is because the server number of each type in both cases is sufficient to satisfy all the current VM requests; when the VM request number is greater than 500, the total response latency when the number of servers is 80 is significantly greater than that when the number of servers is 160. This is because the increase in the MEC server number in Figure 4b leads to an increase in the optional server types for VM requests, which brings about a decrease in response latency.
Figure 5a,b depict the relationship between the total energy consumption of the MEC servers and the number of VM requests for 80 and 160 MEC servers, respectively. In general, the total energy consumption of MEC servers increases with the number of VM requests.
Figure 5a depicts the case when the MEC server number is 80. When the VM request number is 100, the total energy consumption calculated by EVMPRL is 4.01 % and 19.35 % lower than that of L-EVMPRL and MRLVMP, respectively; when the VM request number rises to 700, the total energy consumption calculated by EVMPRL is 4.43 % and 6.89 % lower than that of L-EVMPRL and MRLVMP, respectively.
Figure 5b depicts a case when the MEC server number is 160. When the VM request number is 100, the total energy consumption calculated by EVMPRL is 3.96 % and 19.8 % lower than that of L-EVMPRL and MRLVMP, respectively; when the VM request number rises to 700, the total energy consumption calculated by EVMPRL is 11.49 % and 12.58 % lower than that of L-EVMPRL and MRLVMP, respectively.
From the figures, it can be seen that EVMPRL outperforms MRLVMP and L-EVMPRL in the metric of total energy consumption. This is because MRLVMP and L-EVMPRL set the weighting coefficients of the two objectives to equal values without investigating the problem of weighting coefficient selection. Thus, the solution set quality of EVMPRL is better than that of the other two algorithms. L-EVMPRL outperforms MRLVMP because L-EVMPRL computes the reward accurately from the total energy consumption before and after the placement of the current VM request, whereas MRLVMP restricts the reward to −1, 0, or 1. Therefore, L-EVMPRL can more accurately represent the changes in total energy consumption before and after placement. In addition, MRLVMP scalarizes the two objective functions without considering the order-of-magnitude difference between the two objectives. In contrast, EVMPRL and L-EVMPRL avoid this problem by scalarizing the $\hat{Q}$-values.
Comparing Figure 5a,b, we can see that when the VM request number is less than or equal to 400, the total server energy consumption of the three algorithms using different numbers of servers is almost the same. This is because the number of servers of each type in both cases is sufficient to satisfy all the current VM requests; when the VM request number reaches 500, the total energy consumption of servers when the number of servers is 80 is greater than that when the number of servers is 160. This is because the increase in the choice of server types in Figure 5b brings about a reduction in energy consumption.
Figure 6 depicts the solution set quality of algorithms when the number of VM requests m grows from 100 to 700 in steps of 100 when the number of MEC servers is 80. EVMPRL is significantly better than L-EVMPRL and MRLVMP in this quality metric of hypervolume. The main reason for this is the superiority of the Chebyshev scalarization function in selecting weighting coefficients.

6.5. Convergence Analysis

With a suitable learning rate, a Q-learning algorithm will converge after visiting states and actions an infinite number of times [30]. The multi-objective algorithm usually determines whether the algorithm converges or not based on the hypervolume [42]. Figure 7 shows the variation in the hypervolume obtained by EVMPRL with iterations when the server number is 80 and the VM request number is 500. The hypervolume converges gradually with the increase in iterations.

6.6. Scalability Analysis

In this paper, we verify the scalability of EVMPRL based on the methodology in [21]. Figure 8 shows the running time of EVMPRL using different numbers of VM requests when the number of MEC servers is 160.
When the VM request number is 1000, the running time is 106 s. When the VM request number is 1800, the running time increases to 274 s. The results show that even when the number of VM requests is as high as 1800, EVMPRL runs for no more than 5 min.
Although the running time increases with an increase in the VM request scale, EVMPRL still obtains good convergence. This is because EVMPRL adopts the following methodologies:
  • Based on the reward functions for response latency and energy consumption, EVMPRL prefers servers that produce the lowest possible response latency and energy consumption for the current VM request. At the cost of narrowing the search space, the response latency and energy consumption can approximate to relatively optimal values.
  • When selecting an action, the scalarized strategy can remove low-quality solutions. This evaluates the distance from the current solution to the utopian point, and prefers the action with the smallest SQ-value to accelerate algorithm convergence.
  • To avoid falling into local optima, we balance exploration and exploitation by using the scalarized ε-sgreedy policy.
Therefore, EVMPRL is appropriate for scenarios with large-scale VM requests.

7. Conclusions

To reduce the response latency of VM requests and the energy consumption of MEC servers in edge computing, we investigated a multi-objective VM placement problem. Existing edge-computing VM placement algorithms do not consider the problem of selecting the weighting coefficients of each objective or the feasible domain when selecting actions. Previous work also does not consider the order-of-magnitude difference between the optimization objectives, which can make the impact of an objective function on the final result too small.
This paper proposes an algorithm called EVMPRL based on the Chebyshev scalarization function to solve the above problems. EVMPRL can always search for solutions in the feasible domain; this is guaranteed through the selection of servers that can satisfy the current VM request as the next action. Thus, the extra operations required in the previous work are avoided. EVMPRL also scalarizes the Q-value instead of the objective function, which addresses the problems arising from order-of-magnitude differences. The experimental results demonstrate that EVMPRL is more capable of achieving a trade-off between the response latency and energy consumption while obtaining higher-quality solution sets than MRLVMP and L-EVMPRL. The results also demonstrate EVMPRL’s convergence and scalability.
This paper only addresses the problem in a static environment and does not consider metrics such as resource utilization. In future work, we will investigate VM placement problems in dynamically changing environments and will consider as many of the metrics involved in such environments as possible.

Author Contributions

Conceptualization, H.W.; data curation, S.Y. and Y.Q.; formal analysis, S.Y. and Y.Q.; funding acquisition, H.W.; investigation, S.H. and Y.Q.; methodology, S.Y. and Y.Q.; supervision, H.W.; validation, S.Y. and S.H.; visualization, S.Y. and Y.Q.; writing—original draft, S.Y.; writing—review and editing, S.Y., S.H., Y.Q. and N.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by Shandong Provincial Natural Science Foundation of China (grant No. ZR2023MF090), Key Research and Development Program of Shandong Province of China (grant No. 2024CXGC010109), and the Introduction and Cultivation Program for Young Innovative Talents of Universities in Shandong Province of China (grant No. 2021QCYY003).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

We would like to thank the editor and reviewers for their helpful suggestions to improve this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pütz, P.; Mitev, R.; Miettinen, M.; Sadeghi, A.R. Unleashing IoT Security: Assessing the Effectiveness of Best Practices in Protecting Against Threats. In Proceedings of the ACSAC ’23: 39th Annual Computer Security Applications Conference, New York, NY, USA, 4–8 December 2023; pp. 190–204. [Google Scholar]
  2. Sharma, M.; Tomar, A.; Hazra, A. Edge computing for industry 5.0: Fundamental, applications and research challenges. IEEE Internet Things J. 2024, 11, 19070–19093. [Google Scholar] [CrossRef]
  3. Chen, Z.; Cheng, Z.; Luo, W.; Ao, J.; Liu, Y.; Sheng, K.; Chen, L. FSMFA: Efficient firmware-secure multi-factor authentication protocol for IoT devices. Internet Things 2023, 21, 100685. [Google Scholar] [CrossRef]
  4. Yang, J.; Shah, A.A.; Pezaros, D. A Survey of Energy Optimization Approaches for Computational Task Offloading and Resource Allocation in MEC Networks. Electronics 2023, 12, 3548. [Google Scholar] [CrossRef]
  5. Alashaikh, A.; Alanazi, E.; Al-Fuqaha, A. A survey on the use of preferences for virtual machine placement in cloud data centers. ACM Comput. Surv. (CSUR) 2021, 54, 1–39. [Google Scholar] [CrossRef]
  6. Silva Filho, M.C.; Monteiro, C.C.; Inácio, P.R.; Freire, M.M. Approaches for optimizing virtual machine placement and migration in cloud environments: A survey. J. Parallel Distrib. Comput. 2018, 111, 222–250. [Google Scholar] [CrossRef]
  7. Sun, Z.; Yang, H.; Li, C.; Yao, Q.; Teng, Y.; Zhang, J.; Liu, S.; Li, Y.; Vasilakos, A.V. A resource allocation scheme for edge computing network in smart city based on attention mechanism. ACM Trans. Sens. Netw. 2024. [Google Scholar] [CrossRef]
  8. Al-Hammadi, I.; Li, M.; Islam, S.M.; Al-Mosharea, E. Collaborative computation offloading for scheduling emergency tasks in SDN-based mobile edge computing networks. Comput. Netw. 2024, 238, 110101. [Google Scholar] [CrossRef]
  9. Lee, K.; Lee, S.; Lee, J. Interactive character animation by learning multi-objective control. ACM Trans. Graph. (TOG) 2018, 37, 1–10. [Google Scholar] [CrossRef]
  10. Das, I.; Dennis, J.E. A closer look at drawbacks of minimizing weighted sums of objectives for Pareto set generation in multicriteria optimization problems. Struct. Optim. 1997, 14, 63–69. [Google Scholar] [CrossRef]
  11. Qin, Y.; Wang, H.; Yi, S.; Li, X.; Zhai, L. A multi-objective reinforcement learning algorithm for deadline constrained scientific workflow scheduling in clouds. Front. Comput. Sci. 2021, 15, 155105. [Google Scholar] [CrossRef]
  12. Yi, S.; Li, X.; Wang, H.; Qin, Y.; Yan, J. Energy-aware disaster backup among cloud datacenters using multiobjective reinforcement learning in software defined network. Concurr. Comput. Pract. Exp. 2022, 34, e6588. [Google Scholar] [CrossRef]
  13. Van Moffaert, K.; Drugan, M.M.; Nowé, A. Scalarized multi-objective reinforcement learning: Novel design techniques. In Proceedings of the 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), Singapore, 16–19 April 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 191–199. [Google Scholar]
  14. Jian, C.; Bao, L.; Zhang, M. A high-efficiency learning model for virtual machine placement in mobile edge computing. Clust. Comput. 2022, 25, 3051–3066. [Google Scholar] [CrossRef]
  15. Xu, H.; Jian, C. A meta reinforcement learning-based virtual machine placement algorithm in mobile edge computing. Clust. Comput. 2024, 27, 1883–1896. [Google Scholar] [CrossRef]
  16. Zhang, L.; Zhuge, S.; Wang, Y.; Xu, H.; Sun, E. Energy-delay tradeoff for virtual machine placement in virtualized multi-access edge computing: A two-sided matching approach. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 6603–6621. [Google Scholar] [CrossRef]
  17. Jia, M.; Cao, J.; Liang, W. Optimal cloudlet placement and user to cloudlet allocation in wireless metropolitan area networks. IEEE Trans. Cloud Comput. 2015, 5, 725–737. [Google Scholar] [CrossRef]
  18. Li, Y.; Wang, S. An Energy-Aware Edge Server Placement Algorithm in Mobile Edge Computing. In Proceedings of the 2018 IEEE International Conference on Edge Computing (EDGE), San Francisco, CA, USA, 2–7 July 2018; pp. 66–73. [Google Scholar] [CrossRef]
  19. Gábor, Z.; Kalmár, Z.; Szepesvári, C. Multi-criteria reinforcement learning. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML 1998), Madison, Wisconsin, WI, USA, 24–27 July 1998; Volume 98, pp. 197–205. [Google Scholar]
  20. Kong, Y.; He, Y.; Abnoosian, K. Nature-inspired virtual machine placement mechanisms: A systematic review. Concurr. Comput. Pract. Exp. 2022, 34, e6900. [Google Scholar] [CrossRef]
  21. Gao, Y.; Guan, H.; Qi, Z.; Hou, Y.; Liu, L. A multi-objective ant colony system algorithm for virtual machine placement in cloud computing. J. Comput. Syst. Sci. 2013, 79, 1230–1242. [Google Scholar] [CrossRef]
  22. Shaw, R.; Howley, E.; Barrett, E. An energy efficient anti-correlated virtual machine placement algorithm using resource usage predictions. Simul. Model. Pract. Theory 2019, 93, 322–342. [Google Scholar] [CrossRef]
  23. Li, Z.; Li, Y.; Yuan, T.; Chen, S.; Jiang, S. Chemical reaction optimization for virtual machine placement in cloud computing. Appl. Intell. 2019, 49, 220–232. [Google Scholar] [CrossRef]
  24. Zhao, D.; Zhou, J. An energy and carbon-aware algorithm for renewable energy usage maximization in distributed cloud data centers. J. Parallel Distrib. Comput. 2022, 165, 156–166. [Google Scholar] [CrossRef]
  25. Yang, C.; Xia, Y. Interval Pareto front-based multi-objective robust optimization for sensor placement in structural modal identification. Reliab. Eng. Syst. Saf. 2024, 242, 109703. [Google Scholar] [CrossRef]
  26. Yan, J.; Wang, H.; Li, X.; Yi, S.; Qin, Y. Multi-objective disaster backup in inter-datacenter using reinforcement learning. In Proceedings of the Wireless Algorithms, Systems, and Applications: 15th International Conference, WASA 2020, Qingdao, China, 13–15 September 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 590–601, Proceedings, Part I 15. [Google Scholar]
  27. Verma, A.; Kaushal, S. A hybrid multi-objective particle swarm optimization for scientific workflow scheduling. Parallel Comput. 2017, 62, 1–19. [Google Scholar] [CrossRef]
  28. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  29. Watkins, C.J.C.H. Learning from Delayed Rewards. Ph.D. Thesis, University of Cambridge, Cambridge, UK, 1989. [Google Scholar]
  30. Tsitsiklis, J.N. Asynchronous stochastic approximation and Q-learning. Mach. Learn. 1994, 16, 185–202. [Google Scholar] [CrossRef]
  31. Wiering, M.A.; De Jong, E.D. Computing optimal stationary policies for multi-objective markov decision processes. In Proceedings of the 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, Honolulu, HI, USA, 1–5 April 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 158–165. [Google Scholar]
  32. Hua, H.; Li, Y.; Wang, T.; Dong, N.; Li, W.; Cao, J. Edge computing with artificial intelligence: A machine learning perspective. ACM Comput. Surv. 2023, 55, 1–35. [Google Scholar] [CrossRef]
  33. Zhang, L.; Zhang, H.; Yu, L.; Xu, H.; Song, L.; Han, Z. Virtual resource allocation for mobile edge computing: A hypergraph matching approach. In Proceedings of the 2019 IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, 9–13 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar]
  34. Fan, X.; Weber, W.D.; Barroso, L.A. Power provisioning for a warehouse-sized computer. ACM SIGARCH Comput. Archit. News 2007, 35, 13–23. [Google Scholar] [CrossRef]
  35. Liu, X.F.; Zhan, Z.H.; Deng, J.D.; Li, Y.; Gu, T.; Zhang, J. An energy efficient ant colony system for virtual machine placement in cloud computing. IEEE Trans. Evol. Comput. 2016, 22, 113–128. [Google Scholar] [CrossRef]
  36. Voß, T.; Beume, N.; Rudolph, G.; Igel, C. Scalarization versus indicator-based selection in multi-objective CMA evolution strategies. In Proceedings of the 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–6 June 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 3036–3043. [Google Scholar]
  37. Sonmez, C.; Ozgovde, A.; Ersoy, C. Edgecloudsim: An environment for performance evaluation of edge computing systems. Trans. Emerg. Telecommun. Technol. 2018, 29, e3493. [Google Scholar] [CrossRef]
  38. Ding, W.; Luo, F.; Han, L.; Gu, C.; Lu, H.; Fuentes, J. Adaptive virtual machine consolidation framework based on performance-to-power ratio in cloud data centers. Future Gener. Comput. Syst. 2020, 111, 254–270. [Google Scholar] [CrossRef]
  39. Tandon, A.; Patel, S. DBSCAN based approach for energy efficient VM placement using medium level CPU utilization. Sustain. Comput. Inform. Syst. 2024, 43, 101025. [Google Scholar] [CrossRef]
  40. Wang, J.; Wang, N.; Wang, H.; Cao, K.; El-Sherbeeny, A.M. GCP: A multi-strategy improved wireless sensor network model for environmental monitoring. Comput. Netw. 2024, 254, 110807. [Google Scholar] [CrossRef]
  41. Al Mindeel, T.; Spentzou, E.; Eftekhari, M. Energy, thermal comfort, and indoor air quality: Multi-objective optimization review. Renew. Sustain. Energy Rev. 2024, 202, 114682. [Google Scholar] [CrossRef]
  42. Qin, Y.; Wang, H.; Yi, S.; Li, X.; Zhai, L. An energy-aware scheduling algorithm for budget-constrained scientific workflows based on multi-objective reinforcement learning. J. Supercomput. 2020, 76, 455–480. [Google Scholar] [CrossRef]
Figure 1. The three-tier architecture for MEC.
Figure 2. The hypervolume of a maximization problem with two objective functions.
Figure 3. The hypervolume of the eight tuples of weighting coefficients when the MEC server number is 80 and the VM request number is 500.
Figure 4. The relationship between the response latency of all VM requests and the VM request number for different numbers of MEC servers.
Figure 5. The relationship between the energy consumption of all servers and the VM request number for different numbers of MEC servers.
Figure 6. The connection between hypervolume and the VM request number when the MEC server number is 80.
Figure 7. Variation in hypervolume with iterations when the MEC server number is 80 and the VM request number is 500.
Figure 8. The running time of EVMPRL when the MEC server number is 160.
Table 1. Summary and comparison of representative work on VM placement.

Reference | Cloud/Edge | Framework | Optimization Objectives
Gao et al. [21] | Cloud | ACO | energy consumption, resource wastage
Shaw et al. [22] | Cloud | ANN | energy consumption, SLA violation rate
Li et al. [23] | Cloud | CRO | energy consumption, resource utilization
Zhao et al. [24] | Cloud | Heuristic | share of energy generated by renewable energy sources, carbon emissions
Jian et al. [14] | Edge | BA | power consumption, placement delay
Xu et al. [15] | Edge | RL | energy consumption, response latency
Zhang et al. [16] | Edge | Two-sided matching | energy consumption, communication delay
Jia et al. [17] | Edge | Heuristic | average waiting time
Li et al. [18] | Edge | PSO | energy consumption, resource utilization
Table 2. The resources and CPU processing speed of the MEC server types.

MEC Server Type | Number of Cores | Memory Size (GB) | CPU Processing Speed (MIPS)
HP Proliant ML110 G5 | 2 | 4 | 2660
HP Proliant DL360 G7 | 12 | 16 | 3067
HP Proliant DL360 Gen9 | 36 | 64 | 2300
