Introduction

Cloud computing has enabled the emergence of several service-oriented resources, such as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS) [27]. The development and use of such cloud-based services have resulted in an increased number of users and larger volumes of data produced by different devices and applications. Several corporations and institutions have shown an interest in cloud computing, and as a result many cloud computing platforms have been proposed. Google, Amazon, Microsoft, and Facebook are examples of companies that are investing heavily in cloud computing services [30, 37].

Cloud computing has grown rapidly, and has gained popularity because it offers several benefits including on-demand self-service, virtualization, geographic distribution, and resilience [3]. These benefits are particularly attractive because they can offer flexibility guarantees for customer service constraints such as downtime and cost, which are negotiated by cloud providers through Quality of Service (QoS) guarantees. However, providing cloud services according to the customer needs and specific constraints remains a challenge. Parameters such as reliability, capacity-oriented availability and cost are relevant factors in the negotiation of such services [6, 10]. Therefore, an efficient and accurate assessment of cloud infrastructures considering availability, reliability and cost requirements is fundamental in allowing customers to identify a cloud infrastructure that suits their needs and preferences.

To provide uninterrupted cloud services, cloud managers must evaluate and improve dependability aspects of cloud infrastructures, such as availability and transaction loss; this is because users require a reasonable level of confidence in such infrastructures to efficiently plan and operate their business [19]. Some services may be considered mission-critical, and depending on the number of data operations involved, it may be essential to deploy redundancy strategies [24]. Such strategies can lead to avoidance of outage due to issues such as database deadlock, data loss, or network failure. Cloud outages can cause significant financial losses to an organization, and in extreme cases may result in the failure of the business [4].

In this context, dependability models like Reliability Block Diagrams (RBDs) and Stochastic Petri Nets (SPNs) can be useful when comparing cloud infrastructures [7, 28]. Cloud infrastructures differ from one another in many aspects, and this results in a significant challenge for cloud users attempting to identify the infrastructure that best suits their needs [34, 44]. There are always trade-offs when considering different cloud alternatives; for example, robust cloud infrastructures may incur unnecessary costs to guard against events that are very unlikely to happen, while simple infrastructures may result in the loss of critical data. Multiple-criteria decision-making (MCDM) methods, which consist of techniques to solve such multi-criteria problems, are essential because they can assist cloud users in choosing the best cloud infrastructure, and can take into account multiple criteria like capacity-oriented availability, reliability, downtime, or cost.

MCDM methods are designed to analyze and give recommendations on situations involving a large number of alternatives and conflicting criteria. In [33], the authors presented a case study comparing different MCDM methods for selecting IaaS services. Garg et al. [14] proposed a framework based on an Analytic Hierarchy Process (AHP) to rank cloud providers. In [23], the authors presented a multi-attribute group decision-making (MAGDM) approach for selecting cloud providers. In contrast to these works, we present an approach based on an MCDM method and stochastic models to evaluate, rank and find a set of optimal cloud environments considering dependability (e.g., capacity-oriented availability and reliability) and cost requirements.

The process of choosing a cloud infrastructure can be slow, tedious and costly; it can also generate conflicts of interest when a set of alternatives is considered. In recognition of the importance of making an appropriate cloud infrastructure choice, we propose a novel approach that implements a Multiple-Criteria Decision-Making (MCDM) method to rank the candidate infrastructures, and takes customer service constraints such as dependability and cost into consideration. Although there are several methods for multi-criteria decision-making, we adopted the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) method [11] due to its simplicity and ease of application. The results allow cloud customers to identify and choose the cloud infrastructure that best suits their needs, in a fast and efficient manner. Specifically, our contributions are:

  • We design and implement a strategy that combines a decision-making method with the capacity of stochastic models to obtain dependability-related metrics (like reliability and capacity-oriented availability).

  • Our modeling strategy is based on hierarchical and heterogeneous modeling for planning cloud infrastructures, which allows the evaluation of cloud infrastructures with complex redundant mechanisms and maintenance policies.

  • We developed a tool (called MiPACE) to support the planning of cloud infrastructures, which considers customer service constraints and assists in the decision-making process.

  • We demonstrate the feasibility of our approach by presenting real case scenarios and identifying a set of ideal cloud infrastructures.

The remaining sections are organized as follows. “Background” section describes some general concepts. “Related work” section presents the related work. “Adopted strategy and base cloud architecture” section shows the proposed approach for ranking cloud infrastructures according to customers’ needs, and details the base cloud architecture adopted by this paper. “Hierarchical models and cost equations” section illustrates the hierarchical modeling process and cost equations. “MiPACE: a multi-criteria tool for planning and analysis of cloud environments” section presents the developed tool used to support the decision-making process. “Results and discussion” section illustrates the proposed approach through a real-world case study. “Final remarks” section concludes the paper and presents future directions.

Background

This section introduces fundamental concepts of multiple-criteria decision-making, dependability modeling, stochastic Petri nets, and reliability block diagrams.

Multiple-criteria decision-making

In our daily lives, we do not consider just one criterion when making a decision, but rather compare and evaluate more than one alternative simultaneously. When purchasing a cloud service, for example, security, processing power, networking throughput, and storage capacity may all be considered as main criteria. It would be unusual for the cheapest cloud service to have the highest reliability and unlimited storage, and it is necessary to evaluate all potential impacts when making decisions that involve long-term commitment and budget allocation. Thus, companies must consider multiple criteria when determining the best cost-benefit ratio. Multiple-Criteria Decision-Making (MCDM) methods have been developed to support the decision-making process in problems that exhibit multiple conflicting criteria, and thus provide techniques for finding a set of optimal solutions.

A large number of MCDM techniques have been proposed, each with different perspectives and theories. Some techniques are used to solve ranking problems, such as the Analytic Hierarchy Process (AHP), Analytic Network Process (ANP), Elimination and Choice Expressing Reality (ELECTRE III), and Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) approaches [11]. Other approaches combine monitoring and historical data with ranking techniques for decision making [15]. In this paper, the TOPSIS method has been adopted for ranking cloud infrastructures. TOPSIS is a useful technique for ranking and selecting a number of externally determined alternatives, using distance measures such as Euclidean, Manhattan and Minkowski [39]. These distance measures order alternative solutions from the best to the worst by means of scores or pairwise comparisons. The method comprises five steps. The first step groups the set of alternative solutions, taking the defined criteria into account. In the second step, the values representing each criterion are normalized; this allows all criteria to be treated in a similar way, independent of the metric adopted. In the third step, the criteria can be weighted and the distances from each solution to an ideal point (optimal solution) and an anti-ideal point are calculated. In the fourth step, the relative closeness to the ideal solution is calculated. In the last step, the set of alternatives is ranked according to the relative closeness [18].
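The five steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this paper: the function name `topsis`, the vector normalization, the choice of the Euclidean distance, and the three sample alternatives (availability, reliability, monthly cost) are all assumptions made for the example.

```python
import math

def topsis(matrix, weights, maximize):
    """Illustrative TOPSIS sketch (Euclidean distance).

    matrix   : one row per alternative, one column per criterion (step 1)
    weights  : one weight per criterion
    maximize : True if the criterion should be maximized, False if minimized
    Assumes no criterion column is entirely zero.
    """
    n_crit = len(weights)
    # Step 2: vector-normalize each criterion column so units do not matter.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    # Step 3: apply the weights to the normalized values.
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
                for row in matrix]
    # Ideal and anti-ideal points, chosen per criterion direction.
    columns = list(zip(*weighted))
    ideal = [max(c) if maximize[j] else min(c) for j, c in enumerate(columns)]
    anti = [min(c) if maximize[j] else max(c) for j, c in enumerate(columns)]

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Step 4: relative closeness to the ideal solution (1 is best, 0 is worst).
    scores = []
    for row in weighted:
        d_ideal, d_anti = euclidean(row, ideal), euclidean(row, anti)
        total = d_ideal + d_anti
        scores.append(d_anti / total if total > 0 else 0.0)
    # Step 5: indices of the alternatives, best first.
    ranking = sorted(range(len(matrix)), key=lambda i: scores[i], reverse=True)
    return scores, ranking

# Hypothetical alternatives: (availability, reliability, monthly cost).
alternatives = [
    [0.9990, 0.95, 3000.0],
    [0.9950, 0.93, 1500.0],
    [0.9900, 0.90, 1000.0],
]
scores, ranking = topsis(alternatives, [0.4, 0.3, 0.3], [True, True, False])
print(ranking, [round(s, 3) for s in scores])
```

Changing the weights or the direction flags changes the ideal and anti-ideal points, and therefore the ranking, which is why the decision maker's preferences must be fixed before the distances are computed.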

Dependability modeling and evaluation

The dependability [21] of a system is defined as its justifiably trusted ability to deliver a set of services. Dependability requirements encompass the concepts of reliability, availability, maintainability, performability, and testability. This paper focuses on availability and reliability modeling, and analysis of cloud infrastructures. Availability is the probability that the system is working (even if not at its full capacity) over time, whereas reliability is the probability that the system will deliver a set of services over a given period of time [21, 25]. The steady state availability (A) may be calculated by Eq. 1:

$$ A = \frac{MTTF}{MTTF + MTTR} $$
(1)

where MTTF and MTTR denote the mean time to failure and mean time to repair, respectively.
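As a small numeric illustration of Eq. 1, the sketch below computes the steady-state availability from an MTTF and an MTTR. The helper names and the sample values are hypothetical, and the conversion to annual downtime (8760 hours per year) is added only to show how an availability figure relates to the downtime criterion mentioned earlier.

```python
def steady_state_availability(mttf, mttr):
    """Eq. 1: A = MTTF / (MTTF + MTTR); both arguments in the same time unit."""
    return mttf / (mttf + mttr)

def annual_downtime_hours(availability, hours_per_year=8760.0):
    """Expected hours of unavailability per year for a given availability."""
    return (1.0 - availability) * hours_per_year

# Hypothetical component: fails on average every 999 h, repaired in 1 h.
a = steady_state_availability(999.0, 1.0)
print(a, annual_downtime_hours(a))
```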

For any given time period represented by the interval (0,t), R(t) is the probability that the component has continued to function (i.e. has not failed) from 0 until t. When an exponentially distributed time to failure (TTF) is considered, reliability is represented by

$$ R(t)= \exp \left[ -\int_{0}^{t} \lambda \left(t^{'}\right)dt^{'} \right] $$
(2)

where \(\lambda (t^{'})\) is the instantaneous failure rate.
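Eq. 2 can be evaluated numerically for any failure-rate function. The sketch below is an illustration under stated assumptions: the function name `reliability` and the trapezoidal integration are choices made for the example. With a constant failure rate λ (the exponential case), the integral is λt and the result reduces to exp(-λt).

```python
import math

def reliability(t, hazard, steps=1000):
    """Eq. 2: R(t) = exp(-integral from 0 to t of hazard(u) du).

    The integral is approximated with the trapezoidal rule over `steps`
    equal sub-intervals; `hazard` is the instantaneous failure rate.
    """
    h = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for k in range(1, steps):
        total += hazard(k * h)
    return math.exp(-total * h)

# Hypothetical constant failure rate: MTTF of 1000 h, mission time of 100 h.
lam = 1.0 / 1000.0
print(reliability(100.0, lambda u: lam))  # equals exp(-lam * 100) up to rounding
```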

We also adopt the Capacity-Oriented Availability (COA) as defined in [25, 42]. COA takes into account how much of the service a system is actually delivering; it therefore considers not only whether the system is available or unavailable, but also the impact of each condition on service delivery. The COA calculation considers \(pc_{i}\) as the operational processing capacity, or the amount of resource available, in state \(s_{i}\); \(\pi_{i}\) is the probability of being in state \(s_{i} \in S\), where S is the set of reachable states; and N is the maximum capacity of the system. Thus, we can calculate the COA by Eq. 3.

$$ COA = \frac{\sum_{s_{i}\in S}~pc_{i} \times \pi_{i} }{N} $$
(3)
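The following sketch evaluates Eq. 3 for a small, hypothetical state space. The function name and the state probabilities are assumptions chosen for illustration; in practice the probabilities would come from the steady-state solution of a state-space model.

```python
def capacity_oriented_availability(states, max_capacity):
    """Eq. 3: COA = sum(pc_i * pi_i) / N.

    states       : iterable of (pc_i, pi_i) pairs, i.e. the capacity delivered
                   in state s_i and the probability of being in that state
    max_capacity : N, the maximum capacity of the system
    """
    return sum(pc * pi for pc, pi in states) / max_capacity

# Hypothetical system with N = 2 VMs: both up, one up, none up.
states = [(2, 0.980), (1, 0.015), (0, 0.005)]
coa = capacity_oriented_availability(states, max_capacity=2)
# Plain availability only asks whether any capacity is delivered at all.
availability = sum(pi for pc, pi in states if pc > 0)
print(coa, availability)
```

In this example the plain availability (0.995) is higher than the COA (0.9875), because the state with a single working VM counts as fully available but delivers only half of the capacity.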

Redundancy techniques

In several application domains, different techniques have been adopted to increase the dependability of systems. These techniques are traditionally classified into four groups: fault prevention, fault removal, fault forecasting, and fault tolerance [38]. Unlike other techniques, fault tolerance (redundancy) aims to provide correct service delivery even in the presence of faults. Redundancy refers to extra resources that are not necessary for the execution of the faultless task, but must be applied if faults occur to guarantee the service delivery.

The redundancy techniques for fault tolerance include active-standby and active-active redundancy [5]. In an active-active redundancy mechanism, both the main elements (e.g., resource and service) and the redundant elements are permanently active. The users do not perceive the occurrence of faults, nor does performance degradation take place. In contrast, active-standby mechanisms are characterized by fault detection followed by recovery actions, which require extra processing time. This type of strategy uses two component types: active, and standby. The active module usually provides the service for all environments; if the active module fails, the standby component assumes control. Standby modules are classified as hot, warm, or cold, depending on the level of service restoration [24].

Stochastic Petri net

Petri nets are very well suited for modeling several system types, because concurrency, synchronization, communication mechanisms, and deterministic and probabilistic delays are naturally represented. In general, a Petri net is a bipartite directed graph in which places (represented by circles) denote local states, and transitions (depicted as rectangles) represent actions. Arcs (directed edges) connect places to transitions, and vice versa. The original Petri net does not have the notion of time for analyzing performance and dependability; the introduction of event durations results in a timed Petri net.

Stochastic Petri nets (SPNs) [26] are a special type of timed Petri net that allows probabilistic delays to be associated with transitions by using the exponential distribution. An SPN is a high-level model from which a Continuous Time Markov Chain (CTMC) can be generated and evaluated automatically [42]. This characteristic is particularly useful when the system’s state space is large and/or the interactions between system components are complex. SPNs may also be evaluated through simulation, which may be the alternative when non-phase-type distributions [42] are required and/or the system state space is infinite.

Reliability block diagram

A Reliability Block Diagram (RBD) [12] is a combinatorial model, initially proposed as a technique for calculating the reliability of a system by using block diagrams. The technique has also been extended to calculate other dependability metrics, such as availability [21, 24, 25]. An RBD may be the model of choice for computing availability- and reliability-related metrics for passive redundancy mechanisms and/or systems with independent components [25]. In RBDs, models are usually obtained by serial and parallel composition of components and subsystems.

In a series arrangement, the whole system is no longer operational if a single component fails. This means that all components must be operational for the serial system to succeed. If a system with n independent components is considered, the reliability (instantaneous availability or steady-state availability) is obtained as the product of the components’ reliabilities (instantaneous availabilities or steady-state availabilities). In a parallel arrangement, the whole system is considered operational even if only a single component is operational, because there are a total of n possible success paths. For a system with n independent components, the unreliability (instantaneous unavailability or steady-state unavailability) is obtained as the product of the components’ unreliabilities (instantaneous unavailabilities or steady-state unavailabilities). k-out-of-n redundancy may also be represented by RBDs. k-out-of-n RBD models allow more general compositions than simple series or parallel configurations to be represented. In fact, simple series and parallel configurations are special cases of k-out-of-n compositions [24, 25, 31, 42].
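The composition rules above can be written down directly. The sketch below is illustrative only: the function names are hypothetical, components are assumed independent, and the k-out-of-n helper additionally assumes that all n components share the same reliability p.

```python
from math import comb, prod

def series(ps):
    """All components must work: product of the component reliabilities."""
    return prod(ps)

def parallel(ps):
    """At least one component must work: 1 - product of the unreliabilities."""
    return 1.0 - prod(1.0 - p for p in ps)

def k_out_of_n(k, n, p):
    """At least k of n identical, independent components must work."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k, n + 1))

p = 0.9  # hypothetical component reliability
print(series([p, p]), parallel([p, p]))
# Series is the n-out-of-n case and parallel is the 1-out-of-n case.
print(k_out_of_n(2, 2, p), k_out_of_n(1, 2, p))
```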

Failure critical index

In general, component importance ranking indicates the impact of a particular component on the overall system reliability. Based on certain system characteristics, various measures are calculated to estimate the component importance, and these often relate to the contribution of a component to system failure. Birnbaum introduced this concept, which can be considered one of the most widely used reliability importance indices [21]. The Birnbaum importance of a component i is equal to the degree of improvement in system reliability when the reliability of the component is increased by one unit [21]. In other words, the reliability importance (RI) is the partial derivative of system reliability with respect to the reliability of each individual component [17].

The RI of component i can be computed as

$$ I^{B}_{i} = R_{s}\left(1_{i},\textbf{p}^{i}\right) - R_{s}\left(0_{i},\textbf{p}^{i}\right) $$
(4)

where \(I^{B}_{i}\) is the reliability importance of component i, \(\textbf{p}^{i}\) is the component reliability vector with the ith component removed, \(0_{i}\) represents the condition when component i fails, and \(1_{i}\) describes the condition when i is working.

\(I^{B}_{i}\) depends on the structure of the system and the reliability of the other components. The RI of a component i is determined by the reliability of the other components, excluding i [32].
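Eq. 4 can be evaluated by computing the system reliability twice, once with component i forced to work and once with it forced to fail. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the example structure (one component in series with two redundant components) and its reliabilities are made up for the example.

```python
def birnbaum_importance(system_reliability, p, i):
    """Eq. 4: I_i = R_s(1_i, p^i) - R_s(0_i, p^i).

    system_reliability : function mapping a list of component reliabilities
                         to the system reliability
    p                  : list of component reliabilities
    i                  : index of the component of interest
    """
    working = list(p)
    working[i] = 1.0   # component i is working
    failed = list(p)
    failed[i] = 0.0    # component i has failed
    return system_reliability(working) - system_reliability(failed)

def example_system(p):
    # Hypothetical structure: component 0 in series with (1 parallel 2).
    return p[0] * (1.0 - (1.0 - p[1]) * (1.0 - p[2]))

p = [0.99, 0.90, 0.90]
print([birnbaum_importance(example_system, p, i) for i in range(3)])
```

The series component has by far the highest importance here, since its failure brings the whole system down, whereas each redundant component matters only when its partner has failed.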

Related work

The increasing number of cloud platforms, together with the competition among cloud providers, has resulted in a situation in which customers may find it difficult to select a dependable and cost-effective cloud infrastructure. In this context, several approaches have been proposed to assist cloud customers in identifying a suitable cloud infrastructure.

Rehman [35] presented a cloud selection approach based on historical QoS to rank cloud services; the proposed approach captures variations in each time-slot, and a service selection decision is then made. All decisions are then aggregated to find the best option. Lee et al. [22] proposed a hybrid multi-criteria decision-making model for a cloud service selection problem using balanced scorecard (BSC), fuzzy Delphi method (FDM) and fuzzy analytical hierarchy process (FAHP). Sachdeva et al. [36] combined a hybrid TOPSIS method with an intuitionistic fuzzy set to select appropriate cloud solutions to manage big data projects in a group decision-making environment. Garg et al. [14] proposed a framework called SMICloud, which compares different cloud providers based on user requirements. The framework considers a set of attributes (e.g. accountability, agility, assurance of service, performance, cost, and usability) when prioritizing and ranking the best services, with the ranking mechanism based on an Analytic Hierarchy Process (AHP). Liu et al. [23] presented a multi-attribute group decision-making (MAGDM) approach for the process of choosing an adequate cloud service vendor. This approach considered objective attributes (i.e., cost and time), as well as subjective attributes such as TOE (Technology, Organization, and Environment). To demonstrate the usefulness of the approach, a hypothetical example was given. In contrast to these works, which rank alternatives using data taken from other works or estimated from case studies, we use the results obtained from the proposed hybrid approach to generate a set of optimal solutions. The approach combines RBD and SPN models to represent and evaluate cloud infrastructures with complex redundant mechanisms and maintenance policies.

Other work has evaluated the dependability of cloud infrastructures. Wei et al. [45] presented a hierarchical approach that combines reliability block diagrams and general stochastic Petri nets in evaluating the dependability of a virtual data center. Andrade et al. [1] developed a framework for transforming elements of SysML diagrams into deterministic and stochastic Petri nets. The work focused on modeling and analysis of cloud service management, with the aim of maximizing the use of cloud computing resources at the lowest possible cost. Sousa et al. [41] proposed a modeling strategy for planning cloud infrastructure when considering dependability and cost requirements. This approach is based on hierarchical and heterogeneous modeling that combines combinatorial and state-space models to represent and evaluate cloud infrastructures. Dantas et al. [9] described stochastic models for evaluating private cloud architectures, and presented a comparative cost study of public and private cloud providers. Although these works are relevant, they are only concerned with evaluating scenarios and presenting a comparison of the quantitative results.

The works discussed above have attempted to solve only one side of the problem: either the cloud selection perspective, or the dependability and cost evaluation aspect. However, there are various kinds of cloud environments with conflicting requirements, and a systematic approach is needed to judge which one should be chosen. To fill this research gap, we present a strategy that combines the need to rank different service constraints with the capacity to study alternative cloud infrastructures.

Adopted strategy and base cloud architecture

This section first introduces the proposed strategy for modeling, analysis, and ranking of cloud computing environments. After that, the base cloud architecture adopted for this study is detailed.

A strategy for decision making in cloud computing environments

The strategy is based on stochastic models and an MCDM method to rank a set of cloud infrastructures, taking into account availability, capacity-oriented availability, reliability and cost requirements. The proposed strategy can be used by service providers or individuals who are interested in building and selecting their own cloud computing environments. Figure 1 illustrates the proposed strategy. The macro activities consist of (i) experimental design, (ii) creation of availability and cost models, (iii) assessment, and (iv) the decision-making process.

Fig. 1 Strategy for decision making in cloud computing environments

Experimental design (i): This activity comprises two steps: defining the base cloud environment, and designing the experiment. The first step determines the nature of the cloud environment in terms of its components and their interactions, and defines a base cloud environment. The experiment is designed to investigate the effects of variations in one or more parameters on the base cloud environment. This is accomplished by generating distinct scenarios from the base cloud environment (e.g. redundant nodes and repairing service), and investigating the impact of these modifications on adopted metrics such as reliability and capacity-oriented availability.

Create dependability and cost models (ii): The modeling strategy comprises two steps: creation of models, and hierarchical composition. The first step aims to identify a set of individual components from the base infrastructure to be modeled through RBDs. These models are useful to analyze the reliability/availability of simple and complex systems. They can also be used to model variations in the base architecture defined from the design of experiments, such as the redundancy of nodes or virtual machines. Nevertheless, RBDs cannot easily handle detailed failure/repair behavior, and SPNs are therefore adopted to model complex redundant mechanisms and maintenance policies. We combine the main advantages of both kinds of model into a hierarchical model, which is then used to perform the analysis. We also propose equations for estimating the costs of the cloud environment, considering associated maintenance and operational costs.

Assessment (iii): This is a macro activity in which a hierarchical model, comprising RBD and SPN models, is used to evaluate the impact that different redundant mechanisms and maintenance policies have on the steady-state availability and reliability of an environment. The hierarchical model solution is computed by passing the outputs of the SPN models (the lower-level sub-models) as inputs to the higher level sub-models represented by the RBDs. Cost equations are also solved, to estimate the cost of the infrastructure under analysis. The results obtained from the dependability models and cost equations are then used in the next step to assist in the decision-making process.

Decision making (iv): At this stage, the cloud infrastructures are ranked based on any of the following distance measures: Euclidean, Manhattan and Minkowski. First, the criteria (e.g., availability or cost) are defined, together with the objective for each criterion, which can be to minimize or to maximize it. The weights of each criterion are then defined according to the decision maker’s preference. Lastly, the set of alternative solutions is ranked, based on the chosen distance measure. If the results are satisfactory for the desired criteria, the proposed strategy is complete. Otherwise, adjustments are made to the criteria, and the macro activity steps are repeated. Note that this macro step is automated, and the tool developed for this is described in “MiPACE: a multi-criteria tool for planning and analysis of cloud environments” section.
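The three distance measures named in this activity are closely related: the Minkowski distance of order p reduces to the Manhattan distance for p = 1 and to the Euclidean distance for p = 2. The sketch below illustrates this relationship; the function name and the sample points are hypothetical and are not taken from the tool described in this paper.

```python
def minkowski_distance(a, b, p=2.0):
    """Minkowski distance of order p between two points of equal length.

    p = 1 gives the Manhattan distance and p = 2 the Euclidean distance.
    """
    if p < 1:
        raise ValueError("the order p must be at least 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

# Hypothetical weighted, normalized scores of one alternative and the ideal point.
alternative = [0.0, 0.0]
ideal = [3.0, 4.0]
print(minkowski_distance(alternative, ideal, p=1))  # Manhattan
print(minkowski_distance(alternative, ideal, p=2))  # Euclidean
print(minkowski_distance(alternative, ideal, p=3))  # higher order
```

Higher orders give more weight to the criterion with the largest gap, so the choice of distance measure can change the ranking when alternatives trade off several criteria.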

The cloud environment

The base cloud architecture used for this study is depicted in Fig. 2, and comprises three main components: the main node, standby node, and the front-end. The main node consists of a virtual machine (VM) hosted on physical hardware (Hw). The virtual machine is represented by an operating system (OS) and an application service (APP). The application running in the VM is a digital library service. It should be noted, however, that the hardware in the main node supports an OS, a management server (Mng) and a VM. The management server executes the cloud services in the operating system. The standby node is used to ensure high levels of availability, and it assumes the role of the main node when a failure occurs; this node has the same components as the main node. The front-end is responsible for supervising and controlling the entire cloud environment through a specific cloud management tool. It is important to highlight that the remote storage volume can be accessed by the VMs, and is managed through the front-end. All of the components are interconnected by a private network. Note that from the base cloud architecture, more complex scenarios were considered based on the strategy described above and are described in the results and discussion section.

Fig. 2 Infrastructure overview

The cloud operational mode is described as follows. The main node (and its VM) and the front-end must be in working order for the system to be operational. However, if the standby and main nodes fail, the cloud becomes unavailable. The roles of the standby and main nodes are swapped when the VM is restored. The objective of the standby node is to maximize the availability of the cloud infrastructure, which can be established through a Service Level Agreement (SLA).

Hierarchical models and cost equations

This section describes the hierarchical models designed to represent the base cloud environment previously presented (see Fig. 2). RBDs are used to represent the dependability relationship between independent subsystems, while detailed or more complex fail and repair mechanisms are modeled using SPNs. This approach enables the representation of many kinds of dependency between components, and avoids the well-known issue of state-space explosion [43]. Furthermore, this section also presents the proposed equations for estimating the cloud environment costs, which consider associated maintenance and operational costs.

Availability models for the base cloud environment

A hierarchical model was created to compute dependability-related metrics for the cloud environment described in “The cloud environment” section. Assuming cloud environments only, the architecture can be divided into three sub-models: front-end, main node, and standby node. The base cloud environment illustrated in Fig. 2 is modeled through RBDs, and the respective availability (reliability) is given by:

$$ P_{s} = P_{PE} \times (1 - (1 - P_{mn})(1 - P_{sn})), $$
(5)

where \(P_{PE}\), \(P_{mn}\) and \(P_{sn}\) are the front-end, main node and standby node availabilities (reliabilities), respectively.

Equation 6 computes the availability for the front-end sub-model, which is composed of three components connected in series: hardware, operating system, and management server. The front-end component is responsible for identifying and managing the underlying virtualized resources (i.e., the servers, network, and storage). The hardware component corresponds to the physical parts of a computer system (i.e., the memory, CPU, network, etc.). The cloud OS primarily manages the operation of one or more virtual machines within a virtualized environment, while the management server executes the cloud services in the operating system.

$$ P_{s} = P_{Hw} \times P_{OS} \times P_{Mg} $$
(6)

where \(P_{Hw}\), \(P_{OS}\) and \(P_{Mg}\) are the hardware, operating system, and management server availabilities (reliabilities), respectively.

Equation 7 computes the availability for the main node. This node represents the computer resources for the deployment of virtual machines, and is composed of five components in series: hardware, operating system, management server, virtual machine, and service. Similar to the main node, the standby node is composed of five components in series: hardware, operating system, management server, virtual machine, and service. We assumed that the components of the standby node have the same dependability characteristics as the main node; i.e., the same MTTFs and MTTRs.

$$ P_{s} = P_{Hw} \times P_{OS} \times P_{Mg} \times P_{Vm} \times P_{Sv} $$
(7)

where \(P_{Hw}\), \(P_{OS}\), \(P_{Mg}\), \(P_{Vm}\) and \(P_{Sv}\) are the hardware, operating system, management server, virtual machine, and service availabilities (reliabilities), respectively.
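The RBD expressions for the base cloud environment can be evaluated bottom-up: component availabilities first, then the front-end and node sub-models, then the system. The sketch below follows the structure stated in the text (three components in series for the front-end, five in series for each node, and the two nodes in parallel behind the front-end). The function names and all MTTF/MTTR values are hypothetical placeholders, not the parameters used in this study.

```python
def availability(mttf, mttr):
    """Steady-state availability of one component (Eq. 1)."""
    return mttf / (mttf + mttr)

def series(*ps):
    """Series composition: every block must be working."""
    result = 1.0
    for p in ps:
        result *= p
    return result

def system_availability(p_front_end, p_main, p_standby):
    """Eq. 5: front-end in series with (main node parallel standby node)."""
    return p_front_end * (1.0 - (1.0 - p_main) * (1.0 - p_standby))

# Hypothetical (MTTF, MTTR) pairs in hours for each component type.
params = {
    "hw": (8760.0, 2.0),   # hardware
    "os": (2900.0, 0.5),   # operating system
    "mg": (800.0, 1.0),    # management server
    "vm": (2880.0, 0.5),   # virtual machine
    "sv": (1000.0, 1.0),   # application service
}
p = {name: availability(mttf, mttr) for name, (mttf, mttr) in params.items()}

p_front_end = series(p["hw"], p["os"], p["mg"])                  # Eq. 6
p_node = series(p["hw"], p["os"], p["mg"], p["vm"], p["sv"])     # Eq. 7
# The standby node is assumed to have the same MTTFs and MTTRs as the main node.
p_system = system_availability(p_front_end, p_node, p_node)      # Eq. 5
print(p_front_end, p_node, p_system)
```

Because the front-end is in series with the redundant pair, the system availability can never exceed the front-end availability, however many standby nodes are added.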

The availability model representing the base cloud environment depicted in Eq. 5 operates in a hot-standby redundancy configuration (indicated by the parallel configuration). That is, when the main node fails, the redundant component replaces it without a delay in activation. This type of redundancy improves the system availability, because when the main node fails, the hot-standby node automatically takes its place. Nevertheless, RBD equations cannot easily handle detailed failure/repair behavior. The warm-standby and cold-standby replication mechanisms cannot be fully represented in RBD models, due to the dependency between the states of components. Therefore, such mechanisms are represented by SPNs. More specifically, in this paper the warm-standby and cold-standby replication mechanisms are adopted for the main node and virtual machine components; Fig. 3 presents an example of an SPN model for a node with cold-standby redundancy. Note that the hierarchical model solution is computed by passing the outputs of lower-level sub-models as inputs to the higher-level sub-models. For example, the results from an SPN model representing a redundant VM are passed as values to the base cloud environment model. The base model is then solved to compute dependability metrics.

Fig. 3 An illustrative example of an SPN model for a node with cold-standby redundancy

Cold standby model

A component with cold standby redundancy is based on a non-active spare module that waits to be activated when the (main) active module fails. Hence, when the main module fails, the spare module takes a certain amount of time to be activated. This time period is named the mean time to activate (TACT). As the spare component is switched off, it is assumed not to fail until it becomes operational.

Figure 4 depicts an SPN model that illustrates this mechanism. The model uses two virtual machines and four places: VM1_ON, VM1_OFF, VM2_ON, and VM2_OFF. The places represent the operational and failure states of the main and spare modules. The spare module (VM2) is initially deactivated, so no tokens are stored in places VM2_ON and VM2_OFF. When the main module (VM1) fails, the transition TACT is fired, and consequently the spare module is activated. The immediate transition D_VM2 represents the deactivation of the spare module when the main module is recovered. This redundancy mechanism fails if both modules fail. Thus, the operating mode can be expressed as

$$ Cold_{Operational}~=~(\texttt{VM1\_{ON}=1~OR~VM2\_{ON}=1}) $$
(8)

where a token in place VM1_ON or VM2_ON defines the operational state of the environment.

Fig. 4 SPN for cold-standby

Warm standby model

A component with warm standby redundancy is based on a nonactive spare module that waits to be activated when the active module fails. The difference from cold standby redundancy is that both modules have a failure rate λ when operational, and the spare module has a failure rate ϕ while it is de-energized, where 0 ≤ ϕ ≤ λ.

Figure 5 illustrates an SPN warm standby model. The warm standby model has an active module with a full failure rate \(\lambda_{F}\) (1/MTTF_VM1), and the standby module operates with a reduced failure rate \(\alpha_{F}\) (1/MTTF_OPVM1). This redundancy mechanism has the spare module configured, but unavailable; it also ensures that the environment has continuously mirrored data. The spare module is activated in the presence of a fault in the environment, and consequently the time before activation will be shorter than in the cold standby approach.

Fig. 5 SPN for warm-standby

When the main module fails, the secondary module is fully activated and replaces the faulty main component. The transition TACT represents the activation event. Places VM1_ON and VM1_OFF represent the operational and non-operational states of the main module. Places OPVM2_ON and OPVM2_OFF represent the spare module while it is running in standby (configured, but not yet available). Places VM2_ON and VM2_OFF represent the situation in which the secondary module fails before being activated, because it is regularly synchronized with the main module. When the main module fails, the transition TACT is fired to activate the spare module, as in the cold standby model. The immediate transition D_VM2 has the same behavior as in the cold standby model. The entire model fails if both modules fail. Thus, the operating mode can be expressed as

$$ {\begin{aligned} Warm_{Operational}~=~(\texttt{VM1\_{ON}=1~OR~OPVM2\_{ON}=1}) \end{aligned}} $$
(9)

where a token in places VM1_ON or OPVM2_ON represents the operational state of the environment.

Figure 6 depicts the SPN active-active (A/A) redundancy model. From this model, it is possible to estimate the capacity-oriented availability (COA) of a service running on a set of VMs that are hosted on a node. The NVM and NND parameters define the number of VMs and nodes, respectively, where n>1. The places VM_ON, ND_ON, VM_OFF, and ND_OFF represent the operational and failure states of the VMs and nodes. The transition DE is enabled when there are no tokens in place ND_ON, that is, when all nodes have failed. Thus, VMs become unavailable either when they fail themselves or when all nodes fail. The VM_DW place represents the failure state of the VMs when all nodes have failed. The RVM transition represents the return of the VMs to the operational state once the nodes have been repaired.

Fig. 6 An illustrative example of an SPN model for active-active (A/A) redundancy

Equation 10 presents the COA calculation for the active-active model considering a scenario with two virtual machines and one physical node, i.e., NVM = 2 and NND = 1.

$$ {\begin{aligned} COA&=((\texttt{P}\{(\#\texttt{VM1\_ON})=(1 \times \texttt{NVM})\}~\times (1 \times \texttt{NVM}))\\ &\quad + (\texttt{P}\{(\#\texttt{VM1\_ON})=((1 \times \texttt{NVM})-1)\} \\ & \quad \times ((1 \times \texttt{NVM})-1)))/(1 \times \texttt{NVM}) \end{aligned}} $$
(10)
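As a hedged illustration of Eq. 10, the sketch below computes the COA as the expected fraction of running VMs. Writing it as a sum over k = 1..NVM is our generalization (an assumption); for NVM = 2 it reduces to the two terms of Eq. 10. The probabilities are hypothetical and would normally be produced by solving the SPN model.

```python
def coa(prob_vms_on: dict[int, float], nvm: int) -> float:
    """Capacity-oriented availability: sum_k k * P{#VM_ON = k} / NVM."""
    return sum(k * prob_vms_on.get(k, 0.0) for k in range(1, nvm + 1)) / nvm


# Hypothetical steady-state probabilities of having k VMs running (NVM = 2).
probs = {2: 0.98, 1: 0.015, 0: 0.005}

# Eq. 10 with NVM = 2: (P{#VM_ON = 2} * 2 + P{#VM_ON = 1} * 1) / 2
coa_value = coa(probs, nvm=2)
print(f"COA: {coa_value:.4f}")
```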

Cost model

The cost model uses the concept of Total Cost of Ownership (TCO) for evaluating and comparing the costs of cloud computing environments. TCO is the process of identifying cost categories other than price, transport, and operational costs [29, 46]. From the details of each experiment described above (such as the number of nodes, and service availability and unavailability), we define a period of time over which to estimate the total cost of each cloud environment under study. The estimate includes the cost of maintenance, operation, and rent of the cloud environment. Equation 11 estimates the total cost of the infrastructure.

$$ TCe = Cr + Cm + Cop $$
(11)

Cr, which is represented by Eq. 12, allows determination of the costs associated with the rent of the cloud infrastructure. \(\sum Lc\) represents the monetary value paid for the components that make up the infrastructure, that is, the amount of investment made in equipment and facilities to keep the infrastructure in operation. N is the number of nodes deployed, T is the assumed time period, and Av is the availability of the infrastructure as a service.

$$ Cr = \sum Lc \times N \times T \times Av $$
(12)

Equation 13 is used to estimate the maintenance costs (represented by Cm). Dwt is the downtime period. \(Lb_{Dw}\) represents the maintenance labor cost per hour when a failure occurs. Sf is a service factor; i.e., customers may pay more or less depending on the contracted service, which affects the priority level for problem resolution. N and VM represent the number of nodes and virtual machines allocated by the contract, respectively. T is the period of service specified in the contract, while \(\sum Cr\) represents the costs related to the replacement of cloud components.

$$ Cm = (Dwt \times Lb_{Dw} \times Sf \times N \times VM \times T) + \sum Cr $$
(13)

Equation 14 represents Cop, and allows the calculation of the operational costs of the cloud environment. Ec is the energy consumption, and \(E_{p}\) is the electricity price. \(Lb_{Up}\) represents the monetary value of each hour spent on keeping the infrastructure operational, while T, Sf, Av, N, and VM represent the same parameters as presented in the previous equations.

$$ {\begin{aligned} Cop &= (Ec \times E_{p} \times N \times T \times Av)\\ & \quad + (Lb_{Up} \times Sf \times Av \times N \times VM \times T) \end{aligned}} $$
(14)
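Equations 11 to 14 can be transcribed directly into code. The following is a minimal sketch; the class and field names are our own, and the numeric inputs are hypothetical placeholders rather than the values of Table 1.

```python
from dataclasses import dataclass


@dataclass
class CostInputs:
    lc_sum: float       # sum of Lc: value paid for the infrastructure components
    n: int              # N: number of nodes
    vm: int             # VM: number of virtual machines
    t: float            # T: time period
    av: float           # Av: availability of the infrastructure
    dwt: float          # Dwt: downtime period
    lb_dw: float        # Lb_Dw: maintenance labor cost per hour
    lb_up: float        # Lb_Up: operation labor cost per hour
    sf: float           # Sf: service factor
    ec: float           # Ec: energy consumption
    ep: float           # E_p: electricity price
    repl_sum: float     # sum of Cr in Eq. 13: component replacement costs


def rent_cost(p: CostInputs) -> float:          # Eq. 12
    return p.lc_sum * p.n * p.t * p.av


def maintenance_cost(p: CostInputs) -> float:   # Eq. 13
    return p.dwt * p.lb_dw * p.sf * p.n * p.vm * p.t + p.repl_sum


def operational_cost(p: CostInputs) -> float:   # Eq. 14
    energy = p.ec * p.ep * p.n * p.t * p.av
    labor = p.lb_up * p.sf * p.av * p.n * p.vm * p.t
    return energy + labor


def total_cost(p: CostInputs) -> float:         # Eq. 11
    return rent_cost(p) + maintenance_cost(p) + operational_cost(p)


# Hypothetical inputs chosen only to make the arithmetic easy to follow.
example = CostInputs(lc_sum=1.0, n=2, vm=2, t=10.0, av=0.5, dwt=1.0,
                     lb_dw=1.0, lb_up=1.0, sf=1.0, ec=1.0, ep=1.0,
                     repl_sum=5.0)
print(total_cost(example))
```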

MiPACE: a multi-criteria tool for planning and analysis of cloud environments

This section presents the details of the developed tool. MiPACE was developed to support the planning of cloud infrastructures that take customer service constraints into account, and the tool assists in the decision-making process. It allows analysts, technicians, managers, and users of cloud services to plan and analyze cloud scenarios. The tool is written in the programming language C, and the features implemented are described below.

  (i) Mercury tool: The Mercury tool [40] was developed by the MODCS research group, and allows the creation and evaluation of performance and dependability models. It implements the following formalisms: Continuous Time Markov Chains (CTMCs), Reliability Block Diagrams (RBDs), Energy Flow Models (EFMs), and Stochastic Petri nets (SPNs). The Mercury tool is used along with the MiPACE tool to create hierarchical models, and to solve the experimental study design scenarios.

  (ii) Integration module: Because MiPACE does not implement the RBD and SPN formalisms, this module was implemented to integrate the results obtained from the Mercury tool into MiPACE. That is, an input file is created with all of the results obtained from the design of experiment studies, and then uploaded into the MiPACE tool.

  (iii) Design of experiment editor: This feature allows users to plan experiments. Initially, it is necessary to choose the number of factors to be combined. The user then indicates the number of levels for each factor. Note that the tool supports the full factorial method, which involves testing every combination of factors against each other. As explained earlier, the experiment design is adopted to investigate the effects of variations of one or more parameters in the base cloud environment; we therefore generate distinct SPN models from the SPN model that represents the base cloud environment, and investigate the impact of such modifications on the adopted metrics. Thus, the purpose of this feature is to provide a set of scenarios that will be modeled and analyzed by the Mercury tool.

  (iv) Ranking generator: Once the results obtained from the experiment study designs have been uploaded into MiPACE, the tool ranks a set of optimal solutions. At this stage, the user must define the criterion functions (e.g., availability or cost) and, for each criterion, the objective, i.e., whether it is to be minimized or maximized. The user can then choose the distance measure for ranking the solutions, such as the Euclidean, Manhattan or Minkowski distances [16]. These distance measures are used for similarity comparisons. Finally, the user can add weights to the criterion functions previously defined to prioritize one variable over another; for example, the user could use this option to prioritize cost over high availability.

  (v) Report tool: The results for each criterion are displayed in the tool panel. MiPACE also generates two output files containing the ranking of the architectures, with one output file used for visualization and the other for plotting purposes. If necessary, the user can change the criterion function or objective, and repeat the ranking step.
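The ranking step in item (iv) can be sketched as a weighted distance to an ideal point. The details below are assumptions made for illustration and are not confirmed as MiPACE's implementation: each criterion is min-max normalized so that 0 is the best observed value and 1 the worst, and configurations are ordered by their weighted Minkowski distance to the ideal point (p = 1 gives the Manhattan distance, p = 2 the Euclidean distance). The configuration names and values are hypothetical.

```python
def rank(configs, criteria, p=2.0):
    """Order configuration names from best to worst.

    configs:  {name: {criterion: value}}
    criteria: {criterion: (goal, weight)} with goal "min" or "max"
    p:        Minkowski order (1 = Manhattan, 2 = Euclidean)
    """
    scores = {name: 0.0 for name in configs}
    for crit, (goal, weight) in criteria.items():
        values = [c[crit] for c in configs.values()]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid division by zero if all values are equal
        for name, c in configs.items():
            # Normalized gap to the best observed value: 0 is ideal, 1 is worst.
            if goal == "min":
                gap = (c[crit] - lo) / span
            else:
                gap = (hi - c[crit]) / span
            scores[name] += weight * gap ** p
    return sorted(configs, key=lambda name: scores[name] ** (1.0 / p))


# Hypothetical results for three configurations.
configs = {
    "A": {"cost": 100.0, "coa": 0.990},
    "B": {"cost": 120.0, "coa": 0.999},
    "C": {"cost": 200.0, "coa": 0.995},
}
criteria = {"cost": ("min", 1.0), "coa": ("max", 1.0)}
ranking = rank(configs, criteria)
print(ranking)
```

Setting the weight of a criterion to zero removes it from the comparison, which is one simple way to express the prioritization of one variable over another.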

Results and discussion

This section discusses a case study to illustrate the applicability of the proposed approach when considering availability, capacity-oriented availability, reliability and cost requirements. The approach assists individuals in identifying an ideal cloud infrastructure, and takes service constraints into account. The availability and cost models are useful during the design and analysis of cloud infrastructures, because they represent the characteristics of cloud environments. The results obtained by the evaluation of these models serve as the input to the MiPACE tool, which then finds a set of optimal solutions.

Evaluation of the base cloud environment

The first part of this case study aims to demonstrate the applicability of the availability models, and presents the results obtained for the base cloud environment. The base cloud environment represented in Fig. 2 was modeled (shown in Eqs. 6 and 7), and the availability models were combined to represent the whole cloud infrastructure. Equation 5 illustrates the RBD model for the base cloud environment. The reliability importance index (RI) was then adopted to identify which component of the system required further attention to increase the availability level. Considering only the front-end and main nodes among the devices shown in Fig. 2, the RI values were 0.153201 and 0.219544, respectively. The main node is therefore the most critical component, and the one for which adopting a redundancy mechanism matters most. Three redundancy strategies (hot, cold, and warm) were used to increase the availability levels, and these mechanisms are presented in Eqs. 4 and 5.

Table 1 shows the parameter values adopted for estimating the cost of the cloud environment. The \(E_{p}\) and Ec parameters represent the energy price (in USD) and the energy consumption per kilowatt-hour [13], respectively. Such parameters only take the servers into account. The \(Lb_{Up}\) (operation) and \(Lb_{Dw}\) (maintenance) parameters indicate the labor cost per hour, while \(R_{t}\) represents the rental rate for the cloud infrastructure. The type of service is categorized as gold, silver, or bronze, and these categories reflect the capacity of the cloud maintenance team to support different quality levels; a reduction of 10% in the mean time to repair is assumed for the silver service in comparison to the gold service, and a reduction of 20% is assumed for the bronze service in comparison to the gold service.

Table 1 Cost parameters

Table 2 presents the Mean Time to Failure (MTTF) and Mean Time to Repair (MTTR) used for the availability model (Eq. 5). Those values were obtained from [2, 8, 20], and are used to compute dependability-related metrics for the sub-models, and then for the whole system.

Table 2 Parameters for the front-end submodel

Table 3 shows the parameter values used for the sub-model (Eq. 6), based on [2, 8, 20].

Table 3 Parameters for the cloud node sub-model

Table 4 shows the input parameters for the cold-standby SPN model based on [2, 8, 20]. The parameter values used for evaluating the model may be modified to represent, for example, different service repair policies. It is thus possible to analyze situations where the firing of transitions is shorter or longer depending on the adopted repair policy. Furthermore, values regarding the mean time to failure or mean time to repair may represent components with higher or lower reliability. These models can assist individuals in identifying service repair policies that fit their needs.

Table 4 Parameters for the cold-standby SPN model

Table 5 shows the input parameters for the warm-standby model. Like the cold-standby model, the values relating to the mean time to repair and mean time to failure can be modified in order to represent, for example, more reliable repair policies. Such modifications help individuals in planning cloud environments that fit their needs and budgets.

Table 5 SPN Warm-Standby parameters

Table 6 presents the evaluation of the base cloud environment under distinct configurations. The first configuration is composed of the front-end and the main node, while the second comprises the front-end, the main node and a hot-standby redundancy of the main node. The third configuration includes the front-end, the main node and a cold-standby redundancy of the main node. Lastly, the fourth configuration is composed of the front-end, the main node, and warm-standby redundancy of the main node. After evaluation of the models that represent each of these configurations, it is possible to note the differences between each configuration in terms of availability, capacity-oriented availability, reliability and cost. Furthermore, the importance of redundant mechanisms is illustrated by comparing the first configuration to configurations (2, 3 and 4).

Table 6 Results obtained by solving the models for the base cloud environment

The downtime for each configuration was also calculated. We considered the downtime in minutes over a period of one month, and the following values were obtained: (1) 193.99 min, (2) 93.05 min, (3) 120.88 min, and (4) 108.60 min. As expected, the hot-standby mechanisms had the lowest downtime in relation to the other configurations, followed by warm, cold and finally the configuration without redundancy.
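The relation between steady-state availability and downtime can be made explicit with a short sketch. It assumes a one-month period of 720 h and converts the downtime values reported above back into the availability they imply; the function names are our own.

```python
def downtime_minutes(availability: float, period_hours: float = 720.0) -> float:
    """Expected downtime over the period: (1 - A) * T, expressed in minutes."""
    return (1.0 - availability) * period_hours * 60.0


def availability_from_downtime(downtime_min: float,
                               period_hours: float = 720.0) -> float:
    """Inverse relation: A = 1 - downtime / T."""
    return 1.0 - downtime_min / (period_hours * 60.0)


# Monthly downtime (minutes) reported in the text for configurations 1 to 4.
reported = {"no redundancy": 193.99, "hot": 93.05, "cold": 120.88, "warm": 108.60}

implied = {name: availability_from_downtime(m) for name, m in reported.items()}
for name, a in implied.items():
    print(f"{name}: A = {a:.6f}")
```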

The third column of Table 6 describes the values obtained for the reliability analysis of each configuration. The reliability metric was obtained through transient analysis over a time period (T) of 24 hours. The reliability analysis assumes that the system cannot be repaired. The last column of Table 6 shows the estimated total cost (TCe) for each configuration; as expected, configuration 1 had the lowest TCe because it is the simplest configuration. In the next subsection, an experiment study design is performed with the base cloud environment, in order to generate a set of scenarios that will be ranked using an MCDM method.

Planning for design of experiments

From the base cloud environment, we designed an experiment to identify which of the variables have the greatest influence on the adopted metrics (i.e., capacity-oriented availability, availability or cost). Table 7 illustrates the experimental plan that considers the variables from Table 8. We generated 72 configurations, each of which received a sequential number (1 to 72). However, we removed 36 of them because they cannot be represented in the availability models. This happens because the full factorial method tests every combination of factors against each other, and some combinations cannot be applied in a real cloud environment. Service providers or individuals that adopt this approach must check the experimental plan to identify any inconsistencies. For example, if the experimental planning generates a scenario where the number of nodes and the number of VMs are equal to 1, the redundancy mechanism factor should only be assigned to the no redundancy (N/R) level (see Table 8). This is because modelling a configuration with redundancy requires the number of nodes or VMs to be greater than 1. We kept the numbering initially generated after removing the scenarios that could not be represented.
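A full factorial plan with a feasibility check can be sketched as follows. The level counts (2 × 3 × 3 × 4 = 72) match the number of configurations mentioned above, but the concrete level values are assumptions, since Table 8 is not reproduced here. Only the single rule stated in the text is implemented, so this sketch removes fewer scenarios than the 36 removed in the study, which relied on further checks against the availability models.

```python
from itertools import product

# Assumed levels; only the number of levels per factor is taken from the text.
factors = {
    "nodes": [1, 2],
    "vms": [1, 2, 3],
    "service": ["gold", "silver", "bronze"],
    "redundancy": ["N/R", "hot", "cold", "warm"],
}


def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


def is_feasible(cfg):
    """Rule from the text: one node with one VM admits no redundancy."""
    if cfg["nodes"] == 1 and cfg["vms"] == 1:
        return cfg["redundancy"] == "N/R"
    return True


# Number the scenarios first, then filter, so the original numbering is kept.
plan = list(enumerate(full_factorial(factors), start=1))
kept = [(number, cfg) for number, cfg in plan if is_feasible(cfg)]
print(len(plan), len(kept))
```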

Table 7 Planning for design of experiments of the first and second scenarios
Table 8 Factors and levels

Table 8 presents an overview of the factors (i.e., variables, (k)) and the levels (\(n_{i}\)) applied in the design of the experiments. We considered two levels of redundancy for the node factor, and considered up to three levels for the VM factor. The type of service (TS) factor represents the maintenance factors adopted in this paper, and reflects the capacity of the cloud maintenance team to support different levels of maintenance quality. In this sense, we consider a reduction of 10% in the mean time to repair using a silver service compared to a gold service, and a reduction of 20% in the mean time to repair using a bronze service compared to a gold service.

The redundancy type (RT) factor refers to the provision of support for the redundancy feature. Hot redundancy is commonly used when the system must not go down, even briefly, under any condition. This mechanism uses a spare component in the same regime as the primary one, and the redundant unit is fully capable of supporting the primary unit. Cold redundancy switches to the reserve unit only after failure of the primary unit. For the switch to take place, a time is scheduled for the substitution of the primary module by the reserve module. Warm redundancy tends to decrease the switching time to the reserve module. Lastly, the N/R level (present in Table 8) means that the environment does not consider any redundancy approach.

When the design of experiments and the modeling of each scenario were complete, the next step was to assess the hierarchical models based on the RBDs and SPNs. We considered the following dependability metrics: availability, reliability and unavailability. Table 9 presents the results obtained for each scenario. For the analysis of downtime, we considered a time interval of one month (720 h), and the reliability was obtained by transient analysis over 24 h. This table serves as input for our proposed approach, for which we present case scenarios examining distinct customer constraints.

Table 9 Dependability and cost results for each configuration

First case scenario

To demonstrate the applicability of the multi-criteria approach, the first case considers a situation in which a given cloud user wishes to select a cloud environment with lower cost and higher capacity-oriented availability. The distance measure selected for this case was the Euclidean distance. Table 10 illustrates the ranking of cloud environments considering the criteria and distance measure selected. Note that this ranking was automatically computed by the MiPACE tool (see “MiPACE: a multi-criteria tool for planning and analysis of cloud environments” section).

Table 10 Configuration ranking for the defined criteria (i.e. minimize cost and maximize COA)

As Table 10 shows, cloud configuration 38 is the best option for the defined criteria. This environment has the following configuration: two nodes and two VMs with hot-standby redundancy, and a gold service type. The second-best option in the ranking is configuration 40, which showed an increase in cost (0.03%) and a decrease in COA (0.69%). The third-best option (configuration 34) had an increase in cost of 2.11% and a decrease in COA of 0.007% when compared to the first option. The worst scenario in the ranking (environment 43) had an increase in cost of 84.57% and in COA of 0.04% when compared to the best option. Table 11 illustrates the components adopted for each cloud configuration.

Table 11 Summary of the components used to rank the configurations (First case)

Figure 7 summarizes the configuration ranking generated for the first case scenario. Figure 7a gives an overview of all configurations described in Table 10, while Fig. 7b shows a set of optimal configurations. The y-axis represents the COA, while the x-axis represents the cost of the cloud configurations. The optimal set of results is plotted near the x- and y-axes, in accordance with the defined criteria (i.e., higher COA and lower cost). Figure 7b presents the optimal set of configurations at a higher level of detail, and shows that configuration 38 is optimal.

Fig. 7 Ranking of the configurations generated for the first case scenario. (a) Overview of all ranked configurations. (b) Optimal configurations

Second case scenario

The second case scenario considers a situation in which a company or service provider wishes to choose a cloud environment with higher reliability and lower cost. In this context, the multi-criterion function aims to maximize the first goal and minimize the second. Once the decision variables (reliability and cost) have been selected, the ranking is computed using the MiPACE tool. As in the first case scenario, the Euclidean distance was adopted to find a set of optimal solutions.

Table 12 presents the ranked configurations when considering the defined criteria, while Table 13 describes the components of the first and last configurations. Configuration 39 is the best-ranked configuration, and has the following components: two nodes and two VMs with cold-standby redundancy, and the gold service type. The second configuration in the ranking is configuration 38, which shows a reduction in reliability of 0.062% and a reduction in cost of 0.058% (USD 0.19), when compared to the best-ranked configuration. The third-best configuration (40) had a reduction in reliability and cost of 0.074% and 0.026% respectively, when compared to the best-ranked configuration. The worst configuration in the ranking is configuration 48, which shows a decrease in reliability of 2.416% and an increase in cost of 44.858% when compared to the best configuration. Despite the reduction in cost of the configurations ranked in second and third positions, their reliability also decreased due to the type of redundancy adopted.

Table 12 Configuration ranking for the defined criteria (i.e. maximize reliability and minimize cost)
Table 13 Summary of the components used to rank the configurations (Second Case)

Figure 8 summarizes the configuration rankings generated for the second case scenario. Figure 8a presents an overview of all configurations described in Table 12, while Fig. 8b presents the details of the optimal set of configurations. The y-axis indicates the reliability of the cloud environment and the x-axis represents the cost. Given the defined criteria of maximizing reliability and minimizing cost, the optimal set of configurations is shown by the points plotted in the lower right side of Fig. 8a. Figure 8b shows that configuration 39 is optimal.

Fig. 8 Ranking of the configurations generated for the second case scenario. (a) Overview of all ranked configurations. (b) Optimal configurations

Third case scenario

The third case scenario considers a situation in which a company or service provider wishes to choose a cloud environment with higher COA and lower cost. To address this case, we adopt the active-active (A/A) redundancy model and consequently generate another design of experiments. Table 14 illustrates the factor values and levels and Table 15 presents an overview of the factors (i.e., variables, (k)) and the levels (\(n_{i}\)) applied in the design of the experiments.

Table 14 Factors and levels
Table 15 Planning for design of experiments of the third scenario

Table 16 presents the ranked configurations when considering the defined criteria, while Table 17 describes the components of the first and last configurations. Configuration 31 is the best-ranked configuration and has the following components: eight nodes and sixteen VMs with active-active redundancy, and the gold service type. The second configuration in the ranking is configuration 1, which shows a decrease in COA (0.002%) and a decrease in cost (61.89%) when compared to the best-ranked configuration. The third-best configuration (4) had a decrease in COA of 0.002% and a decrease in cost of 41.26% when compared to the best-ranked configuration. The worst configuration in the ranking is configuration 60, which shows an increase in cost of 1229.35% when compared to the best configuration. In spite of the increase in the number of nodes and VMs, and the consequent increase in the cost of the ranked configurations, we noticed that the COA values remained close to each other.

Table 16 Configuration ranking for the defined criteria (i.e. maximize COA and minimize cost)
Table 17 Summary of the components used to rank the configurations (Third Case)

Figure 9 summarizes the configuration rankings generated for the third case scenario. Figure 9a presents an overview of all configurations described in Table 17, while Fig. 9b presents the details of the optimal set of configurations. The y-axis indicates the cost of the cloud environment and the x-axis represents the capacity-oriented availability (COA). Given the defined criteria of minimizing cost and maximizing COA, the optimal set of configurations is shown by the points plotted in the lower right side of Fig. 9a. Figure 9b shows that configuration 31 is optimal.

Fig. 9 Ranking of the configurations generated for the third case scenario. (a) Overview of all ranked configurations. (b) Optimal configurations

Final remarks

In this paper, we presented an approach to model and analyze cloud infrastructures while considering availability, capacity-oriented availability (COA), reliability and cost requirements. This approach uses stochastic models and a multiple-criteria decision-making method to calculate dependability-related metrics, and to rank cloud infrastructures. A hierarchical strategy was used to model and plan cloud infrastructures, combined with a multiple-criteria decision-making method to find the most appropriate infrastructure from the set. A case study was presented to illustrate the feasibility of the proposed approach. The results show that our approach enables service providers, or individuals who are interested in building their own clouds, to choose the most appropriate cloud infrastructure; the approach allows multiple criteria such as reliability, downtime, and cost to be considered.

As future work, we plan to apply our approach in a more complex environment and to utilize other MCDM methods. We also intend to evaluate other metrics such as response time, throughput, and CPU usage.