Abstract
For a range of major scientific computing challenges that span fundamental and applied science, the deployment of Big Data Applications (BDAs) on a large-scale system, such as an internal or external cloud, a cluster or even distributed public resources (“crowd computing”), needs to be offered with guarantees of predictable performance and utilization cost. Currently, however, this is not possible, because scientific communities lack the technology, at the level of both modelling and analytics, that identifies the key characteristics of BDAs and their impact on performance. There are also few data or simulations available that address the role of the system operation and infrastructure in defining overall performance. Our vision is to fill this gap by producing a deeper understanding of how to optimize the deployment of BDAs on hybrid large-scale infrastructures. Our objective is the optimal deployment of BDAs that run on systems operating on large infrastructures, in order to achieve optimal performance while taking into account running costs. We describe a methodology to achieve this vision. The methodology starts with the modeling and profiling of applications, as well as with the exploration of alternative systems for their execution, which are hybridizations of cloud, cluster and crowd. It continues with the employment of predictions to create schemes for performance optimization with respect to cost limitations for system utilization. The schemes can accommodate execution by adapting, i.e. extending or changing, the system.
1 Introduction
An ever-growing interest in analytics has made Big Data Applications (hereafter BDAs) a top priority: they allow data-driven scientific domains to extend their research methods to an unprecedented scale and tackle new research questions, and they also allow businesses worldwide to define new initiatives and re-evaluate their current strategies through data-driven decision-making. The deployment of BDAs on a large-scale system, such as an internal or external cloud, a cluster or the crowd, needs to be offered with guarantees of predictable performance and utilization cost. Currently, however, this is not possible, because we lack the technology that identifies the key characteristics of BDAs and their impact on performance, as well as the reasoning on the role of the system operation and infrastructure in performance. The proposed project aims to fill this gap by producing the research results for the development of such technology, and the technology itself.
BDAs typically produce and consume terabytes of data and they may perform complex and long-running computations. Thus, frequent access to the storage subsystem or intense use of the CPU becomes an application bottleneck. Modeling the performance with respect to I/O, CPU and memory utilization gives optimization opportunities to both system providers and users. System providers can predict such bottlenecks and utilize system combinations that may alleviate them. They can also offer cost guarantees within Service Level Agreements related to I/O or computation, as well as more appealing pricing schemes. Furthermore, the users can take informed decisions for the adaptation and the deployment of their BDA in a system, and, therefore, select in an optimal manner the computing resources needed for their application workload.
Research Objective
We need to manage efficiently the performance of BDAs that run on systems operating on large infrastructures, taking into account the cost of using the underlying system. The combination of application characteristics A and system characteristics S determines the performance P of the application. Instances of the input space I = S × A are mapped to specific points or areas of P. We need to identify functions f : I → P, i.e. mappings of the application and system characteristics to the application performance. We need to use these functions to predict the performance. Furthermore, the utilization of a system s ∈ S entails a cost C, which is either the cost of power consumption or the renting price. Therefore, points or areas in P for an application a ∈ A running on a system s ∈ S are associated in a 1–1 manner with cost C, i.e. (a ∈ A, s ∈ S) → (p ∈ P, c ∈ C). We need to manage the performance and cost by adapting either the application or the system. Based on the predictions for performance and the entailed cost, we want to create guidelines for application deployment on various system combinations with guarantees for performance, as well as cost. Ultimately, we want to select areas I′ ⊆ I for which performance is maximized and cost is minimized.
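To make the objective concrete, the following sketch shows one way the mapping f : I → P and the cost model could be represented programmatically; it is a minimal sketch, and all class and function names are illustrative assumptions rather than artifacts of the project.

```python
# A minimal sketch of the objective, assuming a performance predictor
# f(a, s) and a cost model c(s) are available; all names are illustrative.
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

@dataclass(frozen=True)
class Application:            # a ∈ A: application characteristics
    io_intensity: float
    cpu_intensity: float
    data_size_gb: float

@dataclass(frozen=True)
class System:                 # s ∈ S: system characteristics
    nodes: int
    cost_per_hour: float      # renting price or power-consumption proxy

def select_deployments(
    pairs: Iterable[Tuple[Application, System]],
    f: Callable[[Application, System], float],   # f : I -> P
    c: Callable[[System], float],                # utilization cost
    budget: float,
) -> List[Tuple[Application, System]]:
    """Keep the subspace I' ⊆ I whose cost fits the budget, ranked by
    predicted performance per unit of cost."""
    feasible = [(a, s) for a, s in pairs if c(s) <= budget]
    return sorted(feasible, key=lambda p: f(*p) / c(p[1]), reverse=True)
```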
In this paper we describe the methodology with which we can achieve the above objective. Overall, to achieve it we have to start with the modeling of BDAs with respect to dimensions of workload, data and resources, and the profiling of BDAs with respect to the proposed modeling. We also need to explore alternative systems for the deployment of BDAs, ranging from cluster, to cloud, and crowd computing, and we need to develop a methodology for the performance prediction of BDAs deployed in any of the three computing environments and combinations of them. This can be achieved with the creation of a multi-agent utilization model and the analysis of the computing environment with numerical and analytical methods. Then, we can employ the predictions to create two types of schemes for performance optimization with respect to cost limitations concerning the system utilization. The first type will reduce the execution cost by adapting, i.e. approximating or summarizing, the workload. The second will accommodate the execution cost by adapting, i.e. extending or changing, the system. Certainly, such research needs to be accompanied by thorough implementation and experimentation on real environments that can give feasible combinations of application and system characteristics. We argue that for such research we need to select at least two application environments from domains on which Big Data analytics has, and will continue to have, a great impact. These two environments need to cover, from end to end, the profiling dimensions of the underlying systems, in a very different qualitative and quantitative manner.
The efficient management of the performance of BDAs is a matter of utmost importance in the Big Data era. The research results will show the potential and the limitations of performance and cost prediction in eco-systems of users and resources from cluster, cloud and crowd, in arbitrary and dynamically evolving combinations. Ultimately, the results will enable the creation of policies at national and European levels concerning effective utilization of resources and constructive public engagement.
2 Related Work
The potential of managing and processing vast amounts of data in extremely large infrastructures, but also the difficulties that such processing entails, has led both academia and industry to focus intensively on the production of relevant solutions. These efforts have resulted in hundreds of research papers in the last few years [1] that try to deal with (a) the heterogeneity of resources in data centers, (b) scalability problems, (c) the variability of applications and (d) the unpredictability of the workload. The proposed research aspires to tackle these four issues in a holistic manner by proposing a suite of techniques, starting from the overall and generic profiling of applications on heterogeneous systems, continuing with performance and cost prediction, and following up with proposals for dealing with scalability issues, based on workload approximation and summarization, as well as expansion of the underlying system and its combination with external clouds and crowd. In the following we discuss the related work on major issues in the proposed approach, i.e. performance prediction, I/O modeling and prediction, approximation of execution, and hybrid systems for large-scale computing.
Performance Prediction
Specifically, the issue of performance prediction has received great attention. The work in [2] proposes a methodology to predict the performance of cloud applications developed with the mOSAIC framework [3], based on benchmarking and simulation. These are much harder to achieve for any type of application that is not developed through a specific framework. Our work aims to provide a more generic methodology that will be able to capture the characteristics of an application and predict its performance on various types of clouds, but also on cluster and crowd. Furthermore, this work does not predict the scaling laws of applications. CloudProphet [4] aims at predicting the performance and costs of legacy web applications that need to be executed in the cloud, based on application instrumentation and tracing. Our work intends to predict the performance and costs for applications that can use the resource elasticity of large infrastructures. Elastisizer [5] is a system that takes as input a user query concerning the cluster size that is suitable for a MapReduce job. The system focuses on profiling in a detailed way the phases of MapReduce and their configuration parameters, e.g. how many Map and Reduce phases there are, how much memory is allocated to each and what the data access scheme is, and it takes into account the structure of the underlying cluster. Concerning the deployment of BDAs on large infrastructures, we need basic profiling, e.g. I/O operations and CPU usage, for general applications and not only for MapReduce; therefore, we need to focus on applications that are centralized (single-VM) or distributed (multi-VM) and process data in a batch or iterative manner.
Performance prediction has also been tackled for the specific case of database deployment in the cloud [6]. The most recent work in [7] deals with prediction through black-box modeling for similar workloads and white-box modeling for workloads very different from those observed in training. The latter gives results that are fed into statistical regression models. This work focuses on OLTP workloads and does not consider the expansion of the underlying system for accommodating the load. Overall, such works focus on the creation of multi-tenancy strategies (e.g. [8, 9]) and consolidation schemes [10] for OLTP workloads, and are based on the assumption that the execution time of queries is known through historical data. In our work, we will focus on big analytical workloads, for which we do not know the execution time of jobs. Such workloads differ from OLTP ones in that they can include both short and long running tasks, which can be both I/O and CPU intensive, and can show skewness and temporal locality in data access. Therefore, we intend to profile the applications with respect to basic operations, e.g. reads and writes, as well as CPU cycles and I/O accesses, rather than profiling whole tasks, queries or transactions. Furthermore, multi-tenancy and consolidation are solutions complementary to system expansion. We intend to build policies for accommodating the application load by adding servers and/or outsourcing the load to external clouds and crowd, given existing multi-tenancy and consolidation schemes or adapting them.
Performance prediction has been tackled for analytical workloads, but not extensively and not for workloads on big amounts of data. The work in [11] and its most recent continuation, Contender [12], focus on batch queries that consist of a restricted number of query templates known a priori, and propose a prediction technique based on simulations of the workload execution. The challenge in these works is the concurrent execution of queries. Concerning the deployment of BDAs on large infrastructures, we also need to tackle the challenge of workload concurrency and its effect on performance, but for a broad range of processing tasks on big amounts of data, which, furthermore, run on large heterogeneous underlying systems.
I/O Modeling and Prediction
The sheer size of Big Data collections increases the complexity of computation, but especially of I/O management, and makes it very hard to characterize and predict the performance behavior of corresponding applications. I/O modeling and prediction have been studied in the past. In [13, 14] modeling for disk drives and disk arrays is performed. Both approaches make a hierarchical decomposition of the I/O path and examine the impact of each component separately. However, this is impossible for large-scale, complex virtualized environments. Other well-known approaches to disk array modeling are the ones presented in [15, 16] and [10]. While [15] models the disk array and [16] treats it as a black box, they both define workload dimensions and they fit samples of performance to a model. Nevertheless, as these models target only disk arrays, the defined dimensions are not adequate to capture BDA workload characteristics. We need to fill this gap by developing methods for the thorough profiling of BDAs in terms of workload, but also, data and resources.
Although the above approaches seem to work satisfactorily enough for disk I/O prediction, they do not work when long and complex I/O subsystem paths are involved. Realizing that I/O systems become more and more complex, in [17] a self-scaling benchmark is presented. It measures I/O performance as it is seen by an end user issuing reads and writes. An end-to-end approach is also used in [18]. However, these works do not focus on applications deployed in cloud-based environments, but on parallel multicore systems. Concerning the deployment of BDAs on large infrastructures, we need to focus on the I/O and overall performance behavior of BDAs deployed in large-scale systems, with an emphasis on the cloud.
I/O characterization in virtualized environments is also carried out in [19, 20]. Understanding I/O performance reveals opportunities for server consolidation and for designing efficient cloud storage infrastructures. However, [19] is an experimental study for specific applications and does not include modeling of I/O behavior. An alternative approach to the problem of I/O characterization in the cloud is presented in [21]. I/O traces from production servers are collected and used as a training set for a machine-learning tool. During the learning process, I/O workload types are identified automatically and, as output, an I/O workload generator is produced. This generator can simulate real application workloads and thus it can be used for storage systems evaluation. Nevertheless, such an evaluation has not been done yet. Concerning the deployment of BDAs on large infrastructures, we need to produce results in this direction, thoroughly evaluating real cloud storage systems, but also aggregated crowd systems.
Approximate Execution
In the past few years there has been a great and growing interest in executing a query workload in a way that approximates its actual execution and, therefore, gains in response time.
Some of the work is on approximate query processing. The recent work in [22], as well as in [23, 24], explores querying large data by accessing only a bounded amount of it, based on formalized access constraints. These works give theoretical results on the classes of queries for which bounded evaluation is possible. Other works focus on how to pre-treat the data in order to create synopses: histograms (e.g. [25]), wavelets (e.g. [26]) and sampling (e.g. [27]); or on how to perform execution that terminates based on cost constraints and returns intermediate results (e.g. [28]). For BDAs, it makes sense to achieve approximation not through alteration of the data, but through alteration of the workload.
Another type of work is on approximate query answering, in which a query that is more ‘suitable’ in some sense is executed in the place of the original one. In [29] a datalog program is approximated with a union of conjunctive queries, and in [22] the same approach is followed with the creation of approximate versions of classes of FO queries. In a similar spirit, the works in [30, 31] deal with tractable approximations of conjunctive queries and the work in [32] deals with subgraph isomorphism for graph queries. Concerning the deployment of BDAs on large infrastructures, we need to work along the same lines as these works, but focus on workflows and sets of tasks rather than queries.
Hybrid Systems for Large-Scale Computing
Issues of resource under-utilization and saturation in cloud systems can be tackled with the employment of combinations of large-scale systems that can limit over-provisioning and accommodate excessive demand in a dynamic manner. The appearance of the cloud computing paradigm was soon followed by the idea of the construction of cloud federations. The basic objective of the latter is to offer broad choices on a related set of cloud services from multiple providers. This allows cloud federations to build hybrid clouds, which can compose services from multiple sites. Currently, many cloud vendors are endorsing this idea and are building such solutions. Cisco is offering the ‘Cisco Intercloud Fabric’ [33], designed to combine and move data and applications across different public or private clouds as needed. HP is launching the ‘Helion Network’ [34], a federated ecosystem of service providers and software vendors that will provide customers with an open market for hardware-agnostic cloud services. Amazon facilitates the migration of enterprise database legacy applications to the AWS cloud through a hybrid cloud solution that employs ‘Oracle RAC on bare metal servers, either in on-premises data centers or in a private cloud, with low-latency connections to web/app tiers in AWS’ [35]. Similarly, Microsoft offers the HPC Pack, which allows the creation of a hybrid high performance cluster consisting of an on-premises head node and some worker nodes deployed on-demand in Azure [36]. Even though big enterprises in cloud computing have recognized the potential of hybrid systems that consist of public clouds as well as private clouds and clusters, research in this domain is still very recent and nascent. The work in [37] studies the dynamic allocation of resources across multiple clouds, focusing on their inter-trust relationships. Similarly, [38] proposes a way to build federations so that penalties due to possible violation of service quality by untrustworthy providers are minimized. The work in [39] presents an overall vision for the creation of migratable, self-managed virtual elastic clusters on hybrid cloud infrastructures. The work in [40] is more elaborate and proposes a cost-based online decision algorithm for outsourcing jobs within a given budget. Beyond public and private clouds and clusters, research has not yet considered the inclusion of crowd computing in a hybrid system. The work in [41] is a first step in this direction, and envisions volunteer cloud federations, in which clouds may join and leave without restrictions and may contribute resources without long-term commitment. The existing research results are mostly exploratory and preliminary; furthermore, they are based on cloud simulations. Concerning the deployment of BDAs on large infrastructures, we need to take this research one step further, by producing mature and complete schemes for the creation of hybrid systems that dynamically involve public and private clouds, clusters and crowd, thoroughly tested for real Big Data applications on real infrastructures.
3 Methodology for Deploying BDAs on Large Infrastructures
In order to manage the performance of BDAs, we argue that we need a methodology that includes four steps: profiling of BDAs, exploration of alternative storage and processing environments, performance prediction, and, finally, performance optimization. In the following we give details of this methodology.
A. Profiling BDAs
The first step is to profile BDAs in order to understand their characteristics and the qualitative role these play in their performance. For this we need to create a general model for BDAs that can be employed for such characterization. Based on this model, we will create a profiling methodology which, from observed measurements, will give us the characterization of a specific application with respect to the general model. We will measure and model both CPU and I/O performance.
a. Modeling BDAs
The application space A includes the following subspaces: workload subspace W, data subspace D, and resource subspace R. Thus, A = W × D × R and points a ∈ A represent feasible combinations of workload, data and resource characteristics. We need to model the behavior of BDAs in terms of the dimensions of W, D and R. Such dimensions may be interrelated. Our focus should be on identifying such interrelations, as well as combinations of W, D and R instances that appear in realistic situations of Big Data management and have a distinct impact on the performance behavior. Some of the dimensions to consider are:
Workload dimensions:

1) Size of workload in terms of number and size of processing tasks and/or number of users.

2) Type of workload in terms of: batch or iterative processing; CPU, I/O and memory intensity.

3) Data access patterns: e.g. uniform, sequential, and skewed data access with varying degrees of skewness.

4) Task structure and complexity: particular structures and complexities of tasks, for example query plans, can be associated with patterns of data access and associative data access.

Data dimensions:

1) Size of the data collections accessed by the workload.

2) Replication degree and schema of the data collections.

3) The schema of the data and data dependencies.

4) The types of data: e.g. the data can be in a ‘raw’ format (i.e. bytestreams), or have specific unstructured or structured formats (e.g. key-value data, RDF, relational).
Resource dimensions:
We should consider the deployment of applications that utilize either local storage or distributed object stores for the Virtual Machine (hereafter VM) storage device, as well as the deployment of applications directly on the system environment.
b. Creating a profiling methodology for BDAs
We need to create a generic process for the profiling of BDAs along the lines of the model produced in step (a). The profiling, first, samples the input space A appropriately and, second, benchmarks the BDAs in order to outline their performance behavior. The sampling takes as input the range or the set of values for the dimensions of the characteristics of the application in hand. For dimensions that contain a finite and small number of values, e.g. the type of operation {read, write}, all values can be used for sampling. For other dimensions with infinite values, e.g. the size of data, or a very large number of values, e.g. the number of VMs, some of the values will be sampled according to exploratory experiments, or pattern matching between already benchmarked generic applications. The sampling creates a subspace of A that can be used for performance measurements; acquiring these constitutes essentially the benchmarking of the BDA.
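As a concrete illustration, the sketch below enumerates benchmark configurations from a handful of profiling dimensions: finite dimensions are expanded fully, while unbounded ones are sampled. The dimension names and value ranges are illustrative assumptions, not fixed choices of the methodology.

```python
# A sketch of the sampling step, assuming each dimension declares either a
# finite value set (used exhaustively) or a sampler for unbounded ranges.
# Dimension names and value ranges are illustrative.
import itertools
import random

dimensions = {
    "operation":    {"values": ["read", "write"]},                  # finite: use all
    "data_size_gb": {"sample": lambda: random.choice([1, 10, 100, 1000])},
    "num_vms":      {"sample": lambda: random.randint(1, 64)},
}

def sample_input_space(dims, n_samples=4):
    """Build the subspace of A on which the BDA will be benchmarked."""
    axes = []
    for name, spec in dims.items():
        if "values" in spec:                       # small finite dimension
            axes.append([(name, v) for v in spec["values"]])
        else:                                      # infinite or very large dimension
            axes.append([(name, spec["sample"]()) for _ in range(n_samples)])
    return [dict(point) for point in itertools.product(*axes)]

for config in sample_input_space(dimensions):
    pass  # run the BDA under `config` and record its performance measurements
```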
B. Exploration of alternative storage and processing environments
The second step of the methodology is to explore system architectures, i.e. instances in the system space S, that are commonly found in environments that run BDAs and that have substantial differences affecting the performance of BDAs. The BDAs may present very different behavior when deployed on different environments, because of the architectural complexity of the I/O path, the network connectivity and the heterogeneity of the nodes with respect to CPU and memory. We need to explore cloud, cluster and crowd environments.
a. Cloud computing
In a cloud environment, the nodes act independently and need not be in the same physical location. The memory, storage device and network communication are managed by the operating system of the basic physical cloud units. This software supports the management of the basic physical units and virtualization. We observe that the architectural complexity of the I/O path in a virtualized environment, such as the cloud, poses a serious burden in modeling the performance of a BDA. An I/O request may have to go through the VM main memory, some hypervisor-dependent drivers, the VM host memory and the network before it finally reaches the storage system, which, in turn, may have its own caches. As data centers display high heterogeneity, the structure of this path varies across different VM containers even within the same data center. Thus, a hierarchical analytical approach, as in [13, 14] for disk arrays, which requires a thorough evaluation and modeling of each system component separately, seems to be highly impractical: results might be too complex to analyze or combine for a definitive I/O and, therefore, BDA modeling, especially given the numerous choices of different setups, vendors and hardware involved. In order to avoid tackling all this complexity, we need to treat the I/O subsystem path as a ‘black box’ and attempt to characterize it in an end-to-end fashion. Thus, we need to model I/O performance as it is perceived by the application end-user. Since read and write operations may be fulfilled directly from caches, cache effects can be incorporated into the end-to-end I/O model.
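As a sketch of what such black-box modeling could look like, the fragment below fits end-to-end latency measurements to workload features with ordinary least squares, without modeling any component of the virtualized I/O path. The feature set, the sample numbers and the linear form are all illustrative assumptions.

```python
# A sketch of end-to-end, black-box I/O modeling: only (workload features,
# measured latency) pairs are observed from the application side, and the
# whole virtualized I/O path, caches included, stays opaque. The feature
# set, sample values and linear model are illustrative assumptions.
import numpy as np

# one row per benchmark run: request size (KB), read fraction, queue depth
X = np.array([[4, 1.0, 1],
              [64, 0.5, 8],
              [256, 0.8, 16],
              [1024, 0.0, 32]], dtype=float)
y = np.array([0.4, 1.9, 6.2, 21.0])      # measured end-to-end latency (ms)

# fit latency ≈ X·w + b by least squares, treating the I/O path as a black box
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_latency(size_kb: float, read_frac: float, qdepth: float) -> float:
    """End-to-end latency prediction as perceived by the application end-user."""
    return float(np.array([size_kb, read_frac, qdepth, 1.0]) @ w)
```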
b. Cluster computing
In a cluster environment, a group of nodes is run as a single entity. The nodes are normally connected to each other using a fast local area network. Performance and fault tolerance are the two reasons for deploying an application on a cluster rather than on a single computer. In a cluster, a BDA is deployed directly on the operating system, rather than in VMs. The cluster can offer parallel execution of a BDA, by employing many processors simultaneously. The cluster environment is homogeneous, as the nodes are usually in the same physical location, have the same hardware and operating system, and are directly connected with the same network. Therefore, it is possible to model the environment in more detail than the cloud. For the cluster, we need to treat the I/O subsystem path as an ‘open box’, as we can have knowledge of the network connectivity and the memory of the nodes.
c. Crowd computing
A crowd environment is a combination of cloud and volunteer computing. Like cloud computing, crowd computing offers elastic, on-demand resources for the deployment of BDAs. As in the cloud, the BDAs run on a virtualized environment. The nodes of the system are usually personal computers, which, naturally, are not in the same physical location. The VMs are shipped to the computers from a central location, usually a cloud or a cluster environment, and partial results produced by remote VMs are transmitted back to the central location through the Internet. The modeling of the I/O subsystem path for the crowd has the same problems as for the cloud, as the environment is once more virtualized and heterogeneous. However, in this case we also have the problem of remote access to nodes, as well as of their availability fluctuations, due to their autonomy. These two factors make the crowd environment very dynamic and harder to model than the cloud. For the crowd, we need to treat the whole BDA performance in an end-to-end fashion, as perceived by the central location.
C. Performance prediction
Based on the profiling produced in step A, we can develop methods for the prediction of the performance of BDAs deployed in any of the three computing environments explored in step B, and combinations of them. We can achieve this with the creation of multi-agent utilization models and the analysis of the computing environment with numerical and, whenever possible, analytical methods.
a. Creation of a multi-agent utilization model
Based on the results of steps A and B, we can create a multi-agent model of the utilization of the computing resources by the users. The computing systems explored in step B usually host multiple applications and/or multiple instances of the same application. The model will have the type of BDAs and the number of BDA instances, as well as the type and size of the system on which they are deployed, as adjustable parameters. BDA instances can be described by agents whose needs are the results of a probability distribution determined by their profiling. As a simple example, we can start by considering Discrete Event Simulation (DES) [42] approaches, in which the events are the requests for running a BDA instance on a given amount of resources, i.e. a given system. If the requested resources are available, the BDA is executed; if not, it goes into a waiting queue. Upon completion, we can obtain the BDA performance and the cost of using the system.
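A minimal sketch of such a DES follows, assuming Poisson arrivals and uniformly distributed resource demands purely for illustration; in the envisioned model these distributions would come from the profiling of step A.

```python
# A minimal discrete-event simulation in the spirit of the text: events are
# requests to run a BDA instance on some amount of resources; a request waits
# in a queue while resources are unavailable. Arrival and service
# distributions are illustrative stand-ins for profiled ones.
import heapq
import random

def simulate(total_nodes=16, n_requests=50, seed=1):
    rng = random.Random(seed)
    free = total_nodes
    waiting = []                      # queued requests: (submit time, nodes, duration)
    events = []                       # event heap: (time, kind, payload)
    t = 0.0
    for _ in range(n_requests):
        t += rng.expovariate(1.0)                       # Poisson arrivals
        heapq.heappush(events, (t, "arrive", (rng.randint(1, 4), rng.uniform(1, 5))))
    completions = []
    while events:
        now, kind, payload = heapq.heappop(events)
        if kind == "arrive":
            waiting.append((now, *payload))
        else:                                           # "finish": release nodes
            free += payload
        still_waiting = []
        for submit, nodes, duration in waiting:         # start whatever now fits
            if nodes <= free:
                free -= nodes
                heapq.heappush(events, (now + duration, "finish", nodes))
                completions.append(now + duration - submit)   # response time
            else:
                still_waiting.append((submit, nodes, duration))
        waiting = still_waiting
    return completions    # response times give performance; node-hours give cost
```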
b. Numerical and analytical analysis of the computing environment
We need to study the scaling behavior of the agent-based model under a range of conditions, to elucidate scaling laws that describe the performance and cost of sharing computing resources, amongst multiple BDAs, in cloud, cluster and crowd systems. We can vary the adjustable parameters of the multi-agent model, i.e. the type and number of BDAs, and the type and size of the system, and we can determine, through numerical simulation, mappings between the computing environment (the application and system) and performance, i.e. functions f : I → P. We can also conduct an analytical approach with simplified models for special cases of computing environments, for example BDAs with a limited range of values for workload, data and resource dimensions, which run on a cluster.
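For instance, a scaling law could be extracted from simulation output as below, assuming a power-law form T(n) ≈ a·n^b fitted in log-log space; the numbers shown are illustrative, as if produced by runs of the simulator sketched above.

```python
# A sketch of eliciting a scaling law from simulation output, assuming a
# power-law form T(n) ≈ a·n^b for the mean response time T at system size n.
# The sample values are illustrative stand-ins for simulator output.
import numpy as np

sizes = np.array([4, 8, 16, 32, 64], dtype=float)        # system sizes swept
mean_resp = np.array([9.1, 4.8, 2.6, 1.5, 0.9])          # e.g. from simulate()

b, log_a = np.polyfit(np.log(sizes), np.log(mean_resp), 1)   # fit in log-log space

def predicted_response(n: float) -> float:
    """One fitted slice of f : I -> P, with only system size varying."""
    return float(np.exp(log_a) * n ** b)
```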
D. Performance optimization for BDAs
The fourth step of the methodology is to employ the performance predictions from step C in order to create performance optimization schemes for BDAs. A monitoring process can observe the real performance of a BDA already deployed in a system and can pull performance statistics at specific time points; the monitoring schedule can be customized according to the characteristics of the specific BDA and can be decided either beforehand or on the fly, while the BDA is executing. The monitoring process can initiate and re-initiate, if needed, the profiling and performance prediction described in steps A and C. Based on measurements and predictions, an advisor will suggest schemes for performance optimization. We can create two types of schemes that can improve the performance of a BDA with respect to cost limitations concerning the system utilization. The first aims to reduce the execution cost (e.g. response time or throughput) and the second to accommodate it:
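A compact sketch of the monitoring loop described above follows; the hooks measure(), predict() and reprofile(), the drift threshold and the period are all hypothetical, standing in for whatever statistics and re-profiling machinery a concrete deployment provides.

```python
# A sketch of the monitoring process, with hypothetical hooks: measure()
# pulls a current performance statistic, predict() returns the expected
# value, and reprofile() re-initiates profiling and prediction (steps A, C).
import time

def monitor(measure, predict, reprofile, threshold=0.25, period_s=60):
    """Re-initiate profiling when observed performance drifts from prediction."""
    while True:
        observed, expected = measure(), predict()
        if abs(observed - expected) > threshold * expected:
            reprofile()               # refresh the profile and prediction model
        time.sleep(period_s)          # schedule could also be adapted on the fly
```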
a. Reduce execution cost through workload approximation and summarization
Workload Approximation: The reduction of execution cost can be done by altering the workload of the BDA. We can explore ways to approximate or summarize the workload. Workload approximation can be achieved by relaxing the workload with respect to tasks that are I/O, CPU or memory intensive. Relaxed versions of the workload can be explored based on their comparison with the original version with respect to I/O, CPU or memory. We can deduce such comparisons based on the prediction functions created in step C.
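One way to operationalize this comparison is sketched below, under the assumption that a predicted-cost function from step C and some application-level quality measure are available; both function names are illustrative.

```python
# A sketch of exploring relaxed workload versions, assuming predict_cost()
# derives from the prediction functions of step C and quality() measures how
# close a relaxed workload stays to the original. Both are hypothetical.
def pick_relaxation(original, candidates, predict_cost, quality, min_quality=0.9):
    """Among relaxed workloads that stay close enough to the original,
    return the one with the lowest predicted execution cost."""
    acceptable = [w for w in candidates if quality(w, original) >= min_quality]
    return min(acceptable, key=predict_cost, default=original)
```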
Workload Summarization: Beyond approximation, the execution cost can be reduced with workload summarization. We can achieve this by mining common tasks within and across workloads, and by scheduling workload execution so that common tasks are executed once or a small number of times. This is possible especially for BDAs with multiple instances, in which, frequently, workloads contain some core identical tasks and some parameterized tasks with different parameter values.
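The sketch below illustrates the idea, assuming tasks can be keyed by a canonical signature so that identical tasks are detected across workload instances; the task names are made up.

```python
# A sketch of workload summarization, assuming tasks are hashable by a
# canonical signature: tasks shared across workload instances run once and
# their results are reused. Task names are illustrative.
from collections import Counter

def summarize(workloads):
    """Return (shared tasks to execute once, per-workload residual tasks)."""
    counts = Counter(task for w in workloads for task in set(w))
    shared = {task for task, c in counts.items() if c > 1}    # core identical tasks
    residuals = [[task for task in w if task not in shared] for w in workloads]
    return shared, residuals

shared, residuals = summarize([["scan_A", "agg_B", "filter(x=1)"],
                               ["scan_A", "agg_B", "filter(x=2)"]])
# "scan_A" and "agg_B" execute once; only the parameterized filters re-run.
```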
b. Accommodate execution cost through expansion and/or change of computing system
If reduction of the execution cost is not a choice, we can accommodate this cost by altering the computing system. The simulation of the computing environment through multi-agent models allows a quantitative and direct comparison of the performance and utilization cost of BDAs running on cloud (both commercial and private), cluster and crowd systems. We can explore the possibility of creating systems with guarantees from stochastic resources and of improving these guarantees by hybridization with dedicated resources. To achieve this, we can employ the results of step C, f : I → P, in order to create iso-quality relations between BDAs and system combinations, i.e. find functions fP,C : A′ → S′, A′ ⊂ A, S′ ⊂ S, that guarantee a specific performance P with a specific cost C for a set of application instances A′ and a set of system instances S′. We can use these iso-quality relations in order to create guidelines for the online and offline deployment of BDAs on combinations of public and private clouds, clusters and crowd systems, which guarantee the accommodation of the execution cost together with some guarantees for the system utilization cost. Our ultimate goal should be to select application and system combinations, i.e. areas I′ ⊆ I, for which the performance is maximized and the system utilization cost is minimized.
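A small sketch of such an iso-quality mapping follows, assuming the predictor f from step C and a cost model are given; the function names and the single performance target are illustrative simplifications.

```python
# A sketch of an iso-quality mapping f_{P,C}: A' -> S', assuming a predictor
# f(a, s) from step C and a cost model cost(s); for each application we keep
# the cheapest system combination that still meets the performance target.
def iso_quality_map(apps, systems, f, cost, target_perf):
    mapping = {}
    for a in apps:
        ok = [s for s in systems if f(a, s) >= target_perf]   # guarantee P
        if ok:
            mapping[a] = min(ok, key=cost)                    # minimize C
    return mapping    # deployment guideline: each a ∈ A' mapped to some s ∈ S'
```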
E. Research deployment
The described research needs to be carried out in a combination of big computing environments, which, together, will allow us to investigate a very big part of the spectrum of BDA and system characteristics. In such environments we will face the challenging task of handling a big range of applications on a big range of datasets. For these, we need to collect and mine an extreme volume of time-dependent data about their operation, and produce timely profiles and predictions, which is on its own a Big Data analytics problem. As an example of the appropriateness of such environments for the deployment of the described research, we show in Fig. 1 the complementarity of the environments of the Worldwide LHC Computing Grid (WLCG) at CERN in high-energy physics, Vital-IT (part of the Swiss Institute of Bioinformatics (SIB)) in bioinformatics, and Baobab, the high performance computing cluster of the University of Geneva. The applications of CERN and many applications of Vital-IT are BDAs in the classical sense of the 4 Vs (volume, variety, velocity, veracity). The rest of the applications of Vital-IT and the applications of Baobab process large or numerous small datasets from different sources. Since all these infrastructures are very big, they are heterogeneous; however, Baobab and Vital-IT have homogeneous parts.
4 Conclusion
We have described our vision for research on the deployment of BDAs that run on systems operating on large infrastructures, in order to achieve optimal performance while taking into account running costs. We propose a methodology that models and profiles applications and explores alternative systems for their execution on hybridizations of cloud, cluster and crowd. The methodology continues with the employment of predictions to create schemes for performance optimization with respect to cost limitations for system utilization.
References
Jennings, B., Stadler, R.: Resource management in clouds: survey and research challenges. J. Netw. Syst. Manag. 23(3), 567–619 (2014). https://doi.org/10.1007/s10922-014-9307-7
Cuomo, A., Rak, M., Villano, U.: Performance prediction of cloud applications through benchmarking and simulation. Int. J. Comput. Sci. Eng. 11(1), 46–55 (2015)
Petcu, D., et al.: Architecturing a sky computing platform. In: Cezon, M., Wolfsthal, Y. (eds.) ServiceWave 2010. LNCS, vol. 6569, pp. 1–13. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22760-8_1
Li, A., Zong, X., Kandula, S., Yang, X., Zhang, M.: CloudProphet: towards application performance prediction in cloud. SIGCOMM-Comput. Commun. Rev. 41(4), 426 (2011)
Herodotou, H., Dong, F., Babu, S.: No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics. In: SoCC 2011 (2011). Article no: 18
Mozafari, B., Curino, C., Madden, S.: DBSeer: resource and performance prediction for building a next generation database cloud. In: CIDR 2013 (2013)
Yoon, D.Y., Mozafari, B., Brown, D.P.: DBSeer: pain-free database administration through workload intelligence. PVLDB 8(12), 2036–2047 (2015)
Zhang, Y., Wang, Z., Gao, B., Guo, C., Sun, W., Li, X.: An effective heuristic for on-line tenant placement problem in SaaS. In: ICWS, pp. 425–432 (2010)
Liu, Z., Hacigümüs, H., Moon, H.J., Chi, Y., Hsiung, W.-P.: PMAX: tenant placement in multitenant databases for profit maximization. In: EDBT 2013, pp. 442–453 (2013)
Curino, C., Jones, E.P.C., Madden, S., Balakrishnan, H.: Workload-aware database monitoring and consolidation. In: SIGMOD 2011, pp. 313–324 (2011)
Ahmad, M., Duan, S., Aboulnaga, A., Babu, S.: Predicting completion times of batch query workloads using interaction-aware models and simulation. In: EDBT (2011)
Duggan, J., Papaemmanouil, O., Çetintemel, U., Upfal, E.: Contender: a resource modeling approach for concurrent query performance prediction. In: EDBT 2014, pp. 109–120 (2014)
Ruemmler, C., Wilkes, J.: An introduction to disk drive modeling. IEEE Comput. 27(3), 17–28 (1994)
Uysal, M., Alvarez, G.A., Merchant, A.: A modular analytical throughput model for modern disk arrays. In: MASCOTS (2001)
Anderson, E.: Simple table-based modeling of storage devices. Technical report, HP Labs (2001)
Wang, M., Au, K., Ailamaki, A., Brockwell, A., Faloutsos, C., Ganger, G.R.: Storage device performance prediction with CART models. In: MASCOTS (2004)
Chen, P., Patterson, D.A.: A new approach to I/O performance evaluation-self scaling I/O benchmarks, predicted I/O performance. In: SIGMETRICS (1993)
Ipek, E., de Supinski, B.R., Schulz, M., McKee, S.A.: An approach to performance prediction for parallel applications. In: Cunha, J.C., Medeiros, P.D. (eds.) Euro-Par 2005. LNCS, vol. 3648, pp. 196–205. Springer, Heidelberg (2005). https://doi.org/10.1007/11549468_24
Gulati, A., Kumar, C., Ahmad, I.: Storage workload characterization and consolidation in virtualized environments. In: VPACT (2009)
Kraft, S., Casale, G., Krishnamurthy, D., Greer, D., Kilpatrick, P.: Performance models of storage contention in cloud environments. Softw. Syst. Model. 12(4), 681–704 (2013). https://doi.org/10.1007/s10270-012-0227-2
Delimitrou, C., Sankar, S., Vaid, K., Kozyrakis, C.: Decoupling datacenter studies from access to large-scale applications: a modeling approach for storage workloads. In: IISWC (2011)
Potti, N., Patel, J.M.: DAQ: a new paradigm for approximate query processing. PVLDB 8(9), 898–909 (2015)
Fan, W., Geerts, F., Libkin, L.: On scale independence for querying big data. In: PODS (2014)
Cao, Y., Fan, W., Yu, W.: Bounded conjunctive queries. PVLDB 7(12), 1231–1242 (2014)
Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C., Suel, T.: Optimal histograms with quality guarantees. In: VLDB (1998)
Garofalakis, M.N., Gibbons, P.B.: Wavelet synopses with error guarantees. In: SIGMOD (2004)
Agarwal, S., et al.: Knowing when you’re wrong: building fast and reliable approximate query processing systems. In: SIGMOD (2014)
Agarwal, S., Mozafari, B., Panda, A., Milner, H., Madden, S., Stoica, I.: BlinkDB: queries with bounded errors and bounded response times on very large data. In: EuroSys (2013)
Chaudhuri, S., Kolaitis, P.G.: Can datalog be approximated? JCSS 55(2), 355–369 (1997)
Barcelo, P., Libkin, L., Romero, M.: Efficient approximations of conjunctive queries. SICOMP 43(3), 1085–1130 (2014)
Fink, R., Olteanu, D.: On the optimal approximation of queries using tractable propositional languages. In: ICDT (2011)
Fan, W., Li, J., Ma, S., Tang, N., Wu, Y., Wu, Y.: Graph pattern matching: from intractability to polynomial time. PVLDB 3(1), 1161–1172 (2010)
Cisco Intercloud Fabric. http://www.cisco.com/c/en/us/products/cloud-systems-management/intercloud-fabric/index.html
Logicworks: Hybrid cloud for legacy applications. https://reinvent.awsevents.com/files/sponsors/Logicworks_Hybrid_Cloud_Legacy_Applications_WP.pdf
Lo, N.-W., Liu, P.-Y.: An efficient resource allocation framework for cloud federations. J. Inf. Technol. Control 44(1) (2015)
Hassan, M.M., Alelaiwi, A., Alamri, A.: A dynamic and efficient coalition formation game in cloud federation for multimedia applications. In: GCA (2015)
Calatrava, A., Moltó, G., Romero, E., Caballer, M., de Alfonso, C.: Towards migratable elastic virtual clusters on hybrid clouds. In: IEEE CLOUD (2015)
Niu, Y., Luo, B., Liu, F., Liu, J., Li, B.: When hybrid cloud meets flash crowd: towards cost-effective service provisioning. In: IEEE INFOCOM (2015)
Rezgui, A., Rezgui, S.: A stochastic approach for virtual machine placement in volunteer cloud federations. In: IEEE IC2E (2014)
Pllana, S., Fahringer, T.: Performance prophet: a performance modeling and prediction tool for parallel and distributed programs. In: ICPP Workshops (2005)