CN110046048B - Load balancing method based on workload self-adaptive fast redistribution - Google Patents
Load balancing method based on workload self-adaptive fast redistribution
- Publication number
- CN110046048B (application CN201910314058.6A)
- Authority
- CN
- China
- Prior art keywords
- performance
- node
- model
- workload
- computing nodes
- Prior art date
- Legal status
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a load balancing method based on adaptive fast redistribution of workload (AdaptFR). The method obtains the performance parameters of each computing node through a performance monitoring tool and then redistributes the workload of each computing node according to a performance model, so that slow nodes receive less computation and fast nodes receive more. This balances the time the nodes need to complete a single iteration, which indirectly balances the cluster load and improves the performance of model training.
Description
Technical Field
The invention belongs to the technical field of distributed machine learning acceleration, and in particular relates to a load balancing method based on adaptive fast redistribution of workload.
Background
With the advent of the big data era, large-scale machine learning has become an important component of many modern application services. To cope with the complexity of big data and reduce application computing time, more and more machine learning algorithms are being implemented in a parallel, distributed fashion, and distributed machine learning has gradually become a research hotspot.
Iterative-convergence algorithms form an important subset of machine learning algorithms. Such algorithms start from a randomly generated solution and refine it by repeatedly iterating over the input data until convergence. An iterative-convergence algorithm typically partitions the input data and then performs distributed model training under the Bulk Synchronous Parallel (BSP) model, where the distributed system consists of one or more parameter servers and a set of computing nodes. Data parallelism based on a parameter server system is the parallelization scheme adopted here, and the training process is as follows (a minimal sketch is given after the list):
1) The model parameters are randomly initialized, and subsequent updates of the model parameters are maintained by the parameter server.
2) The current global model parameters are distributed to every computing node, and each node maintains a local copy of the model parameters.
3) The training-set samples are divided into sub-datasets of equal size and distributed to the computing nodes.
4) Iterative training is performed: each computing node updates its local model copy by training on its corresponding sub-dataset.
5) The system synchronously waits for all computing nodes to finish uploading their local updates, and the parameter server then distributes the new global model parameters to all computing nodes.
6) If the total number of iterations has not reached the preset maximum, training continues from step 4); otherwise, training ends.
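For concreteness, the following is a minimal runnable sketch of this BSP data-parallel loop on a toy linear model; the toy data, learning rate, and function names are illustrative and not part of the invention.

```python
# A minimal, runnable sketch of BSP data-parallel SGD on a toy linear model.
# The toy data, learning rate, and worker count are illustrative only.
import numpy as np

def local_update(params, x, y, lr=0.1):
    """One gradient step over this worker's shard; returns the parameter delta."""
    grad = x.T @ (x @ params - y) / len(y)
    return -lr * grad

def bsp_train(num_workers=4, num_iterations=100):
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4000, 5))
    y = x @ np.arange(5.0)
    params = rng.normal(size=5)                              # step 1: random init
    shards = np.array_split(np.arange(len(y)), num_workers)  # step 3: equal shards
    for _ in range(num_iterations):                          # step 6: iteration budget
        # steps 2 and 4: every worker trains on a copy of the current global params
        deltas = [local_update(params.copy(), x[s], y[s]) for s in shards]
        # step 5: synchronization barrier, aggregate all updates, then broadcast
        params = params + np.mean(deltas, axis=0)
    return params

if __name__ == "__main__":
    print(bsp_train())   # converges toward [0, 1, 2, 3, 4]
```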
The straggler (lag) problem: due to unbalanced cluster load, computing nodes with poor performance fall behind and slow down the overall training speed.
The main problem of the BSP model is this straggler problem. As the number of computing nodes grows and the time each node needs to complete a single training iteration changes dynamically, the straggler problem becomes more severe, greatly degrading the performance of model training under the BSP model.
To address the straggler problem of the BSP model, Dean et al. proposed the Asynchronous Parallel (ASP) model for distributed machine learning, in which a computing node may start the next iteration with its local model parameters before receiving the global model parameters. This avoids the stragglers caused by unbalanced cluster load and greatly reduces the time cost of model training. However, the ASP model stretches the fault tolerance of iterative algorithms without bound: the machine learning model may fall into a local optimum, convergence to the optimal solution cannot be guaranteed, and high accuracy cannot be guaranteed. Qirong Ho et al. proposed the Stale Synchronous Parallel (SSP) model, which allows each computing node to use stale (non-latest) global model parameters during iterative training, greatly reducing the waiting time of the computing nodes, while strictly limiting the number of iterations that may use stale global model parameters so as to guarantee model convergence; a minimal sketch of this staleness rule is given after this paragraph. However, the ability of the SSP model to balance cluster load is limited; when the load differences within the cluster are large it cannot balance the load well, so it does not completely solve the straggler problem. How to improve the performance of distributed machine learning model training therefore remains an urgent problem.
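As an illustration of the SSP mechanism described above (background, not the invention), the bounded-staleness rule can be sketched as follows; the staleness bound and variable names are assumptions made for illustration.

```python
# Illustrative sketch of the SSP bounded-staleness rule (background, not the invention).
def may_proceed(worker_clock: int, slowest_clock: int, staleness_bound: int) -> bool:
    """A worker may start its next iteration on possibly stale global parameters
    only while it is at most `staleness_bound` iterations ahead of the slowest
    worker; otherwise it must wait for fresher global parameters."""
    return worker_clock - slowest_clock <= staleness_bound

# Example: with a bound of 3, a worker at clock 10 may run while the slowest
# worker is at clock 7, but must block while the slowest is still at clock 6.
assert may_proceed(10, 7, 3)
assert not may_proceed(10, 6, 3)
```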
Disclosure of Invention
The technical problem the invention aims to solve is how to reduce the influence of unbalanced cluster load on distributed machine learning model training, solve the straggler problem while ensuring a certain level of accuracy, and improve the overall performance of model training.
In distributed machine learning model training based on an iterative-convergence algorithm, each computing node repeatedly enters the next training iteration until a stopping condition is reached. In traditional distributed machine learning model training, the training data each computing node processes per iteration has an equal, fixed size.
The technical scheme adopted by the invention to solve this technical problem is as follows: a performance monitoring module collects the performance parameters of the computing nodes in real time, and the workload of each computing node is then adaptively adjusted according to these performance parameters. The method is realized by the following steps:
step 1: a parameter server system is employed. One node acts as a parameter server and the other nodes act as compute nodes.
Step 2: deploy the performance monitoring module. A performance monitoring tool (Ganglia) is deployed on the parameter server and the computing nodes to collect the node performance parameters in real time.
Step 3: the parameter server obtains the performance parameters of the computing nodes through the performance monitoring module.
Step 4: according to the collected performance parameters, the parameter server uses the performance index model to calculate the workload of the next training iteration of each computing node and sends it to the computing nodes.
Step 5: each computing node enters the next training iteration with its updated workload.
In this way, slow nodes receive less computation and fast nodes receive more, so the time needed to complete a single iteration is balanced across the nodes, the cluster load is indirectly balanced, and the performance of model training is improved.
Beneficial effects of the invention: when training a distributed machine learning model, a distributed system based on the BSP model is constantly constrained by the straggler problem, so that in a real environment a large amount of system resources is wasted and the overall performance is very low. The invention dynamically and rapidly adjusts the workload of each training iteration of every computing node and effectively balances the performance differences among the cluster nodes, thereby effectively alleviating the straggler problem and greatly improving the performance of model training.
Drawings
FIG. 1 is a performance monitoring system architecture.
Fig. 2 is a Caffe-based distributed machine learning framework.
FIG. 3 is a flow chart of the training of a parallel computing model (A-DSP) based on the AdaptFR method.
FIG. 4 is a comparison of the accuracy of different computational models when cluster node performance is similar.
FIG. 5 is a comparison of the training times of different computational models when cluster node performance is similar.
FIG. 6 is a comparison of the accuracy of different computational models when cluster node performance differs widely.
FIG. 7 is a comparison of the training times of different computational models when cluster node performance differs widely.
Detailed Description
The invention is further described with reference to the accompanying drawings and specific implementation steps:
A load balancing method based on adaptive fast redistribution (AdaptFR) of workload comprises the following steps:
step 1: a parameter server system is employed. One node acts as a parameter server and the other nodes act as compute nodes.
As shown in Fig. 2, the invention implements the parameter server in a multithreaded manner: each transmission thread on the parameter server corresponds to one computing node and maintains the communication between the parameter server and that node; meanwhile, a dedicated thread on the parameter server serves as the parameter update manager and handles the updates of the global model parameters. A minimal sketch of this thread layout is given below.
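The sketch below illustrates one transmission thread per computing node plus one update-manager thread; the connection objects, queueing scheme, and method names are assumptions made for illustration, not the actual implementation.

```python
# Illustrative sketch of the multithreaded parameter server: one transmission
# thread per computing node plus one dedicated parameter-update-manager thread.
# The connection objects (recv_update/send_params) are hypothetical.
import queue
import threading

class ParameterServer:
    def __init__(self, node_conns):
        self.params = {}                       # global model parameters
        self.update_queue = queue.Queue()      # local updates awaiting application
        self.lock = threading.Lock()
        # one transmission thread per computing node
        self.tx_threads = [
            threading.Thread(target=self._serve_node, args=(c,), daemon=True)
            for c in node_conns
        ]
        # one dedicated thread handles all global parameter updates
        self.update_manager = threading.Thread(target=self._manage_updates, daemon=True)

    def _serve_node(self, conn):
        while True:
            self.update_queue.put(conn.recv_update())   # local update from the node
            with self.lock:
                conn.send_params(dict(self.params))     # current global parameters back

    def _manage_updates(self):
        while True:
            update = self.update_queue.get()            # apply updates one at a time
            with self.lock:
                for key, delta in update.items():
                    self.params[key] = self.params.get(key, 0.0) + delta

    def start(self):
        for t in self.tx_threads:
            t.start()
        self.update_manager.start()
```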
Step 2: deploy the performance monitoring module. Performance monitoring tools are deployed on the parameter server and the computing nodes to collect the node performance parameters in real time.
The invention uses the distributed monitoring tool Ganglia as the cluster monitoring system, which mainly comprises the following parts: a monitoring daemon (Gmond) that monitors the performance indexes of the computing nodes, a meta daemon (Gmetad) that collects and aggregates the data, and a web front end (Ganglia-web, Gweb) that visually displays the performance index data.
Step 3: the parameter server obtains the performance parameters of the computing nodes through the performance monitoring module.
As shown in Fig. 1, the performance monitoring system is mainly used to monitor the various performance indexes of the computing nodes. A monitoring daemon Gmond is therefore deployed on every computing node. Gmond first collects the performance index data of its own node and then sends it to the other computing nodes over the UDP protocol in External Data Representation (XDR) format. UDP offers good real-time behaviour and low resource consumption, so the data reaches the other computing nodes as quickly as possible; XDR is well suited to distributed use because it encapsulates the data in a form independent of the transmission medium, so it can be exchanged between heterogeneous nodes. In this way, every computing node holds the performance index data of the entire cluster. On the parameter server, the performance monitoring system collects the performance indexes of the computing nodes and then adjusts the workload of each node's iterative computation according to the performance index model. A meta daemon Gmetad is therefore deployed on the parameter server. Gmetad periodically polls the Gmond daemons over TCP for the collected performance index data. TCP is highly reliable, with no errors, losses, or duplicates, which guarantees that the performance index data sent to the parameter server is correct. Because every Gmond holds the complete cluster performance index data, Gmetad can pull the data from another Gmond when one of them fails, which keeps the monitoring system robust. In addition, a visual monitoring web page, Gweb, is deployed on the parameter server; Gweb provides the various performance indexes the performance monitoring system needs to judge node computing performance. A hedged sketch of how the parameter server could pull these metrics follows.
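The following sketch assumes a stock Ganglia install in which Gmetad (or a Gmond) serves its aggregated state as XML over TCP; the port number 8651 and the metric name cpu_idle are assumptions about a default Ganglia configuration, not details of the invention.

```python
# Hedged sketch: read the Ganglia XML dump over TCP and derive a rough per-host
# "CPU occupancy" as 100 - cpu_idle. Port and metric name assume default Ganglia.
import socket
import xml.etree.ElementTree as ET

def fetch_cpu_occupancy(host="127.0.0.1", port=8651, timeout=5.0):
    """Return {hostname: cpu_occupancy_percent} parsed from a Gmetad/Gmond XML dump."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)     # Ganglia writes the full XML and then closes
            if not data:
                break
            chunks.append(data)
    root = ET.fromstring(b"".join(chunks))
    occupancy = {}
    for host_el in root.iter("HOST"):
        for metric in host_el.iter("METRIC"):
            if metric.get("NAME") == "cpu_idle":
                occupancy[host_el.get("NAME")] = 100.0 - float(metric.get("VAL"))
    return occupancy
```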
Step 4: according to the collected performance parameters, the parameter server uses the performance index model to calculate the workload of the next training iteration of each computing node and sends it to the computing nodes.
Testing shows that CPU occupancy is the performance index that most strongly influences the training speed of the distributed machine learning model. The invention therefore defines the performance model with CPU occupancy as the influence factor. For a cluster with N computing nodes, the CPU occupancy of node i is denoted cpuOccup_i (i = 1, 2, …, N), and the node performance factor δ of the performance index model is

where F is a constant and AVG{cpuOccup_1, …, cpuOccup_N} denotes the average CPU occupancy of all computing nodes.
The AdaptFR method dynamically adjusts the amount of training each computing node performs per iteration according to the node performance factor δ, thereby balancing the cluster load and improving the performance of model training. When the CPU occupancy of the computing nodes differs greatly, i.e. δ is large, the current performance difference between the nodes is large; in that case the workload of the slow nodes' next iteration is reduced substantially, or the workload of the fast nodes' next iteration is increased substantially, to balance the node load. Conversely, if the CPU occupancy of the computing nodes differs little, i.e. δ is small, the current performance difference between the nodes is small; in that case the workload of the slow nodes' next iteration is reduced slightly, or the workload of the fast nodes' next iteration is increased slightly or kept unchanged, which is enough to balance the node load. A minimal sketch of such a redistribution step is given below.
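The patent's own node performance factor δ (with the constant F) is given by the formula above and is not reproduced here; the sketch below therefore uses a simple stand-in rule, weighting each node's share inversely by its CPU occupancy, purely to illustrate the qualitative behaviour that busier (slower) nodes receive less work and faster nodes more.

```python
# Illustrative stand-in for the AdaptFR redistribution step (not the patent's formula):
# split the next iteration's total workload so that busier (slower) nodes get a
# smaller share and less busy (faster) nodes get a larger one.
def redistribute(total_samples, cpu_occup):
    """Split total_samples across nodes, inversely weighted by CPU occupancy (%)."""
    avg = sum(cpu_occup.values()) / len(cpu_occup)
    # weight > 1 for nodes less busy than average, < 1 for busier nodes
    weights = {n: avg / max(occ, 1e-6) for n, occ in cpu_occup.items()}
    total_w = sum(weights.values())
    return {n: int(round(total_samples * w / total_w)) for n, w in weights.items()}

# Example: node "n3" is heavily loaded, so it receives the smallest share.
print(redistribute(6000, {"n1": 30.0, "n2": 35.0, "n3": 80.0}))
```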
Step 5: each computing node enters the next training iteration with its updated workload.
The A-DSP model adds a synchronization mechanism on top of the SSP model's flexible consistency model: in each round of iterative training, once a slow node completes its iteration it enters the synchronization barrier and a global synchronous update is performed. This mechanism prevents the situation in which, when cluster performance is similar, the synchronization condition is reached only very late, the local models become excessively stale, and the final accuracy drops, as shown in Fig. 3. A minimal sketch of this barrier trigger is given below.
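The sketch below assumes the parameter server keeps a per-node iteration counter; this bookkeeping and the class name are assumptions made for illustration, not the invention's actual implementation.

```python
# Hedged sketch of the added synchronization rule: when the slowest node finishes
# an iteration and reaches the barrier, a global synchronous update is triggered.
class SyncTracker:
    def __init__(self, num_workers):
        self.clocks = [0] * num_workers     # iterations completed per node

    def finish_iteration(self, worker_id):
        """Record a finished iteration; return True when a global sync should run,
        i.e. when the node that just finished is (one of) the slowest."""
        self.clocks[worker_id] += 1
        return self.clocks[worker_id] == min(self.clocks)

# Example: with nodes at clocks [3, 5, 6], the slowest node finishing its
# iteration triggers the global synchronous update; a fast node does not.
tracker = SyncTracker(3)
tracker.clocks = [3, 5, 6]
assert tracker.finish_iteration(0)
assert not tracker.finish_iteration(2)
```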
Figs. 4 and 5 compare the accuracy and training time of the different computing models when cluster node performance is similar. The strict consistency model gives the BSP model very high accuracy, but because of the frequent synchronization barriers and the straggler problem, its training time far exceeds that of the other computing models. The SSP model uses a flexible consistency model, which greatly reduces training time; however, when cluster performance is similar it cannot guarantee that the computing nodes update the global model parameters in time, so the accuracy drops sharply and the model is not well suited to this scenario. Under the same conditions, the A-DSP model, through its more flexible synchronization mechanism and the AdaptFR method, greatly reduces the training time while maintaining a certain level of accuracy, and adapts better to this scenario.
Figs. 6 and 7 compare the accuracy and training time of the different computing models when cluster node performance differs greatly. The BSP model still shows high accuracy together with a high training time. Under the same conditions, the accuracy of the A-DSP model is close to that of the SSP model, but its training time is much lower, so the A-DSP model suits this scenario better than the SSP model.
The implementation of the AdaptFR method on the computing nodes and on the parameter server is shown below.
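Since that implementation is given in the original as figures, the following is a hedged reconstruction of the two sides of the loop following steps 3 to 5 above; all object and method names are placeholders, not the patent's actual code.

```python
# Hedged reconstruction of the AdaptFR loop on both sides, following steps 3-5;
# the node/server/monitor objects and their method names are placeholders.
def parameter_server_loop(server, monitor, nodes, max_iters):
    for node in nodes:
        node.send_params(server.params)                      # initial global parameters
    for _ in range(max_iters):
        cpu_occup = monitor.fetch_cpu_occupancy()            # step 3: performance metrics
        workloads = server.performance_model(cpu_occup)      # step 4: per-node workloads
        for node in nodes:
            node.send_workload(workloads[node.name])         # step 4: push new workloads
        updates = [node.recv_update() for node in nodes]     # collect local updates
        server.apply_updates(updates)                        # global synchronous update
        for node in nodes:
            node.send_params(server.params)                  # broadcast new global params

def compute_node_loop(node, max_iters):
    for _ in range(max_iters):
        workload = node.recv_workload()                      # workload for this iteration
        params = node.recv_params()                          # latest global parameters
        update = node.train_one_iteration(params, workload)  # step 5: local training
        node.send_update(update)                             # upload the local update
```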
Claims (3)
1. A load balancing method based on workload self-adaptive fast redistribution is characterized by comprising the following steps:
step 1: adopting a parameter server system, wherein one node is used as a parameter server, and other nodes are used as computing nodes;
step 2: deploying a performance monitoring module, deploying performance monitoring tools on the parameter server and the computing nodes, and collecting node performance parameters in real time;
step 3: the parameter server obtains the performance parameters of the computing nodes through the performance monitoring module;
step 4: the parameter server calculates, according to the collected performance parameters and using the performance index model, the workload of the next training iteration of each computing node, and sends the workload to the computing nodes;
the performance index that most influences the training speed of the distributed machine learning model is the CPU occupancy, so the CPU occupancy is used as the influence factor to define the performance model; for a cluster with N computing nodes, the CPU occupancy of node i is denoted cpuOccup_i (i = 1, 2, …, N), and the node performance factor δ of the performance index model is

wherein F is a constant and AVG{cpuOccup_1, …, cpuOccup_N} denotes the average CPU occupancy of all the computing nodes;
when the CPU occupancy of the computing nodes differs greatly, namely δ is large, the current performance difference between the computing nodes is large, and the workload of the next iteration of the slow nodes is then reduced substantially, or the workload of the next iteration of the fast nodes is increased substantially, to balance the node load; conversely, if the CPU occupancy of the computing nodes differs little, namely δ is small, the current performance difference between the computing nodes is small, and the workload of the next iteration of the slow nodes is then reduced slightly, or the workload of the next iteration of the fast nodes is increased slightly or kept unchanged, which balances the node load;
step 5: each computing node enters the next training iteration with the updated workload;
a synchronization mechanism is added on the basis of the SSP model: in each round of iterative training, once a slow node completes its iteration it enters the synchronization barrier and a global synchronous update is performed.
2. The method of claim 1, characterized in that: the parameter server is implemented in a multithreaded manner, each transmission thread corresponds to one computing node and maintains the communication between the parameter server and that computing node; meanwhile, a dedicated thread serves as the parameter update manager and handles the updates of the global model parameters.
3. The method of claim 1, characterized in that: the performance monitoring tool comprises a monitoring daemon Gmond for monitoring the performance indexes of the computing nodes, a meta daemon Gmetad for collecting and aggregating the data, and Gweb for visually displaying the performance index data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910314058.6A CN110046048B (en) | 2019-04-18 | 2019-04-18 | Load balancing method based on workload self-adaptive fast redistribution |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910314058.6A CN110046048B (en) | 2019-04-18 | 2019-04-18 | Load balancing method based on workload self-adaptive fast redistribution |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110046048A (en) | 2019-07-23
CN110046048B (en) | 2021-09-28
Family
ID=67277861
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910314058.6A Active CN110046048B (en) | 2019-04-18 | 2019-04-18 | Load balancing method based on workload self-adaptive fast redistribution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110046048B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110784555A (en) * | 2019-11-07 | 2020-02-11 | 中电福富信息科技有限公司 | Intelligent monitoring and load scheduling method based on deep learning |
TWI729606B (en) * | 2019-12-05 | 2021-06-01 | 財團法人資訊工業策進會 | Load balancing device and method for an edge computing network |
CN111144584B (en) * | 2019-12-31 | 2024-01-19 | 深圳Tcl新技术有限公司 | Parameter tuning method and device and computer storage medium |
CN111752713B (en) * | 2020-06-28 | 2022-08-05 | 浪潮电子信息产业股份有限公司 | Method, device and equipment for balancing load of model parallel training task and storage medium |
US11609794B2 (en) | 2020-11-10 | 2023-03-21 | Oracle International Corporation | Techniques for modifying cluster computing environments |
CN113806082A (en) * | 2021-09-05 | 2021-12-17 | 济南浪潮数据技术有限公司 | Method, device and equipment for collecting node performance data and readable medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105550374A (en) * | 2016-01-29 | 2016-05-04 | 湖南大学 | Random forest parallelization machine studying method for big data in Spark cloud service environment |
WO2016102738A1 (en) * | 2014-12-22 | 2016-06-30 | Nokia Technologies Oy | Similarity determination and selection of music |
CN109271015A (en) * | 2018-10-10 | 2019-01-25 | 杭州电子科技大学 | A method of reducing large-scale distributed machine learning system energy consumption |
CN109635948A (en) * | 2018-12-19 | 2019-04-16 | 北京达佳互联信息技术有限公司 | On-line training method, apparatus, system and computer readable storage medium |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8660822B2 (en) * | 2010-02-11 | 2014-02-25 | Airbus Engineering Centre India | System and method for generating three dimensional functional space reservation systems of a vehicle |
CN105446979B (en) * | 2014-06-27 | 2019-02-01 | 华为技术有限公司 | Data digging method and node |
CN107025205B (en) * | 2016-01-30 | 2021-06-22 | 华为技术有限公司 | Method and equipment for training model in distributed system |
CN106293942A (en) * | 2016-08-10 | 2017-01-04 | 中国科学技术大学苏州研究院 | Neutral net load balance optimization method based on the many cards of multimachine and system |
CN107018184B (en) * | 2017-03-28 | 2019-08-30 | 华中科技大学 | Distributed deep neural network cluster packet synchronization optimization method and system |
CN109445953A (en) * | 2018-08-30 | 2019-03-08 | 北京大学 | A kind of machine learning model training method towards large-scale machines learning system |
- 2019-04-18 CN CN201910314058.6A patent/CN110046048B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016102738A1 (en) * | 2014-12-22 | 2016-06-30 | Nokia Technologies Oy | Similarity determination and selection of music |
CN105550374A (en) * | 2016-01-29 | 2016-05-04 | 湖南大学 | Random forest parallelization machine studying method for big data in Spark cloud service environment |
CN109271015A (en) * | 2018-10-10 | 2019-01-25 | 杭州电子科技大学 | A method of reducing large-scale distributed machine learning system energy consumption |
CN109635948A (en) * | 2018-12-19 | 2019-04-16 | 北京达佳互联信息技术有限公司 | On-line training method, apparatus, system and computer readable storage medium |
Non-Patent Citations (1)
Title |
---|
A compilation of distributed machine learning / deep learning papers (分布式机器学习/深度学习论文整理); Chenfan Blog; 《https://jcf94.com/2017/12/20/2017-12-20-distributeddl/》; 2017-12-20; pp. 1-17 *
Also Published As
Publication number | Publication date |
---|---|
CN110046048A (en) | 2019-07-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110046048B (en) | Load balancing method based on workload self-adaptive fast redistribution | |
CN106648904B (en) | Adaptive rate control method for streaming data processing | |
US9386086B2 (en) | Dynamic scaling for multi-tiered distributed systems using payoff optimization of application classes | |
US11784931B2 (en) | Network burst load evacuation method for edge servers | |
CN110990155B (en) | Parameter communication method for large-scale safety monitoring | |
CN108416465B (en) | Workflow optimization method in mobile cloud environment | |
CN103701635B (en) | Method and device for configuring Hadoop parameters on line | |
CN104038392A (en) | Method for evaluating service quality of cloud computing resources | |
US20170104623A1 (en) | Server Load Management | |
CN115934333A (en) | Historical data perception-based cloud computing resource scheduling method and system | |
CN113778691B (en) | Task migration decision method, device and system | |
CN107483292A (en) | Dynamic monitoring and controlling method for cloud platform | |
CN114490048A (en) | Task execution method and device, electronic equipment and computer storage medium | |
CN110888744B (en) | Load balancing method based on automatic adjustment and optimization of workload | |
CN109976873B (en) | Scheduling scheme obtaining method and scheduling method of containerized distributed computing framework | |
CN111585915A (en) | Long and short flow balanced transmission method and system, storage medium and cloud server | |
CN107948330A (en) | Load balancing based on dynamic priority under a kind of cloud environment | |
CN110377411B (en) | Distributed cloud-oriented workflow task scheduling method and system | |
CN110971451B (en) | NFV resource allocation method | |
CN114841341B (en) | Image processing model training and image processing method, device, equipment and medium | |
CN109002666A (en) | Emulated computation method based on DR second order algorithm and DDS-QOS | |
WO2023097661A1 (en) | Big data system resource configuration parameter tuning method based on generative adversarial network | |
CN113377544A (en) | Web cluster load balancing method based on load data dynamic update rate | |
CN115081619A (en) | Heterogeneous cluster-oriented acceleration distributed training method and system | |
CN115345306A (en) | Deep neural network scheduling method and scheduler |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||