CN110046048B - Load balancing method based on workload self-adaptive fast redistribution - Google Patents
Load balancing method based on workload self-adaptive fast redistribution
- Publication number
- CN110046048B (application CN201910314058.6A)
- Authority
- CN
- China
- Prior art keywords
- performance
- node
- model
- workload
- computing nodes
- Prior art date
- Legal status
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a load balancing method based on adaptive fast redistribution of workload (AdaptFR). The method obtains the performance parameters of each computing node through a performance monitoring tool and then redistributes the workload of each computing node according to a performance model, so that slow nodes receive less computation and fast nodes receive more. This balances the time the nodes need to complete a single iteration, which indirectly balances the cluster load and improves the performance of model training.
Description
Technical Field
The invention belongs to the technical field of distributed machine learning acceleration, and in particular relates to a load balancing method based on adaptive fast redistribution of workload.
Background
With the advent of the big data era, large-scale machine learning has become an important component of many modern application services. To cope with the complexity of big data and reduce application computing time, more and more machine learning algorithms are being implemented in a parallel, distributed fashion, and distributed machine learning has gradually become a research hotspot.
Iterative-convergence algorithms form an important subset of machine learning algorithms. Such algorithms start from a randomly generated solution and refine it by repeatedly iterating over the input data until convergence. An iterative-convergence algorithm typically partitions the input data and then performs distributed model training under the Bulk Synchronous Parallel (BSP) model, where the distributed system consists of one or more parameter servers and a set of computing nodes. Data parallelism based on a parameter server system is the parallelization scheme adopted here, and the training process is as follows (a minimal sketch is given after the list):
1) The model parameters are randomly initialized, and subsequent updates of the model parameters are maintained by the parameter server.
2) The current global model parameters are distributed to every computing node, and each node maintains a local copy of the model parameters.
3) The training-set samples are divided into sub-datasets of equal size and distributed to the computing nodes.
4) Iterative training is performed: each computing node updates its local model copy by training on its corresponding sub-dataset.
5) The system synchronously waits for all computing nodes to finish uploading their local updates, and the parameter server then distributes the new global model parameters to all computing nodes.
6) If the total number of iterations has not reached the preset maximum, training continues from step 4); otherwise, training ends.
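For concreteness, the following is a minimal runnable sketch of this BSP data-parallel loop on a toy linear model; the toy data, learning rate, and function names are illustrative and not part of the invention.

```python
# A minimal, runnable sketch of BSP data-parallel SGD on a toy linear model.
# The toy data, learning rate, and worker count are illustrative only.
import numpy as np

def local_update(params, x, y, lr=0.1):
    """One gradient step over this worker's shard; returns the parameter delta."""
    grad = x.T @ (x @ params - y) / len(y)
    return -lr * grad

def bsp_train(num_workers=4, num_iterations=100):
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4000, 5))
    y = x @ np.arange(5.0)
    params = rng.normal(size=5)                              # step 1: random init
    shards = np.array_split(np.arange(len(y)), num_workers)  # step 3: equal shards
    for _ in range(num_iterations):                          # step 6: iteration budget
        # steps 2 and 4: every worker trains on a copy of the current global params
        deltas = [local_update(params.copy(), x[s], y[s]) for s in shards]
        # step 5: synchronization barrier, aggregate all updates, then broadcast
        params = params + np.mean(deltas, axis=0)
    return params

if __name__ == "__main__":
    print(bsp_train())   # converges toward [0, 1, 2, 3, 4]
```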
The straggler (lag) problem: due to unbalanced cluster load, computing nodes with poor performance fall behind and slow down the overall training speed.
The main problem of the BSP model is this straggler problem. As the number of computing nodes grows and the time each node needs to complete a single training iteration changes dynamically, the straggler problem becomes more severe, greatly degrading the performance of model training under the BSP model.
To address the straggler problem of the BSP model, Dean et al. proposed the Asynchronous Parallel (ASP) model for distributed machine learning, in which a computing node may start the next iteration with its local model parameters before receiving the global model parameters. This avoids the stragglers caused by unbalanced cluster load and greatly reduces the time cost of model training. However, the ASP model stretches the fault tolerance of iterative algorithms without bound: the machine learning model may fall into a local optimum, convergence to the optimal solution cannot be guaranteed, and high accuracy cannot be guaranteed. Qirong Ho et al. proposed the Stale Synchronous Parallel (SSP) model, which allows each computing node to use stale (non-latest) global model parameters during iterative training, greatly reducing the waiting time of the computing nodes, while strictly limiting the number of iterations that may use stale global model parameters so as to guarantee model convergence; a minimal sketch of this staleness rule is given after this paragraph. However, the ability of the SSP model to balance cluster load is limited; when the load differences within the cluster are large it cannot balance the load well, so it does not completely solve the straggler problem. How to improve the performance of distributed machine learning model training therefore remains an urgent problem.
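As an illustration of the SSP mechanism described above (background, not the invention), the bounded-staleness rule can be sketched as follows; the staleness bound and variable names are assumptions made for illustration.

```python
# Illustrative sketch of the SSP bounded-staleness rule (background, not the invention).
def may_proceed(worker_clock: int, slowest_clock: int, staleness_bound: int) -> bool:
    """A worker may start its next iteration on possibly stale global parameters
    only while it is at most `staleness_bound` iterations ahead of the slowest
    worker; otherwise it must wait for fresher global parameters."""
    return worker_clock - slowest_clock <= staleness_bound

# Example: with a bound of 3, a worker at clock 10 may run while the slowest
# worker is at clock 7, but must block while the slowest is still at clock 6.
assert may_proceed(10, 7, 3)
assert not may_proceed(10, 6, 3)
```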
Disclosure of Invention
The technical problem the invention aims to solve is how to reduce the influence of unbalanced cluster load on distributed machine learning model training, solve the straggler problem while ensuring a certain level of accuracy, and improve the overall performance of model training.
In distributed machine learning model training based on an iterative-convergence algorithm, each computing node repeatedly enters the next training iteration until a stopping condition is reached. In traditional distributed machine learning model training, the training data each computing node processes per iteration has an equal, fixed size.
The technical scheme adopted by the invention to solve this technical problem is as follows: a performance monitoring module collects the performance parameters of the computing nodes in real time, and the workload of each computing node is then adaptively adjusted according to these performance parameters. The method is realized by the following steps:
step 1: a parameter server system is employed. One node acts as a parameter server and the other nodes act as compute nodes.
Step 2: deploy the performance monitoring module. A performance monitoring tool (Ganglia) is deployed on the parameter server and the computing nodes to collect the node performance parameters in real time.
Step 3: the parameter server obtains the performance parameters of the computing nodes through the performance monitoring module.
Step 4: according to the collected performance parameters, the parameter server uses the performance index model to calculate the workload of the next training iteration of each computing node and sends it to the computing nodes.
Step 5: each computing node enters the next training iteration with its updated workload.
In this way, slow nodes receive less computation and fast nodes receive more, so the time needed to complete a single iteration is balanced across the nodes, the cluster load is indirectly balanced, and the performance of model training is improved.
Beneficial effects of the invention: when training a distributed machine learning model, a distributed system based on the BSP model is constantly constrained by the straggler problem, so that in a real environment a large amount of system resources is wasted and the overall performance is very low. The invention dynamically and rapidly adjusts the workload of each training iteration of every computing node and effectively balances the performance differences among the cluster nodes, thereby effectively alleviating the straggler problem and greatly improving the performance of model training.
Drawings
FIG. 1 is a performance monitoring system architecture.
Fig. 2 is a Caffe-based distributed machine learning framework.
FIG. 3 is a flow chart of the training of a parallel computing model (A-DSP) based on the AdaptFR method.
FIG. 4 is a comparison of the accuracy of different computational models when cluster node performance is similar.
FIG. 5 is a comparison of the training times of different computational models when cluster node performance is similar.
FIG. 6 is a comparison of the accuracy of different computational models when cluster node performance differs widely.
FIG. 7 is a comparison of the training times of different computational models when cluster node performance differs widely.
Detailed Description
The invention is further described with reference to the accompanying drawings and specific implementation steps:
A load balancing method based on adaptive fast redistribution (AdaptFR) of workload comprises the following steps:
step 1: a parameter server system is employed. One node acts as a parameter server and the other nodes act as compute nodes.
As shown in Fig. 2, the invention implements the parameter server in a multithreaded manner: each transmission thread on the parameter server corresponds to one computing node and maintains the communication between the parameter server and that node; meanwhile, a dedicated thread on the parameter server serves as the parameter update manager and handles the updates of the global model parameters. A minimal sketch of this thread layout is given below.
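The sketch below illustrates one transmission thread per computing node plus one update-manager thread; the connection objects, queueing scheme, and method names are assumptions made for illustration, not the actual implementation.

```python
# Illustrative sketch of the multithreaded parameter server: one transmission
# thread per computing node plus one dedicated parameter-update-manager thread.
# The connection objects (recv_update/send_params) are hypothetical.
import queue
import threading

class ParameterServer:
    def __init__(self, node_conns):
        self.params = {}                       # global model parameters
        self.update_queue = queue.Queue()      # local updates awaiting application
        self.lock = threading.Lock()
        # one transmission thread per computing node
        self.tx_threads = [
            threading.Thread(target=self._serve_node, args=(c,), daemon=True)
            for c in node_conns
        ]
        # one dedicated thread handles all global parameter updates
        self.update_manager = threading.Thread(target=self._manage_updates, daemon=True)

    def _serve_node(self, conn):
        while True:
            self.update_queue.put(conn.recv_update())   # local update from the node
            with self.lock:
                conn.send_params(dict(self.params))     # current global parameters back

    def _manage_updates(self):
        while True:
            update = self.update_queue.get()            # apply updates one at a time
            with self.lock:
                for key, delta in update.items():
                    self.params[key] = self.params.get(key, 0.0) + delta

    def start(self):
        for t in self.tx_threads:
            t.start()
        self.update_manager.start()
```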
Step 2: deploy the performance monitoring module. Performance monitoring tools are deployed on the parameter server and the computing nodes to collect the node performance parameters in real time.
The invention uses the distributed monitoring tool Ganglia as the cluster monitoring system, which mainly comprises the following parts: a monitoring daemon (Gmond) that monitors the performance indexes of the computing nodes, a meta daemon (Gmetad) that collects and aggregates the data, and a web front end (Ganglia-web, Gweb) that visually displays the performance index data.
Step 3: the parameter server obtains the performance parameters of the computing nodes through the performance monitoring module.
As shown in Fig. 1, the performance monitoring system is mainly used to monitor the various performance indexes of the computing nodes. A monitoring daemon Gmond is therefore deployed on every computing node. Gmond first collects the performance index data of its own node and then sends it to the other computing nodes over the UDP protocol in External Data Representation (XDR) format. UDP offers good real-time behaviour and low resource consumption, so the data reaches the other computing nodes as quickly as possible; XDR is well suited to distributed use because it encapsulates the data in a form independent of the transmission medium, so it can be exchanged between heterogeneous nodes. In this way, every computing node holds the performance index data of the entire cluster. On the parameter server, the performance monitoring system collects the performance indexes of the computing nodes and then adjusts the workload of each node's iterative computation according to the performance index model. A meta daemon Gmetad is therefore deployed on the parameter server. Gmetad periodically polls the Gmond daemons over TCP for the collected performance index data. TCP is highly reliable, with no errors, losses, or duplicates, which guarantees that the performance index data sent to the parameter server is correct. Because every Gmond holds the complete cluster performance index data, Gmetad can pull the data from another Gmond when one of them fails, which keeps the monitoring system robust. In addition, a visual monitoring web page, Gweb, is deployed on the parameter server; Gweb provides the various performance indexes the performance monitoring system needs to judge node computing performance. A hedged sketch of how the parameter server could pull these metrics follows.
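The following sketch assumes a stock Ganglia install in which Gmetad (or a Gmond) serves its aggregated state as XML over TCP; the port number 8651 and the metric name cpu_idle are assumptions about a default Ganglia configuration, not details of the invention.

```python
# Hedged sketch: read the Ganglia XML dump over TCP and derive a rough per-host
# "CPU occupancy" as 100 - cpu_idle. Port and metric name assume default Ganglia.
import socket
import xml.etree.ElementTree as ET

def fetch_cpu_occupancy(host="127.0.0.1", port=8651, timeout=5.0):
    """Return {hostname: cpu_occupancy_percent} parsed from a Gmetad/Gmond XML dump."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)     # Ganglia writes the full XML and then closes
            if not data:
                break
            chunks.append(data)
    root = ET.fromstring(b"".join(chunks))
    occupancy = {}
    for host_el in root.iter("HOST"):
        for metric in host_el.iter("METRIC"):
            if metric.get("NAME") == "cpu_idle":
                occupancy[host_el.get("NAME")] = 100.0 - float(metric.get("VAL"))
    return occupancy
```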
Step 4: according to the collected performance parameters, the parameter server uses the performance index model to calculate the workload of the next training iteration of each computing node and sends it to the computing nodes.
Testing shows that CPU occupancy is the performance index that most strongly influences the training speed of the distributed machine learning model. The invention therefore defines the performance model with CPU occupancy as the influence factor. For a cluster with N computing nodes, the CPU occupancy of node i is denoted cpuOccup_i (i = 1, 2, …, N), and the node performance factor δ of the performance index model is

where F is a constant and AVG{cpuOccup_1, …, cpuOccup_N} denotes the average CPU occupancy of all computing nodes.
The AdaptFR method dynamically adjusts the amount of training each computing node performs per iteration according to the node performance factor δ, thereby balancing the cluster load and improving the performance of model training. When the CPU occupancy of the computing nodes differs greatly, i.e. δ is large, the current performance difference between the nodes is large; in that case the workload of the slow nodes' next iteration is reduced substantially, or the workload of the fast nodes' next iteration is increased substantially, to balance the node load. Conversely, if the CPU occupancy of the computing nodes differs little, i.e. δ is small, the current performance difference between the nodes is small; in that case the workload of the slow nodes' next iteration is reduced slightly, or the workload of the fast nodes' next iteration is increased slightly or kept unchanged, which is enough to balance the node load. A minimal sketch of such a redistribution step is given below.
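The patent's own node performance factor δ (with the constant F) is given by the formula above and is not reproduced here; the sketch below therefore uses a simple stand-in rule, weighting each node's share inversely by its CPU occupancy, purely to illustrate the qualitative behaviour that busier (slower) nodes receive less work and faster nodes more.

```python
# Illustrative stand-in for the AdaptFR redistribution step (not the patent's formula):
# split the next iteration's total workload so that busier (slower) nodes get a
# smaller share and less busy (faster) nodes get a larger one.
def redistribute(total_samples, cpu_occup):
    """Split total_samples across nodes, inversely weighted by CPU occupancy (%)."""
    avg = sum(cpu_occup.values()) / len(cpu_occup)
    # weight > 1 for nodes less busy than average, < 1 for busier nodes
    weights = {n: avg / max(occ, 1e-6) for n, occ in cpu_occup.items()}
    total_w = sum(weights.values())
    return {n: int(round(total_samples * w / total_w)) for n, w in weights.items()}

# Example: node "n3" is heavily loaded, so it receives the smallest share.
print(redistribute(6000, {"n1": 30.0, "n2": 35.0, "n3": 80.0}))
```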
Step 5: each computing node enters the next training iteration with its updated workload.
The A-DSP model adds a synchronization mechanism on top of the SSP model's flexible consistency model: in each round of iterative training, once a slow node completes its iteration it enters the synchronization barrier and a global synchronous update is performed. This mechanism prevents the situation in which, when cluster performance is similar, the synchronization condition is reached only very late, the local models become excessively stale, and the final accuracy drops, as shown in Fig. 3. A minimal sketch of this barrier trigger is given below.
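The sketch below assumes the parameter server keeps a per-node iteration counter; this bookkeeping and the class name are assumptions made for illustration, not the invention's actual implementation.

```python
# Hedged sketch of the added synchronization rule: when the slowest node finishes
# an iteration and reaches the barrier, a global synchronous update is triggered.
class SyncTracker:
    def __init__(self, num_workers):
        self.clocks = [0] * num_workers     # iterations completed per node

    def finish_iteration(self, worker_id):
        """Record a finished iteration; return True when a global sync should run,
        i.e. when the node that just finished is (one of) the slowest."""
        self.clocks[worker_id] += 1
        return self.clocks[worker_id] == min(self.clocks)

# Example: with nodes at clocks [3, 5, 6], the slowest node finishing its
# iteration triggers the global synchronous update; a fast node does not.
tracker = SyncTracker(3)
tracker.clocks = [3, 5, 6]
assert tracker.finish_iteration(0)
assert not tracker.finish_iteration(2)
```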
Figs. 4 and 5 compare the accuracy and training time of the different computing models when cluster node performance is similar. The strict consistency model gives the BSP model very high accuracy, but because of the frequent synchronization barriers and the straggler problem, its training time far exceeds that of the other computing models. The SSP model uses a flexible consistency model, which greatly reduces training time; however, when cluster performance is similar it cannot guarantee that the computing nodes update the global model parameters in time, so the accuracy drops sharply and the model is not well suited to this scenario. Under the same conditions, the A-DSP model, through its more flexible synchronization mechanism and the AdaptFR method, greatly reduces the training time while maintaining a certain level of accuracy, and adapts better to this scenario.
Figs. 6 and 7 compare the accuracy and training time of the different computing models when cluster node performance differs greatly. The BSP model still shows high accuracy together with a high training time. Under the same conditions, the accuracy of the A-DSP model is close to that of the SSP model, but its training time is much lower, so the A-DSP model suits this scenario better than the SSP model.
The implementation of the AdaptFR method on the computing nodes and on the parameter server is shown below.
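Since that implementation is given in the original as figures, the following is a hedged reconstruction of the two sides of the loop following steps 3 to 5 above; all object and method names are placeholders, not the patent's actual code.

```python
# Hedged reconstruction of the AdaptFR loop on both sides, following steps 3-5;
# the node/server/monitor objects and their method names are placeholders.
def parameter_server_loop(server, monitor, nodes, max_iters):
    for node in nodes:
        node.send_params(server.params)                      # initial global parameters
    for _ in range(max_iters):
        cpu_occup = monitor.fetch_cpu_occupancy()            # step 3: performance metrics
        workloads = server.performance_model(cpu_occup)      # step 4: per-node workloads
        for node in nodes:
            node.send_workload(workloads[node.name])         # step 4: push new workloads
        updates = [node.recv_update() for node in nodes]     # collect local updates
        server.apply_updates(updates)                        # global synchronous update
        for node in nodes:
            node.send_params(server.params)                  # broadcast new global params

def compute_node_loop(node, max_iters):
    for _ in range(max_iters):
        workload = node.recv_workload()                      # workload for this iteration
        params = node.recv_params()                          # latest global parameters
        update = node.train_one_iteration(params, workload)  # step 5: local training
        node.send_update(update)                             # upload the local update
```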
Claims (3)
1. A load balancing method based on workload self-adaptive fast redistribution is characterized by comprising the following steps:
step 1: adopting a parameter server system, wherein one node is used as a parameter server, and other nodes are used as computing nodes;
step 2: deploying a performance monitoring module, deploying performance monitoring tools on the parameter server and the computing nodes, and collecting node performance parameters in real time;
step 3: the parameter server obtains the performance parameters of the computing nodes through the performance monitoring module;
step 4: the parameter server calculates, according to the collected performance parameters and using the performance index model, the workload of the next training iteration of each computing node, and sends the workload to the computing nodes;
the performance index that most influences the training speed of the distributed machine learning model is the CPU occupancy, so the CPU occupancy is used as the influence factor to define the performance model; for a cluster with N computing nodes, the CPU occupancy of node i is denoted cpuOccup_i (i = 1, 2, …, N), and the node performance factor δ of the performance index model is

wherein F is a constant and AVG{cpuOccup_1, …, cpuOccup_N} denotes the average CPU occupancy of all the computing nodes;
when the CPU occupancy of the computing nodes differs greatly, namely δ is large, the current performance difference between the computing nodes is large, and the workload of the next iteration of the slow nodes is then reduced substantially, or the workload of the next iteration of the fast nodes is increased substantially, to balance the node load; conversely, if the CPU occupancy of the computing nodes differs little, namely δ is small, the current performance difference between the computing nodes is small, and the workload of the next iteration of the slow nodes is then reduced slightly, or the workload of the next iteration of the fast nodes is increased slightly or kept unchanged, which balances the node load;
step 5: each computing node enters the next training iteration with the updated workload;
a synchronization mechanism is added on the basis of the SSP model: in each round of iterative training, once a slow node completes its iteration it enters the synchronization barrier and a global synchronous update is performed.
2. The method of claim 1, characterized in that: the parameter server is implemented in a multithreaded manner, each transmission thread corresponds to one computing node and maintains the communication between the parameter server and that computing node; meanwhile, a dedicated thread serves as the parameter update manager and handles the updates of the global model parameters.
3. The method of claim 1, characterized in that: the performance monitoring tool comprises a monitoring daemon Gmond for monitoring the performance indexes of the computing nodes, a meta daemon Gmetad for collecting and aggregating the data, and Gweb for visually displaying the performance index data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910314058.6A CN110046048B (en) | 2019-04-18 | 2019-04-18 | Load balancing method based on workload self-adaptive fast redistribution |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910314058.6A CN110046048B (en) | 2019-04-18 | 2019-04-18 | Load balancing method based on workload self-adaptive fast redistribution |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110046048A (en) | 2019-07-23
CN110046048B (en) | 2021-09-28
Family
ID=67277861
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910314058.6A Active CN110046048B (en) | 2019-04-18 | 2019-04-18 | Load balancing method based on workload self-adaptive fast redistribution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110046048B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110784555A (en) * | 2019-11-07 | 2020-02-11 | 中电福富信息科技有限公司 | Intelligent monitoring and load scheduling method based on deep learning |
TWI729606B (en) * | 2019-12-05 | 2021-06-01 | 財團法人資訊工業策進會 | Load balancing device and method for an edge computing network |
CN111144584B (en) * | 2019-12-31 | 2024-01-19 | 深圳Tcl新技术有限公司 | Parameter tuning method and device and computer storage medium |
CN111752713B (en) * | 2020-06-28 | 2022-08-05 | 浪潮电子信息产业股份有限公司 | Method, device and equipment for balancing load of model parallel training task and storage medium |
US11609794B2 (en) | 2020-11-10 | 2023-03-21 | Oracle International Corporation | Techniques for modifying cluster computing environments |
CN113806082A (en) * | 2021-09-05 | 2021-12-17 | 济南浪潮数据技术有限公司 | Method, device and equipment for collecting node performance data and readable medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105550374A (en) * | 2016-01-29 | 2016-05-04 | 湖南大学 | Random forest parallelization machine studying method for big data in Spark cloud service environment |
WO2016102738A1 (en) * | 2014-12-22 | 2016-06-30 | Nokia Technologies Oy | Similarity determination and selection of music |
CN109271015A (en) * | 2018-10-10 | 2019-01-25 | 杭州电子科技大学 | A method of reducing large-scale distributed machine learning system energy consumption |
CN109635948A (en) * | 2018-12-19 | 2019-04-16 | 北京达佳互联信息技术有限公司 | On-line training method, apparatus, system and computer readable storage medium |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8660822B2 (en) * | 2010-02-11 | 2014-02-25 | Airbus Engineering Centre India | System and method for generating three dimensional functional space reservation systems of a vehicle |
CN105446979B (en) * | 2014-06-27 | 2019-02-01 | 华为技术有限公司 | Data digging method and node |
CN107025205B (en) * | 2016-01-30 | 2021-06-22 | 华为技术有限公司 | Method and equipment for training model in distributed system |
CN106293942A (en) * | 2016-08-10 | 2017-01-04 | 中国科学技术大学苏州研究院 | Neutral net load balance optimization method based on the many cards of multimachine and system |
CN107018184B (en) * | 2017-03-28 | 2019-08-30 | 华中科技大学 | Distributed deep neural network cluster packet synchronization optimization method and system |
CN109445953A (en) * | 2018-08-30 | 2019-03-08 | 北京大学 | A kind of machine learning model training method towards large-scale machines learning system |
- 2019-04-18 CN CN201910314058.6A patent/CN110046048B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016102738A1 (en) * | 2014-12-22 | 2016-06-30 | Nokia Technologies Oy | Similarity determination and selection of music |
CN105550374A (en) * | 2016-01-29 | 2016-05-04 | 湖南大学 | Random forest parallelization machine studying method for big data in Spark cloud service environment |
CN109271015A (en) * | 2018-10-10 | 2019-01-25 | 杭州电子科技大学 | A method of reducing large-scale distributed machine learning system energy consumption |
CN109635948A (en) * | 2018-12-19 | 2019-04-16 | 北京达佳互联信息技术有限公司 | On-line training method, apparatus, system and computer readable storage medium |
Non-Patent Citations (1)
Title |
---|
A compilation of distributed machine learning / deep learning papers (分布式机器学习/深度学习论文整理); Chenfan Blog; 《https://jcf94.com/2017/12/20/2017-12-20-distributeddl/》; 2017-12-20; pp. 1-17 *
Also Published As
Publication number | Publication date |
---|---|
CN110046048A (en) | 2019-07-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110046048B (en) | Load balancing method based on workload self-adaptive fast redistribution | |
CN106648904B (en) | Adaptive rate control method for streaming data processing | |
US9386086B2 (en) | Dynamic scaling for multi-tiered distributed systems using payoff optimization of application classes | |
US11784931B2 (en) | Network burst load evacuation method for edge servers | |
CN110990155B (en) | Parameter communication method for large-scale safety monitoring | |
CN108416465B (en) | Workflow optimization method in mobile cloud environment | |
CN103701635B (en) | Method and device for configuring Hadoop parameters on line | |
CN104038392A (en) | Method for evaluating service quality of cloud computing resources | |
US20170104623A1 (en) | Server Load Management | |
CN115934333A (en) | Historical data perception-based cloud computing resource scheduling method and system | |
CN113778691B (en) | Task migration decision method, device and system | |
CN107483292A (en) | Dynamic monitoring and controlling method for cloud platform | |
CN114490048A (en) | Task execution method and device, electronic equipment and computer storage medium | |
CN110888744B (en) | Load balancing method based on automatic adjustment and optimization of workload | |
CN109976873B (en) | Scheduling scheme obtaining method and scheduling method of containerized distributed computing framework | |
CN111585915A (en) | Long and short flow balanced transmission method and system, storage medium and cloud server | |
CN107948330A (en) | Load balancing based on dynamic priority under a kind of cloud environment | |
CN110377411B (en) | Distributed cloud-oriented workflow task scheduling method and system | |
CN110971451B (en) | NFV resource allocation method | |
CN114841341B (en) | Image processing model training and image processing method, device, equipment and medium | |
CN109002666A (en) | Emulated computation method based on DR second order algorithm and DDS-QOS | |
WO2023097661A1 (en) | Big data system resource configuration parameter tuning method based on generative adversarial network | |
CN113377544A (en) | Web cluster load balancing method based on load data dynamic update rate | |
CN115081619A (en) | Heterogeneous cluster-oriented acceleration distributed training method and system | |
CN115345306A (en) | Deep neural network scheduling method and scheduler |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||