
CN110046048B - Load balancing method based on workload self-adaptive fast redistribution - Google Patents

Load balancing method based on workload self-adaptive fast redistribution

Info

Publication number
CN110046048B
Authority
CN
China
Prior art keywords
performance
node
model
workload
computing nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910314058.6A
Other languages
Chinese (zh)
Other versions
CN110046048A (en)
Inventor
张纪林
李明伟
万健
任永坚
魏振国
张俊聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN201910314058.6A
Publication of CN110046048A
Application granted
Publication of CN110046048B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a load balancing method based on adaptive fast redistribution of workload (AdaptFR). The method obtains the performance parameters of each computing node through a performance monitoring tool and then redistributes the workload of each computing node according to a performance model, so that slow nodes receive less computation and fast nodes receive more. The time each node needs to complete a single iteration is thereby balanced, which indirectly balances the cluster load and improves model training performance.

Description

Load balancing method based on workload self-adaptive fast redistribution
Technical Field
The invention belongs to the technical field of distributed machine learning acceleration, and particularly relates to a load balancing method based on adaptive fast redistribution of workload.
Background
With the advent of the big data era, large-scale machine learning has become an important component of many modern application services. To cope with the complexity of big data and reduce application computing time, more and more machine learning algorithms are moving to parallel, distributed implementations, and distributed machine learning has gradually become a research hotspot.
Iterative-convergent algorithms are an important subset of machine learning algorithms. Such algorithms start from a randomly generated solution and refine it by repeatedly iterating over the input data until convergence. An iterative-convergent algorithm typically shards the input data and then performs distributed model training under the Bulk Synchronous Parallel (BSP) model, where the distributed system consists of several parameter servers and computing nodes. Data parallelism based on a parameter server system is the parallelization scheme used here; a sketch of the training loop is given after the numbered steps below. The training process is as follows:
1) The model parameters are randomly initialized, and subsequent updates of the model parameters are maintained by the parameter server.
2) The current global model parameters are distributed to each computing node, and each node maintains a local copy of the model parameters.
3) The training set is divided into equally sized sub-datasets, which are distributed to the computing nodes.
4) Iterative training is performed: each computing node locally updates its model copy by training on its corresponding sub-dataset.
5) The system synchronously waits for all computing nodes to finish uploading their local updates, after which the parameter server distributes the new global model parameters to all computing nodes.
6) If the total number of iterations has not reached the preset maximum, training continues from step 4); otherwise, training ends.
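The following is a minimal single-process sketch of this BSP data-parallel loop; the function names and the least-squares objective are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def compute_gradient(params, shard):
    # Placeholder local training step: least-squares gradient on (x, y) rows.
    x, y = shard[:, :-1], shard[:, -1]
    return x.T @ (x @ params - y) / len(shard)

def bsp_train(num_nodes, data, max_iters, model_dim, learning_rate=0.1):
    """Minimal BSP data-parallel training sketch (names are illustrative)."""
    # 1) Parameter server randomly initializes the global model parameters.
    global_params = np.random.randn(model_dim)
    # 3) Split the training set into equally sized sub-datasets, one per node.
    shards = np.array_split(data, num_nodes)

    for it in range(max_iters):                          # 6) stop at the preset maximum
        updates = []
        for shard in shards:                             # each computing node, in turn
            local_params = global_params.copy()          # 2) local copy of the model
            grad = compute_gradient(local_params, shard)  # 4) local training step
            updates.append(-learning_rate * grad)
        # 5) Synchronization barrier: wait for all nodes, then apply all updates.
        global_params += np.mean(updates, axis=0)
    return global_params
```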
The straggler (lag) problem: when the cluster load is unbalanced, computing nodes with poor performance slow down the overall speed of training.
The main problem with the BSP model is this straggler problem. It becomes more serious as the number of computing nodes grows and as the time each node needs to complete a single training iteration varies dynamically, so model training performance under the BSP model degrades substantially.
To address the straggler problem of the BSP model, Dean et al. proposed the Asynchronous Parallel (ASP) model for distributed machine learning, in which a computing node may start the next iteration with its local model parameters before receiving the global model parameters; this avoids the stragglers caused by unbalanced cluster load and greatly reduces the time cost of model training. However, the ASP model exploits the error tolerance of iterative algorithms without bound, which may cause the machine learning model to fall into a local optimum; it cannot guarantee that the model finally converges to the optimal solution or that accuracy stays high. Qirong Ho et al. proposed the Stale Synchronous Parallel (SSP) model, which allows each computing node to use stale (non-latest) global model parameters during iterative training, greatly reducing the waiting time of computing nodes, while strictly bounding the number of iterations for which stale parameters may be used so that convergence is guaranteed. However, the SSP model's ability to balance cluster load is limited: when the load differences in the cluster are large, it cannot balance them well, so it does not completely solve the straggler problem. How to improve the performance of distributed machine learning model training therefore remains an urgent problem.
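For reference, a minimal sketch of the SSP staleness rule described above is shown below; the function name and threshold variable are illustrative assumptions, not taken from the SSP paper or from the patent.

```python
def ssp_can_proceed(node_iter, slowest_iter, staleness_bound):
    """SSP rule sketch: a node may run ahead of the slowest node by at most
    `staleness_bound` iterations; otherwise it must wait for fresher parameters."""
    return node_iter - slowest_iter <= staleness_bound

# Example: with a staleness bound of 3, a node at iteration 10 may proceed
# while the slowest node is at iteration 7, but must wait if it is at 6.
assert ssp_can_proceed(10, 7, 3) is True
assert ssp_can_proceed(10, 6, 3) is False
```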
Disclosure of Invention
The technical problem the invention aims to solve is how to reduce the impact of unbalanced cluster performance and load on distributed machine learning model training, how to solve the straggler problem while maintaining a certain level of accuracy, and how to improve the overall performance of model training.
In distributed machine learning model training based on an iterative-convergent algorithm, each computing node repeatedly enters the next training iteration until a stopping condition is reached. In traditional distributed machine learning model training, the training data set used by each computing node in each iteration has the same fixed size.
The technical scheme adopted by the invention to solve this problem is as follows: a performance monitoring module collects the performance parameters of the computing nodes in real time, and the workload of each computing node is then adaptively adjusted according to these performance parameters. The method is realized by the following steps:
Step 1: a parameter server system is employed. One node acts as the parameter server and the other nodes act as computing nodes.
Step 2: deploy the performance monitoring module. A performance monitoring tool (Ganglia) is deployed on the parameter server and the computing nodes to collect node performance parameters in real time.
Step 3: the parameter server obtains the performance parameters of the computing nodes through the performance monitoring module.
Step 4: based on the collected performance parameters, the parameter server computes the workload for the next training iteration of each computing node using the performance index model, and sends the workload to the computing nodes.
Step 5: each computing node enters the next training iteration with its updated workload.
In this way, slow nodes receive less computation and fast nodes receive more, so the time each node needs to complete a single iteration is balanced; this indirectly balances the cluster load and improves model training performance.
The beneficial effects of the invention are as follows: when training a distributed machine learning model, a distributed system based on the BSP model is constrained by the straggler problem, which wastes a large amount of system resources in a real environment and leads to very low overall performance. The invention dynamically and rapidly adjusts the workload of each training iteration of each computing node and effectively compensates for the performance differences between cluster nodes, thereby alleviating the straggler problem and greatly improving model training performance.
Drawings
FIG. 1 is a performance monitoring system architecture.
Fig. 2 is a Caffe-based distributed machine learning framework.
FIG. 3 is a flow chart of the training of a parallel computing model (A-DSP) based on the AdaptFR method.
FIG. 4 is a comparison of the accuracy of different computational models when cluster node performance is similar.
FIG. 5 is a comparison of the training times of different computational models when cluster node performance is similar.
FIG. 6 is a comparison of the accuracy of different computational models when cluster node performance differs widely.
FIG. 7 is a comparison of the training times of different computational models when cluster node performance differs widely.
Detailed Description
The invention is further described with reference to the accompanying drawings and specific implementation steps:
A load balancing method based on adaptive fast redistribution of workload (AdaptFR) comprises the following steps:
step 1: a parameter server system is employed. One node acts as a parameter server and the other nodes act as compute nodes.
As shown in fig. 2, the invention implements the parameter server with multiple threads: each transmission thread on the parameter server corresponds to one computing node and maintains the communication between the parameter server and that node; in addition, a dedicated thread on the parameter server acts as the parameter update manager and handles updates to the global model parameters. A sketch of this threading structure is given below.
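The following is a minimal sketch of this structure, assuming Python threads, an in-memory queue, and a simple additive update rule; the class and method names are illustrative, and the network I/O is replaced by placeholder stubs.

```python
import queue
import threading
import numpy as np

class ParameterServer:
    """Sketch: one transmission thread per computing node plus an update manager."""
    def __init__(self, num_nodes, model_dim):
        self.global_params = np.zeros(model_dim)
        self.update_queue = queue.Queue()          # local updates awaiting application
        self.manager = threading.Thread(target=self._update_manager, daemon=True)
        self.transmitters = [
            threading.Thread(target=self._serve_node, args=(i,), daemon=True)
            for i in range(num_nodes)
        ]

    def _serve_node(self, node_id):
        # Transmission thread: maintains communication with one computing node,
        # receives its local update, and hands it to the update manager.
        while True:
            local_update = self._recv_update(node_id)
            self.update_queue.put(local_update)
            self._send_params(node_id, self.global_params)

    def _update_manager(self):
        # Dedicated thread that applies every received local update to the global model.
        while True:
            update = self.update_queue.get()
            self.global_params += update

    def _recv_update(self, node_id):
        # Placeholder: in the real system this blocks on the node's connection.
        return np.zeros_like(self.global_params)

    def _send_params(self, node_id, params):
        # Placeholder: in the real system this pushes the parameters back to the node.
        pass

    def start(self):
        self.manager.start()
        for t in self.transmitters:
            t.start()
```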
Step 2: deploy the performance monitoring module. Performance monitoring tools are deployed on the parameter server and the computing nodes to collect node performance parameters in real time.
The invention uses the distributed monitoring tool Ganglia as the cluster monitoring system, which mainly comprises: a Monitoring Daemon (Gmond) that monitors the performance indexes of the computing nodes, a Meta Daemon (Gmetad) that collects and aggregates the monitoring data, and a web front end (Ganglia-web, Gweb) that visually displays the performance index data.
Step 3: the parameter server obtains the performance parameters of the computing nodes through the performance monitoring module.
As shown in fig. 1, the performance monitoring system is mainly used to monitor the various performance indexes of the computing nodes. A monitoring daemon, Gmond, is therefore deployed on each computing node; Gmond first collects the performance index data of its own node and then sends it to the other computing nodes over UDP in External Data Representation (XDR) format. UDP offers good real-time behavior and low resource usage, so the data reaches the other computing nodes as quickly as possible, while XDR encapsulates the data in a form independent of the transmission medium, allowing it to be exchanged between heterogeneous nodes. In this way, every computing node holds the performance index data of the entire cluster. The performance monitoring system is mainly used to collect the performance indexes of the computing nodes on the parameter server, which then adjusts the workload of each computing node's iterative computation according to the performance index model. A meta daemon, Gmetad, is therefore deployed on the parameter server. Gmetad periodically polls Gmond over TCP for the collected performance index data. TCP is reliable, with no corruption, loss, or duplication, so the performance index data sent to the parameter server is guaranteed to be error-free. Because every Gmond holds the complete cluster performance index data, Gmetad can pull the data from another Gmond whenever one Gmond fails to respond, which keeps the monitoring system robust. In addition, a web page for visualizing the monitoring data, Gweb, is deployed on the parameter server; Gweb provides the performance monitoring system with the various performance indexes needed to judge node computing performance. A sketch of pulling these metrics from a Gmond endpoint is shown below.
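As a minimal sketch of one way to consume this data (a deployment assumption, not part of the patent): Gmond exposes its collected metrics as an XML dump on a TCP port, 8649 by default, so the parameter-server side can read per-host CPU occupancy roughly as follows.

```python
import socket
import xml.etree.ElementTree as ET

def pull_cpu_occupancy(gmond_host, gmond_port=8649):
    """Read Gmond's XML dump over TCP and return {host: CPU occupancy in %}.
    Port 8649 is Ganglia's default tcp_accept_channel; adjust if configured
    differently. This is an illustrative sketch, not the patent's code."""
    chunks = []
    with socket.create_connection((gmond_host, gmond_port), timeout=5) as sock:
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    root = ET.fromstring(b"".join(chunks))
    occupancy = {}
    for host in root.iter("HOST"):
        idle = None
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == "cpu_idle":
                idle = float(metric.get("VAL"))
        if idle is not None:
            occupancy[host.get("NAME")] = 100.0 - idle   # busy % = 100 - idle %
    return occupancy
```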
Step 4: based on the collected performance parameters, the parameter server computes the workload for each computing node's next training iteration using the performance index model and sends it to the computing nodes.
According to tests, CPU occupancy is the performance index with the greatest influence on distributed machine learning training speed. The invention therefore defines the performance model with CPU occupancy as the influence factor. For a cluster with N computing nodes, the CPU occupancy of node i is denoted cpuOccup_i (i = 1, 2, …, N), and the node performance factor δ of the performance index model is
Figure BDA0002032501610000041
where F is a constant and AVG{cpuOccup_1, …, cpuOccup_N} denotes the average CPU occupancy of all computing nodes.
The AdaptFR method dynamically adjusts the per-iteration training workload of each computing node based on the differences in the node performance factor δ, thereby balancing the cluster load and improving model training performance. When the CPU occupancies of the computing nodes differ greatly, i.e., δ is large, the current performance differences between the nodes are large; the workload of a slow node's next iteration is then reduced substantially, or the workload of a fast node's next iteration is increased substantially, to balance the node load. Conversely, when the CPU occupancies of the computing nodes differ little, i.e., δ is small, the performance differences are small; the workload of a slow node's next iteration is then reduced slightly, or the workload of a fast node's next iteration is increased slightly or left unchanged, which is enough to balance the node load. An illustrative sketch of such a reallocation rule is given below.
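Because the exact formula for δ appears only as an image in the original, the sketch below assumes, purely for illustration, a dispersion measure over CPU occupancy and a reallocation rule that shares a fixed total workload in proportion to spare capacity. It captures the spirit of AdaptFR (slow nodes get less, fast nodes get more), not the patent's exact rule.

```python
def reallocate_workload(cpu_occup, total_batches):
    """Illustrative AdaptFR-style reallocation. The performance factor delta and
    the sharing rule below are assumptions for illustration only; the patent's
    exact formula is given as an image (Figure BDA0002032501610000041)."""
    avg = sum(cpu_occup) / len(cpu_occup)
    # Assumed stand-in for the node performance factor: spread of CPU occupancy
    # around the cluster average (large delta = large performance differences).
    delta = max(abs(occ - avg) for occ in cpu_occup)
    # Spare capacity as a share weight: less-occupied (faster) nodes get more work.
    weights = [max(100.0 - occ, 1e-6) for occ in cpu_occup]
    total_w = sum(weights)
    workloads = [round(total_batches * w / total_w) for w in weights]
    return delta, workloads

# Example: node 3 is heavily loaded (90% CPU), so it gets the smallest workload.
delta, workloads = reallocate_workload([40.0, 55.0, 90.0], total_batches=300)
print(delta)      # ~28.3: large spread, so the reallocation is aggressive
print(workloads)  # [157, 117, 26]: the slow node gets far fewer batches
```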
Step 5: each computing node enters the next training iteration with its updated workload.
The A-DSP model adds a synchronization mechanism on top of the SSP model's relaxed consistency model: in each round of iterative training, a slow node is allowed to enter the synchronization barrier after completing a single iteration, after which a global synchronous update is performed. When cluster performance is similar, this mechanism prevents the A-DSP model's local models from becoming excessively stale because the synchronization condition is reached too late, and thus prevents the final accuracy from dropping, as shown in fig. 3.
Fig. 4 and fig. 5 compare the accuracy and training time of the different computational models when cluster performance is similar. The strict consistency of the BSP model gives it very high model accuracy, but because of the frequent synchronization barriers and the straggler problem its training time far exceeds that of the other computational models. The SSP model uses a relaxed consistency model, which greatly reduces training time; however, when cluster performance is similar it cannot guarantee that the computing nodes update the global model parameters in time, so its accuracy drops considerably and it is not well suited to this scenario. Under the same conditions, the A-DSP model, through its more flexible synchronization mechanism and the AdaptFR method, greatly reduces training time while maintaining accuracy, and adapts to this scenario better.
Fig. 6 and fig. 7 compare the accuracy and training time of the different computational models when cluster performance differs widely. The BSP model still shows high accuracy and high training time. Under the same conditions, the accuracy of the A-DSP model is close to that of the SSP model, but its training time is much shorter, so the A-DSP model suits this scenario better than the SSP model.
The implementation of the AdaptFR method on the computing node and on the parameter server, respectively, is shown below.
Figure BDA0002032501610000051
Figure BDA0002032501610000052
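The original listings above are provided only as figure images. As a condensed stand-in, the sketch below shows how the compute-node side and the parameter-server side plausibly interact in each iteration, assembled from steps 1-5 and reusing reallocate_workload from the earlier sketch; the helper names are assumptions, not the patent's verbatim algorithms.

```python
import random
import numpy as np

def sample_batch(dataset, size):
    # Assumed helper: draw `size` samples for one iteration of local training.
    size = max(1, min(size, len(dataset)))
    return np.array(random.sample(list(dataset), size))

def local_gradient(params, batch):
    # Assumed helper: least-squares gradient on (x, y) rows, as in the BSP sketch.
    x, y = batch[:, :-1], batch[:, -1]
    return x.T @ (x @ params - y) / len(batch)

def adaptfr_training(node_occupancy_fn, dataset, nodes, max_iters, model_dim, lr=0.1):
    """Condensed AdaptFR sketch (not the patent's verbatim listings): compute nodes
    train on their assigned workloads; after every global synchronous update the
    parameter server re-reads CPU occupancy and reallocates the next workloads."""
    global_params = np.random.randn(model_dim)
    workloads = {n: len(dataset) // len(nodes) for n in nodes}   # equal at first
    for _ in range(max_iters):
        # Compute-node side: train on the assigned workload and send the update.
        updates = [local_gradient(global_params, sample_batch(dataset, workloads[n]))
                   for n in nodes]
        # Parameter-server side: global synchronous update of the model.
        global_params -= lr * np.mean(updates, axis=0)
        # Parameter-server side: monitor performance and reallocate workloads.
        cpu_occup = [node_occupancy_fn(n) for n in nodes]        # e.g. via Gmetad
        _, new_loads = reallocate_workload(cpu_occup, sum(workloads.values()))
        workloads = dict(zip(nodes, new_loads))
    return global_params
```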

Claims (3)

1. A load balancing method based on workload self-adaptive fast redistribution is characterized by comprising the following steps:
step 1: adopting a parameter server system, wherein one node is used as a parameter server, and other nodes are used as computing nodes;
step 2: deploying a performance monitoring module, deploying performance monitoring tools on the parameter server and the computing nodes, and collecting node performance parameters in real time;
step 3: the parameter server acquires the performance parameters of the computing nodes through the performance monitoring module;
step 4: the parameter server calculates the workload of the next training iteration of each computing node by using the performance index model according to the collected performance parameters, and sends the workload to the computing nodes;
the performance index that most influences the training speed of the distributed machine learning model is the CPU occupancy, so the CPU occupancy is used as the influence factor to define the performance model; for a cluster with N computing nodes, the CPU occupancy of node i is denoted cpuOccup_i (i = 1, 2, …, N), and the node performance factor of the performance index model is
Figure FDA0003148179610000011
wherein F is a constant and AVG{cpuOccup_1, …, cpuOccup_N} denotes the average CPU occupancy of all computing nodes;
when the CPU occupancies of the computing nodes differ greatly, i.e., δ is large, the current performance differences between the nodes are large, and the workload of a slow node's next iteration is reduced substantially or the workload of a fast node's next iteration is increased substantially to balance the node load; conversely, when the CPU occupancies of the computing nodes differ little, i.e., δ is small, the current performance differences between the nodes are small, and the workload of a slow node's next iteration is reduced slightly, or the workload of a fast node's next iteration is increased slightly or left unchanged, which balances the node load;
step 5: the computing node enters the next training iteration with its updated workload;
on the basis of the SSP model, a synchronization mechanism is added: in each round of iterative training, a slow node enters the synchronization barrier after completing a single iteration, after which a global synchronous update is performed.
2. The method of claim 1, wherein: the parameter server is implemented with multiple threads, each transmission thread corresponding to one computing node and maintaining the communication between the parameter server and that computing node; in addition, a dedicated thread acts as the parameter update manager and handles updates to the global model parameters.
3. The method of claim 1, wherein: the performance monitoring tool comprises a monitoring daemon Gmond that monitors the performance indexes of the computing nodes, a meta daemon Gmetad that collects and aggregates the monitoring data, and a Gweb front end that visually displays the performance index data.
CN201910314058.6A 2019-04-18 2019-04-18 Load balancing method based on workload self-adaptive fast redistribution Active CN110046048B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910314058.6A CN110046048B (en) 2019-04-18 2019-04-18 Load balancing method based on workload self-adaptive fast redistribution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910314058.6A CN110046048B (en) 2019-04-18 2019-04-18 Load balancing method based on workload self-adaptive fast redistribution

Publications (2)

Publication Number Publication Date
CN110046048A CN110046048A (en) 2019-07-23
CN110046048B true CN110046048B (en) 2021-09-28

Family

ID=67277861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910314058.6A Active CN110046048B (en) 2019-04-18 2019-04-18 Load balancing method based on workload self-adaptive fast redistribution

Country Status (1)

Country Link
CN (1) CN110046048B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110784555A (en) * 2019-11-07 2020-02-11 中电福富信息科技有限公司 Intelligent monitoring and load scheduling method based on deep learning
TWI729606B (en) * 2019-12-05 2021-06-01 財團法人資訊工業策進會 Load balancing device and method for an edge computing network
CN111144584B (en) * 2019-12-31 2024-01-19 深圳Tcl新技术有限公司 Parameter tuning method and device and computer storage medium
CN111752713B (en) * 2020-06-28 2022-08-05 浪潮电子信息产业股份有限公司 Method, device and equipment for balancing load of model parallel training task and storage medium
US11609794B2 (en) 2020-11-10 2023-03-21 Oracle International Corporation Techniques for modifying cluster computing environments
CN113806082A (en) * 2021-09-05 2021-12-17 济南浪潮数据技术有限公司 Method, device and equipment for collecting node performance data and readable medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550374A (en) * 2016-01-29 2016-05-04 湖南大学 Random forest parallelization machine studying method for big data in Spark cloud service environment
WO2016102738A1 (en) * 2014-12-22 2016-06-30 Nokia Technologies Oy Similarity determination and selection of music
CN109271015A (en) * 2018-10-10 2019-01-25 杭州电子科技大学 A method of reducing large-scale distributed machine learning system energy consumption
CN109635948A (en) * 2018-12-19 2019-04-16 北京达佳互联信息技术有限公司 On-line training method, apparatus, system and computer readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8660822B2 (en) * 2010-02-11 2014-02-25 Airbus Engineering Centre India System and method for generating three dimensional functional space reservation systems of a vehicle
CN105446979B (en) * 2014-06-27 2019-02-01 华为技术有限公司 Data digging method and node
CN107025205B (en) * 2016-01-30 2021-06-22 华为技术有限公司 Method and equipment for training model in distributed system
CN106293942A (en) * 2016-08-10 2017-01-04 中国科学技术大学苏州研究院 Neutral net load balance optimization method based on the many cards of multimachine and system
CN107018184B (en) * 2017-03-28 2019-08-30 华中科技大学 Distributed deep neural network cluster packet synchronization optimization method and system
CN109445953A (en) * 2018-08-30 2019-03-08 北京大学 A kind of machine learning model training method towards large-scale machines learning system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016102738A1 (en) * 2014-12-22 2016-06-30 Nokia Technologies Oy Similarity determination and selection of music
CN105550374A (en) * 2016-01-29 2016-05-04 湖南大学 Random forest parallelization machine studying method for big data in Spark cloud service environment
CN109271015A (en) * 2018-10-10 2019-01-25 杭州电子科技大学 A method of reducing large-scale distributed machine learning system energy consumption
CN109635948A (en) * 2018-12-19 2019-04-16 北京达佳互联信息技术有限公司 On-line training method, apparatus, system and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Distributed Machine Learning / Deep Learning Paper Compilation (分布式机器学习/深度学习论文整理); Chenfan Blog; https://jcf94.com/2017/12/20/2017-12-20-distributeddl/; 2017-12-20; pp. 1-17 *

Also Published As

Publication number Publication date
CN110046048A (en) 2019-07-23

Similar Documents

Publication Publication Date Title
CN110046048B (en) Load balancing method based on workload self-adaptive fast redistribution
CN106648904B (en) Adaptive rate control method for streaming data processing
US9386086B2 (en) Dynamic scaling for multi-tiered distributed systems using payoff optimization of application classes
US11784931B2 (en) Network burst load evacuation method for edge servers
CN110990155B (en) Parameter communication method for large-scale safety monitoring
CN108416465B (en) Workflow optimization method in mobile cloud environment
CN103701635B (en) Method and device for configuring Hadoop parameters on line
CN104038392A (en) Method for evaluating service quality of cloud computing resources
US20170104623A1 (en) Server Load Management
CN115934333A (en) Historical data perception-based cloud computing resource scheduling method and system
CN113778691B (en) Task migration decision method, device and system
CN107483292A (en) Dynamic monitoring and controlling method for cloud platform
CN114490048A (en) Task execution method and device, electronic equipment and computer storage medium
CN110888744B (en) Load balancing method based on automatic adjustment and optimization of workload
CN109976873B (en) Scheduling scheme obtaining method and scheduling method of containerized distributed computing framework
CN111585915A (en) Long and short flow balanced transmission method and system, storage medium and cloud server
CN107948330A (en) Load balancing based on dynamic priority under a kind of cloud environment
CN110377411B (en) Distributed cloud-oriented workflow task scheduling method and system
CN110971451B (en) NFV resource allocation method
CN114841341B (en) Image processing model training and image processing method, device, equipment and medium
CN109002666A (en) Emulated computation method based on DR second order algorithm and DDS-QOS
WO2023097661A1 (en) Big data system resource configuration parameter tuning method based on generative adversarial network
CN113377544A (en) Web cluster load balancing method based on load data dynamic update rate
CN115081619A (en) Heterogeneous cluster-oriented acceleration distributed training method and system
CN115345306A (en) Deep neural network scheduling method and scheduler

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant