-
Notifications
You must be signed in to change notification settings - Fork 9.1k
YARN-7327: Enable asynchronous scheduling by default for capacity scheduler #7138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of ser 8000 vice and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Submitted a first cut PR to trigger unit tests for Yarn capacity scheduler with async scheduling. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
@slfan1989 @aajisaka @TaoYang526 Could you please review the changes ? Note: #7129 is a blocker for this change to now in. |
@shameersss1 Thanks for your contribution. |
@slfan1989 - Gentle reminder for the review |
// disable async-scheduling for simulating complex scene | ||
Configuration disableAsyncConf = new Configuration(conf); | ||
disableAsyncConf.setBoolean( | ||
CapacitySchedulerConfiguration.SCHEDULE_ASYNCHRONOUSLY_ENABLE, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why disable async scheduler in UT? And enable by default? I think we'd better enable async schedule by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following are the reasons why async scheduling is disabled in UT :
-
unlike async scheduling, sync scheduling gives more flexibility while writing UTs to mimic complex scenarios since the UT writer will have full control over when to schedule, where as with async scheduling this cannot be done.
-
There were 200+ failing when async scheduling was enabled (due to scheduling assertion)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok,get it.
@Hexiaoqiao could you please review the changes ? |
@shameersss1 Thank you very much for submitting the code! However, I am not very familiar with the Capacity Scheduler. To ensure the quality of the code, I will ask a member who is more familiar with the relevant code to help with the review. |
@brumi1024 Could you please help review this PR? Thank you very much! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shameersss1 thanks for the contribution. While the documentation updates are very welcome, what is the reason behind changing the default of this setting?
Nit: there is a typo in the title, the JIRA is actually YARN-7327
@brumi1024 - Thanks for looking into this.
And hence the proposal here is to change the default scheduling stratergy for capacity scheduler from synchronous to asynchronous. Already companies like Alibaba cloud use this in their production https://www.alibabacloud.com/help/en/emr/emr-on-ecs/user-guide/yarn-schedulers @brumi1024 - Do you think is there any blocker/issue in enabling it by default ? |
@brumi1024 - Please let me know if there is any more concerns. Now that the dependent YARN issue is also merged, Can we take it forward ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No @shameersss1, I'm fine with the change. Thanks for the contribution, if no objections I'll merge in a few hours.
@shameersss1 Thank you for your contribution. From my perspective, +1. A big thanks to @brumi1024 for helping review the code. |
Thanks @shameersss1 for the change, @TaoYang526 @zeekling @slfan1989 for the reviews, merged to trunk. |
Description of PR
Enable asynchronous scheduling by default for capacity scheduler.
This commit enables asynchronous scheduling by changing the config default value to true. Additionally the commit also adds documentation changes for different scheduling strategies.
The current default scheduling mechanism is synchronous (node-heart driven) which is not efficient when there are large number of containers to be allocated.
It also has additional issues like scheduling won't happen if there is node-heartbeat loss due to transient network issue and slow start (although there are pending container request , Scheduling is triggered only by node heart beat).
@wangdatan did an amazing job of making the async scheudling production ready : Refer https://issues.apache.org/jira/browse/YARN-7327?focusedCommentId=16205259&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16205259 for benchmark details.
The above benchmark shows async scheudling throughput is better than sync scheduling
And hence the proposal here is to change the default scheduling stratergy for capacity scheduler from synchronous to asynchronous. Already companies like Alibaba cloud use this in their production https://www.alibabacloud.com/help/en/emr/emr-on-ecs/user-guide/yarn-schedulers
Note: Although asynchronous scheduling is enabled by default, for unit testing disabling it by default to give more control over container scheduling while simulating complex tests (i.e synchronous scheduling gives more control over when to schedule the container for simulating certain cases while writing unit test).
How was this patch tested?
Existing Unit Tests
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?