Open
Description
Graph partition automatically moves CPU scalar tensors to the GPU when possible (#154464). It would be better to place the scalar in pinned memory and copy it with `non_blocking=True`. This depends on #155121. More context in this issue.
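For illustration, here is a minimal eager-mode sketch of the pinned-memory plus non-blocking copy pattern proposed above. It is not how graph partition implements the move. The helper name `scalar_to_device` is hypothetical, and the code falls back to a plain copy when CUDA is unavailable.

```python
import torch


def scalar_to_device(value: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Copy a CPU scalar tensor to `device` (hypothetical helper, not a PyTorch API)."""
    if device.type == "cuda" and torch.cuda.is_available():
        # Page-locked (pinned) host memory lets the host-to-device copy run
        # asynchronously with respect to the host.
        pinned = value.pin_memory()
        return pinned.to(device, non_blocking=True)
    # No CUDA available: fall back to an ordinary copy.
    return value.to(device)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cpu_scalar = torch.tensor(3.0)
gpu_scalar = scalar_to_device(cpu_scalar, device)

if device.type == "cuda":
    # Wait for the asynchronous copy before reading the value back on the host.
    torch.cuda.current_stream().synchronize()

result = gpu_scalar.item()
```

The copy can only overlap with host work when the source tensor is pinned. With pageable memory, `non_blocking=True` is accepted but the transfer may not be asynchronous.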
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov