PR types
Performance optimization
PR changes
Others
Description
Adds fast_rms_norm support, built on top of fast_ln.
Performance impact:
The rms_norm operator runs about twice as fast. Model throughput is as follows:
Accuracy impact:


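The results of fast_ln are unchanged by this PR:
The check printed the md5sum of this operator's forward and backward outputs; the values are identical before and after the change, as shown below:
Results before this PR:
fast_rms_norm and fused_rms_norm cannot be made to match bit for bit, but this does not affect convergence. Convergence was verified through TE, which uses fast_rms_norm; with bf16 precision, turning TE on or off is known not to affect convergence.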


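Detailed accuracy test results:
The forward and backward md5sum values do not match, so the tensors are not bitwise identical, but the diff shows that the two sides are nearly equal. For an output tensor of shape=[10, 4096], print(paddle.nonzero(output1 - output2)) shows that 462 elements differ, about 1.1% of the total, with differences at the 1e-4 level. The same holds for the backward pass.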
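The two checks described above (md5sum of the outputs and a count of differing elements) can be sketched as follows. This is a minimal illustration, not the script used in this PR: it assumes the tensors have already been converted to NumPy arrays (for example via `.numpy()` on a Paddle tensor), and the helper names `tensor_md5` and `diff_stats` are hypothetical.

```python
import hashlib

import numpy as np


def tensor_md5(x: np.ndarray) -> str:
    """MD5 of the raw bytes; equal digests mean bitwise-identical data."""
    return hashlib.md5(np.ascontiguousarray(x).tobytes()).hexdigest()


def diff_stats(a: np.ndarray, b: np.ndarray):
    """Return (count of differing elements, fraction differing, max abs diff)."""
    # Subtract in float64 so the comparison itself adds no rounding error.
    delta = a.astype(np.float64) - b.astype(np.float64)
    n_diff = int(np.count_nonzero(delta))
    max_abs = float(np.abs(delta).max())
    return n_diff, n_diff / delta.size, max_abs


# Illustrative data only (hypothetical, not the PR's tensors).
a = np.zeros((10, 4096), dtype=np.float32)
b = a.copy()
b[0, :3] = 1e-4  # perturb three elements

print(tensor_md5(a) == tensor_md5(b))
print(diff_stats(a, b))
```

A matching md5sum is a stricter condition than a small diff: a single differing bit changes the digest, which is why the element count and the maximum difference are reported alongside it.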
End-to-end impact:


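Inputs and parameter initialization are kept identical across the two runs.
Looking only at the first loss, the absolute error is 1e-3 and the relative error is around 1e-5.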
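For reference, the absolute and relative error between the first loss of two runs can be computed as in the sketch below. The function name `loss_errors` is hypothetical, and the sketch assumes the usual definition of relative error, normalized by the reference value; the PR does not state which definition it used.

```python
def loss_errors(loss_ref: float, loss_new: float):
    """Return (absolute error, relative error) of loss_new against loss_ref."""
    abs_err = abs(loss_new - loss_ref)
    if loss_ref == 0:
        # Relative error is undefined for a zero reference; report 0 only if equal.
        rel_err = 0.0 if abs_err == 0 else float("inf")
    else:
        rel_err = abs_err / abs(loss_ref)
    return abs_err, rel_err
```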