[LLM] Add MTP for Deepseekv3 by DrownFish19 · Pull Request #9876 · PaddlePaddle/PaddleNLP · GitHub

[LLM] Add MTP for Deepseekv3 #9876


Merged — 10 commits merged into PaddlePaddle:develop on Feb 25, 2025

Conversation

DrownFish19
Collaborator

Before submitting

  • Lint the code. If there are lint issues, please format the code first:

    # Install and register `pre-commit` in the project folder
    pip install pre-commit && pre-commit install

    # Process previously changed code files separately
    pre-commit run --file XXXX.py

  • Add test cases into the tests folder. If there are codecov issues, please add test cases first.

PR types

New features

PR changes

Models

Description

  1. Add MTP (multi-token prediction) for Deepseekv3.

paddle-bot bot commented Feb 17, 2025

Thanks for your contribution!

codecov bot commented Feb 17, 2025

Codecov Report

Attention: Patch coverage is 12.71186% with 206 lines in your changes missing coverage. Please review.

Project coverage is 51.28%. Comparing base (30df8b6) to head (f1676df).
Report is 338 commits behind head on develop.

Files with missing lines                                 Patch %    Lines
paddlenlp/transformers/deepseek_v2/modeling.py           3.57%      108 Missing ⚠️
paddlenlp/transformers/deepseek_v2/modeling_pp.py        7.69%      84 Missing ⚠️
paddlenlp/transformers/moe_gate.py                       42.85%     8 Missing ⚠️
paddlenlp/transformers/deepseek_v3/modeling.py           0.00%      3 Missing ⚠️
paddlenlp/transformers/moe_layer.py                      71.42%     2 Missing ⚠️
...addlenlp/transformers/deepseek_v2/configuration.py    0.00%      1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9876      +/-   ##
===========================================
- Coverage    51.34%   51.28%   -0.07%     
===========================================
  Files          745      745              
  Lines       118590   118778     +188     
===========================================
+ Hits         60886    60910      +24     
- Misses       57704    57868     +164     

@@ -682,13 +687,14 @@ def __init__(self, config, num_experts, expert_hidden_size, **kwargs):
            dtype=paddle.get_default_dtype(),
            default_initializer=nn.initializer.Constant(0.0),
        )
        self.e_score_correction_bias.is_distributed = True
Collaborator

Does this have a gradient?

Collaborator Author

Yes, it has a gradient and needs to be updated during training.
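
For context, here is a minimal sketch of how such a trainable correction bias can be declared in a Paddle layer; the wrapper class and shape below are illustrative, not the PR's exact code:

import paddle
import paddle.nn as nn

class MoEGateWithBias(nn.Layer):  # hypothetical wrapper for illustration
    def __init__(self, num_experts):
        super().__init__()
        # Trainable per-expert correction bias; it receives gradients and is
        # updated by the optimizer, per the discussion above.
        self.e_score_correction_bias = self.create_parameter(
            shape=[num_experts],
            dtype=paddle.get_default_dtype(),
            default_initializer=nn.initializer.Constant(0.0),
        )
        # Flag consumed by Paddle's hybrid-parallel machinery, as in the diff.
        self.e_score_correction_bias.is_distributed = True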

        k_pe = GatherOp.apply(k_pe)
        k_pe = k_pe.reshape([-1, q_len, 1, self.qk_rope_head_dim]).expand(
            [-1, q_len, self.num_heads, self.qk_rope_head_dim]
        )
Collaborator

Is this the fix for sequence parallel (SP)?

Collaborator Author

SP isn't fixed yet; part of the code is kept here for now.
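
For reference, a toy sketch of the reshape/expand above using plain tensors; the shapes are illustrative and the distributed GatherOp is omitted:

import paddle

batch, q_len, num_heads, qk_rope_head_dim = 2, 8, 4, 16
# After the gather, k_pe is flat over (batch * q_len); it carries one shared
# RoPE key per position, which is then broadcast across all attention heads.
k_pe = paddle.randn([batch * q_len, qk_rope_head_dim])
k_pe = k_pe.reshape([-1, q_len, 1, qk_rope_head_dim]).expand(
    [-1, q_len, num_heads, qk_rope_head_dim]
)
print(k_pe.shape)  # [2, 8, 4, 16]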

-        key_states[:, :, :, : self.qk_nope_head_dim] = k_nope
-        key_states[:, :, :, self.qk_nope_head_dim :] = k_pe
+        query_states = paddle.concat([q_nope, q_pe], axis=-1)
+        key_states = paddle.concat([k_nope, k_pe], axis=-1)
Collaborator

Did the implementation change here?

Collaborator Author

This follows the auto-parallel implementation; the results are identical.
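
A quick self-contained check (toy shapes, assumed names) that the concat matches the old slice-assignment result:

import paddle

batch, q_len, heads, nope_dim, rope_dim = 1, 4, 2, 8, 4
k_nope = paddle.randn([batch, q_len, heads, nope_dim])
k_pe = paddle.randn([batch, q_len, heads, rope_dim])

# Old implementation: preallocate, then fill the two halves by slicing.
key_a = paddle.zeros([batch, q_len, heads, nope_dim + rope_dim])
key_a[:, :, :, :nope_dim] = k_nope
key_a[:, :, :, nope_dim:] = k_pe

# New implementation: a single concat along the last axis.
key_b = paddle.concat([k_nope, k_pe], axis=-1)

assert bool(paddle.allclose(key_a, key_b))  # identical results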

        if self.sequence_parallel:
            inputs_embeds = inputs_embeds.reshape([-1, inputs_embeds.shape[-1]])
            inputs_embeds = ScatterOp.apply(inputs_embeds)
        return return_args(inputs_embeds, attention_mask, attn_mask_startend_row_indices, position_ids)
Collaborator

There may be a latent GPU-memory issue here. Right now input_emb and mtp_emb travel the whole pipeline-parallel (PP) path and are sent to later stages together, right?

Collaborator Author

There does indeed seem to be a memory issue: every layer carries mtp_emb. Let me think this part over.

Collaborator Author

There is no better way to compute this for now. Later we could try sending the complete input_embed onward directly, which should only cost one extra hidden_state.
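
A rough back-of-the-envelope for that single extra hidden_state per pipeline send, with hypothetical numbers (batch 1, sequence length 4096, DeepSeek-V3's 7168 hidden size, bf16):

# Hypothetical sizes, only to gauge the order of magnitude.
batch, seq_len, hidden, bytes_per_elem = 1, 4096, 7168, 2
extra_mib = batch * seq_len * hidden * bytes_per_elem / 2**20
print(f"~{extra_mib:.0f} MiB per extra hidden_state")  # ~56 MiB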



class DeepseekV2DecoderLayerPipe(DeepseekV2DecoderLayer):
    def forward(self, args):
        hidden_states, attention_mask, attn_mask_startend_row_indices, position_ids = parse_args(args)

        if self.config.num_nextn_predict_layers > 0:
            hidden_states_list = paddle.split(hidden_states, self.config.num_nextn_predict_layers + 1)
            inputs_embeds_mtp = hidden_states_list[-self.config.num_nextn_predict_layers :]
Collaborator

This is quite inefficient. If the embedding weight is shared with the last layer, the lookup should happen at the last layer instead; check whether the last layer can fetch the embeddings by label index.

Collaborator Author

done
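
For clarity, the split in the Pipe forward above cuts the packed tensor into (num_nextn_predict_layers + 1) equal chunks along axis 0 (paddle.split's default axis); a toy sketch with assumed shapes:

import paddle

num_nextn = 1  # stands in for config.num_nextn_predict_layers
seq_len, hidden = 8, 16
# Main-path hidden states plus MTP embeddings arrive packed along axis 0.
packed = paddle.randn([(num_nextn + 1) * seq_len, hidden])
pieces = paddle.split(packed, num_nextn + 1)  # axis=0 by default
hidden_states = pieces[0]                # [8, 16]
inputs_embeds_mtp = pieces[-num_nextn:]  # list with the MTP chunk(s)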


        inputs_embeds_cur_depth = paddle.concat(
            [inputs_embeds_ori[:, (nextn + 1) :, :], inputs_embeds_extra[:, : (nextn + 1), :]], axis=1
        )
Collaborator

It would be better to re-do the embedding lookup here; the memory consumption would be smaller.

Collaborator Author

It's a compute-versus-memory trade-off. The current approach uses extra memory of [batch_size, n, hidden_size], where n is the number of MTP layers; that overhead is still acceptable.

        hidden_states = hidden_states.reshape([-1, seq_length, hidden_states.shape[-1]])

        inputs_embeds_cur_depth = paddle.concat(
            [inputs_embeds_ori[:, (nextn + 1) :, :], inputs_embeds_extra[:, : (nextn + 1), :]], axis=1
        )
Collaborator

What is this concat joining?

Collaborator Author

The input is [1, 2, 3, 4, 5], and the embeddings cover [1, 2, 3, 4, 5]. Tokens [1, 2, 3, 4] go through the forward pass exactly as in the standard decoder architecture, while the MTP layer processes [2, 3, 4, 5]. The concat here stitches [2, 3, 4] together with [5].
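
That token shift is easy to verify with a toy tensor, where scalar "embeddings" stand in for real ones (names as in the snippet above):

import paddle

nextn = 0  # MTP depth 0
# Main input covers tokens [1, 2, 3, 4]; the extra slot carries token [5].
inputs_embeds_ori = paddle.to_tensor([[[1.0], [2.0], [3.0], [4.0]]])  # [1, 4, 1]
inputs_embeds_extra = paddle.to_tensor([[[5.0]]])                     # [1, 1, 1]
inputs_embeds_cur_depth = paddle.concat(
    [inputs_embeds_ori[:, (nextn + 1) :, :], inputs_embeds_extra[:, : (nextn + 1), :]], axis=1
)
print(inputs_embeds_cur_depth.squeeze().tolist())  # [2.0, 3.0, 4.0, 5.0]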

-        return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
+        return tuple(
+            v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns, mtp_outputs] if v is not None
+        )
         return BaseModelOutputWithPast(
             last_hidden_state=hidden_states,
             past_key_values=next_cache,
Collaborator

Should this also return mtp_outputs?

Collaborator Author

done
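
One plausible way to carry mtp_outputs in the structured return as well; the subclass name and field below are hypothetical sketches, not necessarily the merged code:

from dataclasses import dataclass
from typing import Optional

from paddlenlp.transformers.model_outputs import BaseModelOutputWithPast

@dataclass
class BaseModelOutputWithPastAndMTP(BaseModelOutputWithPast):  # hypothetical
    # Hidden states produced by the MTP layers, one entry per predict depth.
    mtp_outputs: Optional[list] = None

# ...then, in the model's forward:
# return BaseModelOutputWithPastAndMTP(
#     last_hidden_state=hidden_states,
#     past_key_values=next_cache,
#     hidden_states=all_hidden_states,
#     attentions=all_self_attns,
#     mtp_outputs=mtp_outputs,
# )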

@ZHUI ZHUI merged commit 7acaf18 into PaddlePaddle:develop Feb 25, 2025
8 of 12 checks passed
@DrownFish19 DrownFish19 deleted the dev_20250214_deepseek_mtp branch February 25, 2025 05:36