deepseek-ai/DeepSeek-V3 Mixture-of-Experts (MoE) Language Model

DeepSeek-V3 is a strong Mixture-of-Experts (MoE) language model developed by DeepSeek-AI, with 671B total parameters of which 37B are activated for each token 1

Core Technical Architecture

Innovative Architecture Design

DeepSeek-V3 is built on three core techniques 2

  1. Multi-head Latent Attention (MLA) – an efficient attention mechanism
  2. DeepSeekMoE architecture – the Mixture-of-Experts design, paired with an auxiliary-loss-free load-balancing strategy (see the routing sketch after this list)
  3. Multi-Token Prediction (MTP) – a training objective for stronger performance that can also enable speculative decoding to accelerate inference
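
To make the MoE design concrete, below is a minimal, hypothetical sketch of top-k expert routing with the bias-based, auxiliary-loss-free balancing idea. All names, shapes, and update-rule details are illustrative assumptions, not the repository's actual code.

```python
# Illustrative sketch of top-k MoE routing with bias-based, auxiliary-loss-free
# load balancing in the spirit of DeepSeekMoE. Names and sizes are hypothetical.
import torch

def route_tokens(hidden, gate_weight, expert_bias, top_k=8):
    # Affinity score of each token for each expert.
    scores = torch.sigmoid(hidden @ gate_weight.T)            # (n_tokens, n_experts)
    # The bias influences which experts are *selected* ...
    topk_idx = torch.topk(scores + expert_bias, top_k, dim=-1).indices
    # ... but the combine weights come from the unbiased scores.
    gate = torch.gather(scores, -1, topk_idx)
    gate = gate / gate.sum(dim=-1, keepdim=True)
    return topk_idx, gate

def update_bias(expert_bias, tokens_per_expert, step=1e-3):
    # Underloaded experts get their bias raised, overloaded ones lowered,
    # replacing an auxiliary balancing loss term.
    load = tokens_per_expert.float()
    return expert_bias + step * torch.sign(load.mean() - load)
```

The key design point is that the bias term only steers expert selection toward balanced load; the output combination still uses the unbiased gating scores, so balancing does not distort the model's predictions.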

Training Efficiency

The model was pre-trained on 14.8 trillion diverse, high-quality tokens, and its full training required only 2.788M H800 GPU hours 3 . Training was remarkably stable: no irrecoverable loss spikes occurred and no rollbacks were performed throughout the entire run 4

Model Specifications

| Model | Total Params | Activated Params | Context Length | Download |
| :--- | :--- | :--- | :--- | :--- |
| DeepSeek-V3-Base | 671B | 37B | 128K | Hugging Face 5 |
| DeepSeek-V3 | 671B | 37B | 128K | Hugging Face 6 |

The total size of the DeepSeek-V3 models is 685B parameters: 671B for the main model weights plus 14B for the Multi-Token Prediction (MTP) module weights 7

Performance

DeepSeek-V3 achieves the best performance on most benchmarks, especially on math and code tasks 8 . Base-model scores include:

  • Math: 89.3% on GSM8K, 61.6% on MATH
  • Code: 65.2% on HumanEval, 75.4% on MBPP
  • General reasoning: 87.1% on MMLU, 87.5% on BBH

In chat model evaluations, DeepSeek-V3 stands out on open-ended generation tasks, scoring 85.5 on Arena-Hard and 70.0 on AlpacaEval 2.0 9

Deployment Options

Supported Frameworks

DeepSeek-V3 can be deployed with a variety of frameworks 10 ; an illustrative vLLM example follows the list.

  1. DeepSeek-Infer Demo – a simple, lightweight demo for FP8 and BF16 inference
  2. SGLang – fully supports both BF16 and FP8 inference modes, with Multi-Token Prediction coming soon
  3. LMDeploy – efficient FP8 and BF16 inference for local and cloud deployment
  4. TensorRT-LLM – supports BF16 inference and INT4/8 quantization, with FP8 support coming soon
  5. vLLM – FP8 and BF16 modes with tensor parallelism and pipeline parallelism
  6. LightLLM – efficient single-node or multi-node deployment for FP8 and BF16
  7. AMD GPU – runs via SGLang in both BF16 and FP8 modes
  8. Huawei Ascend NPU – supports INT8 and BF16
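
As one concrete example, the sketch below shows what serving could look like with vLLM's offline Python API. The tensor-parallel degree and sampling settings are assumptions that depend on the available hardware, not values from the repository.

```python
# Hedged vLLM sketch; tensor_parallel_size and sampling settings are
# assumptions and must match the actual GPU cluster available.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=8,   # assumed multi-GPU setup
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain Mixture-of-Experts in one paragraph."], params)
print(outputs[0].outputs[0].text)
```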

Weight Format

The model weights are natively provided in FP8 format with 128×128 block scaling 11 . If BF16 weights are needed, they can be produced with the conversion script shown below 12
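
The conversion command, as given in the repository's README (citation 12):

```shell
cd inference
python fp8_cast_bf16.py --input-fp8-hf-path /path/to/fp8_weights --output-bf16-hf-path /path/to/bf16_weights
```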

License

The DeepSeek-V3 series (including Base and Chat) supports commercial use 13 . The code repository is licensed under the MIT License, while use of the models is subject to the Model License 14

Use Restrictions

Use of the model is subject to the following restrictions 15

  • Must not violate applicable laws or regulations
  • Must not be used for military purposes
  • Must not be used to exploit or harm minors
  • Must not generate verifiably false information with the purpose of harming others
  • Must not be used for discriminatory or harmful automated decision-making

How to Access

  • Chat with DeepSeek-V3 on DeepSeek's official website: chat.deepseek.com
  • An OpenAI-compatible API is available on the DeepSeek Platform: platform.deepseek.com (see the sketch below)

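Because the platform API is OpenAI-compatible, the official openai Python client can be pointed at it. The base URL and model name below follow DeepSeek's published platform conventions but should be verified against the platform documentation.

```python
# Hedged sketch: base_url and model name follow DeepSeek's platform
# conventions; confirm both in the platform documentation before use.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",  # OpenAI-compatible endpoint
    api_key="YOUR_DEEPSEEK_API_KEY",
)
response = client.chat.completions.create(
    model="deepseek-chat",  # chat model served by the platform
    messages=[{"role": "user", "content": "Hello, DeepSeek-V3!"}],
)
print(response.choices[0].message.content)
```
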
Notes

DeepSeek-V3 is among the most capable open-source language models available, delivering high performance while substantially reducing training cost. Its architectural and training innovations point to a new direction for large-scale language models, and its support for many hardware platforms and deployment frameworks gives developers and enterprises flexible options.

Citations

File: README.md (L47-47)

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. 

File: README.md (L48-49)

To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. 
Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. 

File: README.md (L50-52)

We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. 
Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models.
Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training.

File: README.md (L53-54)

In addition, its training process is remarkably stable. 
Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. 

File: README.md (L93-93)

| DeepSeek-V3-Base | 671B | 37B | 128K   | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base)   |

File: README.md (L94-94)

| DeepSeek-V3   | 671B | 37B |  128K   | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V3)   |

File: README.md (L99-99)

> The total size of DeepSeek-V3 models on Hugging Face is 685B, which includes 671B of the Main Model weights and 14B of the Multi-Token Prediction (MTP) Module weights.

File: README.md (L153-153)

> Best results are shown in bold. Scores with a gap not exceeding 0.3 are considered to be at the same level. DeepSeek-V3 achieves the best performance on most benchmarks, especially on math and code tasks.

File: README.md (L214-214)

| DeepSeek-V3 | **85.5** | **70.0** |

File: README.md (L223-223)

You can chat with DeepSeek-V3 on DeepSeek's official website: [chat.deepseek.com](https://chat.deepseek.com/sign_in)

File: README.md (L225-225)

We also provide OpenAI-Compatible API at DeepSeek Platform: [platform.deepseek.com](https://platform.deepseek.com/)

File: README.md (L231-238)

1. **DeepSeek-Infer Demo**: We provide a simple and lightweight demo for FP8 and BF16 inference.
2. **SGLang**: Fully support the DeepSeek-V3 model in both BF16 and FP8 inference modes, with Multi-Token Prediction [coming soon](https://github.com/sgl-project/sglang/issues/2591).
3. **LMDeploy**: Enables efficient FP8 and BF16 inference for local and cloud deployment.
4. **TensorRT-LLM**: Currently supports BF16 inference and INT4/8 quantization, with FP8 support coming soon.
5. **vLLM**: Support DeepSeek-V3 model with FP8 and BF16 modes for tensor parallelism and pipeline parallelism.
6. **LightLLM**: Supports efficient single-node or multi-node deployment for FP8 and BF16.
7. **AMD GPU**: Enables running the DeepSeek-V3 model on AMD GPUs via SGLang in both BF16 and FP8 modes.
8. **Huawei Ascend NPU**: Supports running DeepSeek-V3 on Huawei Ascend devices in both INT8 and BF16.

File: README.md (L244-247)

```shell
cd inference
python fp8_cast_bf16.py --input-fp8-hf-path /path/to/fp8_weights --output-bf16-hf-path /path/to/bf16_weights
```

File: README.md (L345-345)

This code repository is licensed under the MIT License. The use of DeepSeek-V3 Base/Chat models is subject to the Model License. DeepSeek-V3 series (including Base and Chat) supports commercial use.

File: README_WEIGHTS.md (L62-62)

DeepSeek-V3 natively supports FP8 weight format with 128×128 block scaling.

File: LICENSE-MODEL (L37-39)

  1. Grant of Copyright License. Subject to the terms and conditions of this License, DeepSeek hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare, publicly display, publicly perform, sublicense, and distribute the Complementary Material, the Model, and Derivatives of the Model.
  2. Grant of Patent License. Subject to the terms and conditions of this License and where and as applicable, DeepSeek hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this paragraph) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Model and the Complementary Material, where such license applies only to those patent claims licensable by DeepSeek that are necessarily infringed by its contribution(s). If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Model and/or Complementary Material constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for … (truncated)
File: LICENSE-MODEL (L79-90)
You agree not to use the Model or Derivatives of the Model:

  • In any way that violates any applicable national or international law or regulation or infringes upon the lawful rights and interests of any third party;
  • For military use in any way;
  • For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
  • To generate or disseminate verifiably false information and/or content with the purpose of harming others;
  • To generate or disseminate inappropriate content subject to applicable regulatory requirements;
  • To generate or disseminate personal identifiable information without due authorization or for unreasonable use;
  • To defame, disparage or otherwise harass others;
  • For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation;
  • For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics;
  • To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;