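Reading a Model's Architecture from config.json
In the previous sections we built everything from scratch: we wrote a tokenizer, set up an embedding layer, and stacked Transformer blocks. Every structural parameter was hard-coded in Python, so changing the hidden dimension meant editing the code again. Real-world large models are not organized or distributed this way.
This section opens the SmolLM2-135M repository to see which files make up a modern LLM, what each file is responsible for, and how those files are used to load and run the model. The shift is from "writing code that defines the model" to "reading configuration that describes the model".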
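In a typical HuggingFace model repository (such as SmolLM2-135M), the weight file (.safetensors) is only half of the picture. The other half is a handful of JSON configuration files, each with its own responsibility:
- config.json describes what the model looks like: how many layers, how wide, how many heads
- tokenizer_config.json describes how text becomes numbers: whether to add BOS/EOS, the maximum sequence length
- tokenizer.json stores the BPE vocabulary and merge rules; it is the "data file" behind tokenizer_config.json
- generation_config.json describes how the model produces output: temperature, top_p, top_k
Together, these files form a complete specification of the model.
Each structural parameter in config.json maps to a concrete PyTorch module.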
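1. Repository file map
A HuggingFace model repository usually contains the files below. config.json is required; the rest depend on the model type and how it was configured.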
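files = [
    ("config.json",             "required", "Model architecture: layer count, dimensions, head count, activation, etc."),
    ("tokenizer_config.json",   "standard", "Tokenizer behavior: special tokens, max length, truncation/padding policy"),
    ("generation_config.json",  "standard", "Generation defaults: temperature, top_p, top_k, repetition_penalty"),
    ("tokenizer.json",          "standard", "Tokenizer model file (BPE vocab + merge rules), usually a few MB"),
    ("special_tokens_map.json", "optional", "Maps special-token roles (bos_token, eos_token, ...) to token strings"),
    ("vocab.json",              "some",     "BPE vocabulary (e.g. GPT-2): token string to ID"),
    ("merges.txt",              "some",     "BPE merge rules (e.g. GPT-2), in priority order"),
]
print(f"{'File':<28} {'Status':<10} {'Purpose'}")
print("-" * 80)
for name, required, purpose in files:
    print(f"{name:<28} {required:<10} {purpose}")

File                         Status     Purpose
--------------------------------------------------------------------------------
config.json                  required   Model architecture: layer count, dimensions, head count, activation, etc.
tokenizer_config.json        standard   Tokenizer behavior: special tokens, max length, truncation/padding policy
generation_config.json       standard   Generation defaults: temperature, top_p, top_k, repetition_penalty
tokenizer.json               standard   Tokenizer model file (BPE vocab + merge rules), usually a few MB
special_tokens_map.json      optional   Maps special-token roles (bos_token, eos_token, ...) to token strings
vocab.json                   some       BPE vocabulary (e.g. GPT-2): token string to ID
merges.txt                   some       BPE merge rules (e.g. GPT-2), in priority order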
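To make the generation_config.json row more concrete, here is a rough pure-Python sketch of how temperature, top_k and top_p reshape a next-token distribution. The function name filter_distribution and the order of the steps are assumptions made for illustration; real libraries differ in details, so treat this as a sketch rather than any library's actual implementation.

```python
import math

def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide the logits before the softmax (lower = sharper).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = {i: e / total for i, e in enumerate(exps)}

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    ranked = sorted(probs, key=probs.get, reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)
    probs = {i: probs[i] / mass for i in ranked}

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.0, -1.0]
print(filter_distribution(logits))                              # all four tokens survive
print(filter_distribution(logits, top_k=2))                     # only tokens 0 and 1 survive
print(filter_distribution(logits, temperature=0.5, top_p=0.8))  # only token 0 survives
```

Sampling then draws one token id from the returned dictionary; the model weights are not involved in any of these steps, which is why the settings live in their own file.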
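2. config.json: turning structural parameters into PyTorch modules
Below is an excerpt of SmolLM2-135M's config.json, written as a Python dict. The following subsections do not just print what each value means; they use the values to build the corresponding PyTorch modules, check shapes, and tally parameter counts.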
config = {
"architectures": ["LlamaForCausalLM"],
"hidden_size": 576,
"intermediate_size": 1536,
"num_attention_heads": 9,
"num_key_value_heads": 3,
"num_hidden_layers": 30,
"vocab_size": 49152,
"max_position_embeddings": 8192,
"hidden_act": "silu",
"rms_norm_eps": 1e-05,
"rope_theta": 100000,
"tie_word_embeddings": True,
"attention_bias": False,
}
V, D, L, H, KV, FF = (config[k] for k in (
"vocab_size", "hidden_size", "num_hidden_layers",
"num_attention_heads", "num_key_value_heads", "intermediate_size"))
head_dim = D // H
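In practice these values would be read from the config.json file on disk rather than typed in. A minimal sketch, assuming a locally available file: the helper names load_config and derived_shapes are hypothetical, a temporary file stands in for a downloaded repository, and the divisibility checks reflect this section's convention that head_dim = hidden_size // num_attention_heads.

```python
import json
import tempfile
from pathlib import Path

def load_config(path):
    """Read a config.json file and return it as a plain dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def derived_shapes(cfg):
    """Derive the quantities used in this section, with basic sanity checks."""
    d = cfg["hidden_size"]
    h = cfg["num_attention_heads"]
    kv = cfg["num_key_value_heads"]
    if d % h != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    if h % kv != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    head_dim = d // h
    return {"head_dim": head_dim, "q_heads_per_kv_head": h // kv, "kv_width": kv * head_dim}

# Stand-in for a downloaded repository: write a small config.json to a temp dir.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.json"
    path.write_text(json.dumps({"hidden_size": 576,
                                "num_attention_heads": 9,
                                "num_key_value_heads": 3}), encoding="utf-8")
    shapes = derived_shapes(load_config(path))

print(shapes)  # {'head_dim': 64, 'q_heads_per_kv_head': 3, 'kv_width': 192}
```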
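2.1 TransformerBlock: assembling a complete block from config
The parameters in config.json ultimately end up inside a TransformerBlock: Attention (GQA) + FFN (SwiGLU) + RMSNorm.
Below we take PyTorch's built-in modules (nn.Linear, nn.RMSNorm), wire them together with the numbers from config, and print the resulting structure.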
import torch.nn as nn
import torch.nn.functional as F

class TransformerBlock(nn.Module):
    """TransformerBlock: RMSNorm -> GQA Attention -> RMSNorm -> SwiGLU FFN"""
    def __init__(self, config):
        super().__init__()
        d = config['hidden_size']
        ff = config['intermediate_size']
        h = config['num_attention_heads']
        kv = config['num_key_value_heads']
        hd = d // h
        bias = config['attention_bias']
        self.attn_norm = nn.RMSNorm(d, eps=config['rms_norm_eps'])
        self.ffn_norm = nn.RMSNorm(d, eps=config['rms_norm_eps'])
        # Four attention projections
        self.q_proj = nn.Linear(d, h * hd, bias=bias)
        self.k_proj = nn.Linear(d, kv * hd, bias=bias)
        self.v_proj = nn.Linear(d, kv * hd, bias=bias)
        self.o_proj = nn.Linear(h * hd, d, bias=bias)
        # Three FFN projections
        self.gate = nn.Linear(d, ff, bias=False)
        self.up = nn.Linear(d, ff, bias=False)
        self.down = nn.Linear(ff, d, bias=False)
        self.n_heads = h
        self.n_kv_heads = kv
        self.head_dim = hd

    def forward(self, x):
        # Attention (simplified: no causal mask and no RoPE)
        residual = x
        x = self.attn_norm(x)
        B, T, D = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # GQA broadcast: repeat each KV head so it serves a group of Q heads
        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        scale = self.head_dim ** -0.5
        attn = (q @ k.transpose(-2, -1)) * scale
        attn = F.softmax(attn, dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(B, T, D)
        x = residual + self.o_proj(out)
        # FFN
        residual = x
        x = self.ffn_norm(x)
        x = residual + self.down(F.silu(self.gate(x)) * self.up(x))
        return x

# Build one block from the SmolLM2 config
block = TransformerBlock(config)
print("=== Full structure of a single TransformerBlock ===")
print(block)
block_params = sum(p.numel() for p in block.parameters())
print(f"\nParameters in this block: {block_params:,} ({block_params/1e6:.2f}M)")

=== Full structure of a single TransformerBlock ===
TransformerBlock(
  (attn_norm): RMSNorm((576,), eps=1e-05, elementwise_affine=True)
  (ffn_norm): RMSNorm((576,), eps=1e-05, elementwise_affine=True)
  (q_proj): Linear(in_features=576, out_features=576, bias=False)
  (k_proj): Linear(in_features=576, out_features=192, bias=False)
  (v_proj): Linear(in_features=576, out_features=192, bias=False)
  (o_proj): Linear(in_features=576, out_features=576, bias=False)
  (gate): Linear(in_features=576, out_features=1536, bias=False)
  (up): Linear(in_features=576, out_features=1536, bias=False)
  (down): Linear(in_features=1536, out_features=576, bias=False)
)

Parameters in this block: 3,540,096 (3.54M)
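The simplified block leaves out RoPE, so the rope_theta entry in config goes unused here. As a hedged aside, the sketch below shows how rope_theta and head_dim are commonly turned into per-pair rotation frequencies; the function name rope_inverse_frequencies is hypothetical, and the formula is the standard RoPE one rather than something this section derives.

```python
def rope_inverse_frequencies(head_dim, theta):
    """One rotation frequency per pair of dimensions: theta ** (-2i / head_dim)."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]

inv_freq = rope_inverse_frequencies(head_dim=64, theta=100000)
print(len(inv_freq))   # 32 pairs for head_dim = 64
print(inv_freq[0])     # 1.0: the fastest-rotating pair
print(inv_freq[-1])    # the slowest-rotating pair

# The rotation angle applied to pair i at sequence position pos is pos * inv_freq[i].
pos = 10
angles = [pos * f for f in inv_freq]
```

A larger rope_theta stretches the slow end of this spectrum, so distant positions stay distinguishable over longer contexts; RoPE itself adds no trainable parameters, so the counts in this section are unaffected.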
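2.2 Attention: GQA makes the K/V projections smaller than Q
Here num_attention_heads=9, num_key_value_heads=3, and attention_bias=false. The Q projection maps 576 features to 9×64 = 576, while the K and V projections each map 576 to 3×64 = 192, so K and V each have only one third of Q's parameters. Build the four projection matrices and verify this directly from their shapes.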
import torch
import torch.nn as nn

torch.manual_seed(42)

# The four attention projections (attention_bias=false, so no bias terms)
W_q = nn.Linear(D, H * head_dim, bias=False)
W_k = nn.Linear(D, KV * head_dim, bias=False)
W_v = nn.Linear(D, KV * head_dim, bias=False)
W_o = nn.Linear(H * head_dim, D, bias=False)

x = torch.randn(2, 16, D)  # simulated hidden states: [batch, seq, hidden]
q = W_q(x).view(2, 16, H, head_dim).transpose(1, 2)   # [2, 9, 16, 64]
k = W_k(x).view(2, 16, KV, head_dim).transpose(1, 2)  # [2, 3, 16, 64]
v = W_v(x).view(2, 16, KV, head_dim).transpose(1, 2)  # [2, 3, 16, 64]

print(f"Input: [2, 16, {D}]")
print(f"Q projection: {list(W_q.weight.shape)} → Q shape: {list(q.shape)}")
print(f"K projection: {list(W_k.weight.shape)} → K shape: {list(k.shape)}")
print(f"V projection: {list(W_v.weight.shape)} → V shape: {list(v.shape)}")
print(f"O projection: {list(W_o.weight.shape)}")
print()
q_p = sum(p.numel() for p in W_q.parameters())
k_p = sum(p.numel() for p in W_k.parameters())
print(f"Q params: {q_p:,}  K params: {k_p:,}  K/Q = {k_p/q_p:.2f}")
print(f"With MHA (9 heads each for Q, K and V), K would also have {q_p:,} params")
# K and V each shrink by (q_p - k_p) per layer, hence the factor of 2
print(f"GQA saves {2 * (q_p - k_p) * L:,} K+V params ({L} layers in total)")
The core of GQA is that, during the attention computation, each KV head is "broadcast" to a group of Q heads. The small example below simulates this:
groups = H // KV  # 3 Q heads per KV head
k_repeated = k.repeat_interleave(groups, dim=1)  # [2, 3, 16, 64] → [2, 9, 16, 64]
v_repeated = v.repeat_interleave(groups, dim=1)
# Q and K now have the same number of heads, so attention proceeds as usual
scale = head_dim ** -0.5
attn_weights = (q @ k_repeated.transpose(-2, -1)) * scale  # [2, 9, 16, 16]
print(f"K original shape: {list(k.shape)} → repeat_interleave({groups}) → {list(k_repeated.shape)}")
print(f"V original shape: {list(v.shape)} → repeat_interleave({groups}) → {list(v_repeated.shape)}")
print(f"QK^T result: {list(attn_weights.shape)}")
print("\nGrouping:")
for kv_idx in range(KV):
    q_idx = list(range(kv_idx * groups, (kv_idx + 1) * groups))
    print(f"  KV[{kv_idx}] → Q{q_idx}")

K original shape: [2, 3, 16, 64] → repeat_interleave(3) → [2, 9, 16, 64]
V original shape: [2, 3, 16, 64] → repeat_interleave(3) → [2, 9, 16, 64]
QK^T result: [2, 9, 16, 16]

Grouping:
  KV[0] → Q[0, 1, 2]
  KV[1] → Q[3, 4, 5]
  KV[2] → Q[6, 7, 8]
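To see how the attention parameter count moves with num_key_value_heads, the following sketch compares three settings for the same hidden size and head count: 9 KV heads (plain multi-head attention), 3 (the GQA setting in this config) and 1 (multi-query attention). The helper attention_params is a hypothetical name, and it assumes bias-free projections as in this config.

```python
def attention_params(hidden_size, num_heads, num_kv_heads):
    """Parameter count of the bias-free Q, K, V, O projections in one layer."""
    head_dim = hidden_size // num_heads
    q = hidden_size * num_heads * head_dim
    k = v = hidden_size * num_kv_heads * head_dim
    o = num_heads * head_dim * hidden_size
    return q + k + v + o

for name, kv_heads in [("MHA", 9), ("GQA", 3), ("MQA", 1)]:
    print(f"{name} (kv_heads={kv_heads}): {attention_params(576, 9, kv_heads):,}")
# MHA (kv_heads=9): 1,327,104
# GQA (kv_heads=3): 884,736
# MQA (kv_heads=1): 737,280
```

Q and O are untouched by the KV head count; only the K and V projections shrink.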
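2.3 FFN: the three-weight structure of SwiGLU
When reading config, two numbers matter for the FFN:
- hidden_size: the uniform width inside a block, i.e. d_model in section 05
- intermediate_size: the width of the FFN's middle layer, i.e. d_ff in section 05
In SmolLM2-135M, hidden_size=576 and intermediate_size=1536. Each token is first projected up from 576 to 1536 dimensions, processed in that wider space, and then projected back down to 576.
The Llama-style FFN differs from the simple two-layer FFN of section 05: it has three weight matrices (gate, up, down) and uses SiLU as the activation.
Teaching FFN:
x → up_proj → ReLU/GELU → down_proj → out
SwiGLU FFN:
x → up_proj   ┐
x → gate_proj ├→ SiLU(gate) * up → down_proj → out
So the parameter counts differ too: the teaching version is essentially 2 matrices, SwiGLU is 3. Below we build it, run one forward pass, and check how the shapes change.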
import torch
import torch.nn as nn
import torch.nn.functional as F

class LlamaFFN(nn.Module):
    """Llama-style SwiGLU FFN: gate does the gating, up projects, down projects back"""
    def __init__(self, dim, intermediate_dim):
        super().__init__()
        self.gate = nn.Linear(dim, intermediate_dim, bias=False)
        self.up = nn.Linear(dim, intermediate_dim, bias=False)
        self.down = nn.Linear(intermediate_dim, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = LlamaFFN(D, FF)
x = torch.randn(2, 16, D)
out = ffn(x)
gate_p = sum(p.numel() for p in ffn.gate.parameters())
up_p = sum(p.numel() for p in ffn.up.parameters())
down_p = sum(p.numel() for p in ffn.down.parameters())
print("LlamaFFN structure:")
print(f"  gate: {list(ffn.gate.weight.shape)} ({gate_p:,} params)")
print(f"  up: {list(ffn.up.weight.shape)} ({up_p:,} params)")
print(f"  down: {list(ffn.down.weight.shape)} ({down_p:,} params)")
print(f"  total: {gate_p + up_p + down_p:,} params")
print(f"\nForward: [2, 16, {D}] → gate/up → [2, 16, {FF}] → SiLU · up → down → [2, 16, {D}]")
print(f"Input shape: {list(x.shape)}  Output shape: {list(out.shape)}")
print(f"Compared with section 05: two-weight FFN ({D}→{D*4}→{D}); here three weights ({D}→{FF}→{D}), with an extra gate")

LlamaFFN structure:
  gate: [1536, 576] (884,736 params)
  up: [1536, 576] (884,736 params)
  down: [576, 1536] (884,736 params)
  total: 2,654,208 params

Forward: [2, 16, 576] → gate/up → [2, 16, 1536] → SiLU · up → down → [2, 16, 576]
Input shape: [2, 16, 576]  Output shape: [2, 16, 576]
Compared with section 05: two-weight FFN (576→2304→576); here three weights (576→1536→576), with an extra gate
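One detail the numbers above make easy to check: 1536 is exactly 8/3 of 576, and at that ratio three SwiGLU matrices cost the same as the two matrices of a 4× two-layer FFN. The sketch below is just that arithmetic; the function names are hypothetical, and whether the ratio was chosen for this reason is not something the config itself states.

```python
def two_matrix_ffn_params(d_model, d_ff):
    """up + down, no biases."""
    return 2 * d_model * d_ff

def swiglu_ffn_params(d_model, d_ff):
    """gate + up + down, no biases."""
    return 3 * d_model * d_ff

d_model, d_ff = 576, 1536
print(d_ff * 3 == d_model * 8)                      # True: d_ff is exactly 8/3 of d_model
print(f"{two_matrix_ffn_params(d_model, 4 * d_model):,}")  # 2,654,208
print(f"{swiglu_ffn_params(d_model, d_ff):,}")             # 2,654,208
```

So the extra gate matrix does not make the FFN larger than the section 05 layout at the same hidden size; the middle layer is narrower instead.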
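2.4 RMSNorm: rescale only, no re-centering
Section 05 used LayerNorm (re-centering plus scaling). Most recent Llama-style LLMs use RMSNorm instead, which only rescales and skips the mean-subtraction step. The config sets rms_norm_eps=1e-05. PyTorch has shipped a built-in nn.RMSNorm since version 2.4, so we can print its structure directly and compare it with nn.LayerNorm.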
# Use PyTorch's built-in RMSNorm and LayerNorm directly (nn.RMSNorm needs PyTorch >= 2.4)
import torch
import torch.nn as nn

rn = nn.RMSNorm(D, eps=config['rms_norm_eps'])
ln = nn.LayerNorm(D, eps=config['rms_norm_eps'])
print("=== nn.RMSNorm structure ===")
print(rn)
print("\n=== nn.LayerNorm structure ===")
print(ln)
# Same random input for both
x = torch.randn(4, 8, D)
with torch.no_grad():
    ln_out = ln(x)
    rn_out = rn(x)
print(f"\nLayerNorm: weight + bias = {sum(p.numel() for p in ln.parameters())} parameters")
print(f"RMSNorm: weight only = {sum(p.numel() for p in rn.parameters())} parameters")
print(f"\nBefore normalization: mean={x.mean():.3f}, std={x.std():.3f}")
print(f"LayerNorm: mean={ln_out.mean():.6f}, std={ln_out.std():.3f} ← mean is 0")
print(f"RMSNorm: mean={rn_out.mean():.4f}, std={rn_out.std():.3f} ← mean is not 0")
n_norms = 2 * L + 1  # two norms per block, plus the final norm
print(f"\nEach norm drops {D} bias parameters; {n_norms} norms (2 × {L} blocks + final norm) drop {D * n_norms:,} in total")

=== nn.RMSNorm structure ===
RMSNorm((576,), eps=1e-05, elementwise_affine=True)

=== nn.LayerNorm structure ===
LayerNorm((576,), eps=1e-05, elementwise_affine=True)

LayerNorm: weight + bias = 1152 parameters
RMSNorm: weight only = 576 parameters

Before normalization: mean=-0.013, std=1.000
LayerNorm: mean=-0.000000, std=1.000 ← mean is 0
RMSNorm: mean=-0.0130, std=1.000 ← mean is not 0

Each norm drops 576 bias parameters; 61 norms (2 × 30 blocks + final norm) drop 35,136 in total
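The difference between the two normalizations is easiest to see with the formulas written out. The sketch below reimplements both in plain Python on a four-element vector; it is meant to mirror the definitions (RMSNorm divides by the root mean square, LayerNorm subtracts the mean and divides by the standard deviation), not to replace the PyTorch modules, and the function names are chosen here for illustration.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """y_i = x_i / sqrt(mean(x^2) + eps) * w_i  (no mean subtraction, no bias)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def layer_norm(x, weight, bias, eps=1e-5):
    """y_i = (x_i - mean) / sqrt(var + eps) * w_i + b_i."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(var + eps)
    return [(v - mean) / std * w + b for v, w, b in zip(x, weight, bias)]

x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
r = rms_norm(x, ones)
l = layer_norm(x, ones, zeros)

print(f"RMSNorm   mean={sum(r) / 4:.4f}  mean square={sum(v * v for v in r) / 4:.4f}")
print(f"LayerNorm mean={sum(l) / 4:.4f}  mean square={sum(v * v for v in l) / 4:.4f}")
```

Both outputs have a mean square close to 1, but only LayerNorm forces the mean to 0; RMSNorm keeps whatever offset the input had, scaled down.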
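2.5 Tallying the total parameter count
We now have the sizes of all the components. We verify the 135M figure in two steps:
first compute it term by term from formulas (theory), then build the full model and count with sum(p.numel()) (actual), and check that the two agree.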
import torch.nn as nn
import torch.nn.functional as F

# ========== Theory: compute each term from its formula ==========
# Attention per layer: the Q, K, V, O projections
q_p = D * H * head_dim    # Q: [D, H*head_dim]
k_p = D * KV * head_dim   # K: [D, KV*head_dim]
v_p = D * KV * head_dim   # V: [D, KV*head_dim]
o_p = H * head_dim * D    # O: [H*head_dim, D]
attn_p = q_p + k_p + v_p + o_p
# FFN per layer: the gate, up, down projections
ffn_p = 3 * D * FF
# Norms per layer: 2 RMSNorms, each with only D weights
norm_p = 2 * D
per_layer = attn_p + ffn_p + norm_p
total_theory = V * D + L * per_layer + D  # embedding + 30 layers + final norm
print("========== Theory (formulas) ==========")
print(f"Attention per layer (Q+K+V+O): {attn_p:>10,}")
print(f"FFN per layer (gate+up+down): {ffn_p:>10,}")
print(f"RMSNorm × 2 per layer: {norm_p:>10,}")
print(f"Per-layer subtotal: {per_layer:>10,} ≈ {per_layer/1e6:.2f}M")
print(f"\nEmbedding ({V}×{D}): {V*D:>10,} ≈ {V*D/1e6:.1f}M")
print(f"{L} blocks: {L*per_layer:>10,} ≈ {L*per_layer/1e6:.1f}M")
print(f"Final RMSNorm: {D:>10,}")
print(f"Theory total: {total_theory:>10,} ≈ {total_theory/1e6:.1f}M")

# ========== Actual: build the full model and count with sum(p.numel()) ==========
# Reuses TransformerBlock from section 2.1
class SmolLM2ConfigModel(nn.Module):
    """Assemble SmolLM2 from config: Embedding → 30×Block → RMSNorm → output projection"""
    def __init__(self, config):
        super().__init__()
        d, V = config['hidden_size'], config['vocab_size']
        self.embed = nn.Embedding(V, d)
        self.blocks = nn.ModuleList(
            [TransformerBlock(config) for _ in range(config['num_hidden_layers'])])
        self.final_norm = nn.RMSNorm(d, eps=config['rms_norm_eps'])

    def forward(self, x):
        x = self.embed(x)
        for blk in self.blocks:
            x = blk(x)
        x = self.final_norm(x)
        # tie_word_embeddings=True: the output projection reuses embed.weight, no separate lm_head
        return F.linear(x, self.embed.weight)

model = SmolLM2ConfigModel(config)
total_real = sum(p.numel() for p in model.parameters())
print("\n========== Actual (sum(p.numel())) ==========")
emb_real = sum(p.numel() for p in model.embed.parameters())
blocks_real = sum(p.numel() for p in model.blocks.parameters())
norm_real = sum(p.numel() for p in model.final_norm.parameters())
print(f"Embedding: {emb_real:>10,} ({V}×{D})")
print(f"{L}×Block: {blocks_real:>10,} ({L}×{per_layer:,})")
print(f"final_norm: {norm_real:>10,} ({D})")
print(f"{'─'*45}")
print(f"Actual total: {total_real:>10,} ≈ {total_real/1e6:.1f}M")
print("\n========== Comparison ==========")
print(f"Theory: {total_theory:,}")
print(f"Actual: {total_real:,}")
print(f"Match: {total_theory == total_real}")
if total_theory == total_real:
    print("✓ The formulas and PyTorch's actual parameter count agree exactly")
else:
    print(f"✗ Off by {abs(total_theory - total_real):,}, please check")

========== Theory (formulas) ==========
Attention per layer (Q+K+V+O):    884,736
FFN per layer (gate+up+down):  2,654,208
RMSNorm × 2 per layer:      1,152
Per-layer subtotal:  3,540,096 ≈ 3.54M

Embedding (49152×576): 28,311,552 ≈ 28.3M
30 blocks: 106,202,880 ≈ 106.2M
Final RMSNorm:        576
Theory total: 134,515,008 ≈ 134.5M

========== Actual (sum(p.numel())) ==========
Embedding: 28,311,552 (49152×576)
30×Block: 106,202,880 (30×3,540,096)
final_norm:        576 (576)
─────────────────────────────────────────────
Actual total: 134,515,008 ≈ 134.5M

========== Comparison ==========
Theory: 134,515,008
Actual: 134,515,008
Match: True
✓ The formulas and PyTorch's actual parameter count agree exactly
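The same formulas can be folded into one function that takes any config dict of this shape, which also makes the effect of tie_word_embeddings visible. This is a sketch under the assumptions used throughout this section (Llama-style layout, no bias terms, head_dim = hidden_size // num_attention_heads); count_params is a hypothetical helper, not a library function.

```python
def count_params(cfg):
    """Parameter count for a bias-free Llama-style model described by cfg."""
    d, ff = cfg["hidden_size"], cfg["intermediate_size"]
    h, kv = cfg["num_attention_heads"], cfg["num_key_value_heads"]
    hd = d // h
    attn = d * h * hd + 2 * d * kv * hd + h * hd * d   # Q + (K, V) + O
    ffn = 3 * d * ff                                   # gate + up + down
    norms = 2 * d                                      # two RMSNorms per block
    embed = cfg["vocab_size"] * d
    total = embed + cfg["num_hidden_layers"] * (attn + ffn + norms) + d  # + final norm
    if not cfg.get("tie_word_embeddings", False):
        total += embed                                 # a separate lm_head matrix
    return total

smollm2 = {"hidden_size": 576, "intermediate_size": 1536, "num_attention_heads": 9,
           "num_key_value_heads": 3, "num_hidden_layers": 30, "vocab_size": 49152,
           "tie_word_embeddings": True}

print(f"tied:   {count_params(smollm2):,}")                                    # 134,515,008
print(f"untied: {count_params({**smollm2, 'tie_word_embeddings': False}):,}")  # 162,826,560
```

For a model this small, untying the embeddings would add about 28M parameters, roughly a fifth of the total.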
Where the parameters go:
import matplotlib.pyplot as plt

# Totals per component (the single final RMSNorm, 576 weights, is left out of the chart)
emb_s = V * D
attn_s = attn_p * L
ffn_s = ffn_p * L
norm_s = norm_p * L
labels = [f'Embedding\n{emb_s/1e6:.1f}M', f'Attention×{L}\n{attn_s/1e6:.1f}M',
          f'FFN×{L}\n{ffn_s/1e6:.1f}M', f'RMSNorm×{L}\n{norm_s/1e6:.2f}M']
sizes = [emb_s, attn_s, ffn_s, norm_s]
plt.figure(figsize=(6, 5))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90,
        colors=['#5DADE2', '#F5B041', '#E74C3C', '#58D68D'])
plt.title('SmolLM2-135M parameter distribution')
plt.tight_layout()
plt.show()
[Figure: pie chart of the SmolLM2-135M parameter distribution across Embedding, Attention×30, FFN×30, and RMSNorm×30; FFN is the largest slice at about 59%, followed by Embedding at about 21% and Attention at about 20%.]
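3. tokenizer_config.json: controlling how text is "translated" into numbers
config.json only covers the inside of the model. Before text enters the model and after it leaves, tokenizer_config.json and tokenizer.json take over. tokenizer_config.json defines behavioral parameters such as which special tokens exist, the maximum sequence length, and whether to add BOS/EOS.
Below, a trimmed-down, illustrative tokenizer config based on SmolLM2's shows what these parameters actually affect.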
import json

tokenizer_config = {
    "add_bos_token": True,
    "add_eos_token": True,
    "bos_token": "<|im_start|>",
    "eos_token": "<|im_end|>",
    "pad_token": "<|im_end|>",
    "model_max_length": 8192,
    "truncation_side": "right",
    "padding_side": "right",
}
print("tokenizer_config.json (trimmed):")
print(json.dumps(tokenizer_config, indent=2, ensure_ascii=False))

tokenizer_config.json (trimmed):
{
  "add_bos_token": true,
  "add_eos_token": true,
  "bos_token": "<|im_start|>",
  "eos_token": "<|im_end|>",
  "pad_token": "<|im_end|>",
  "model_max_length": 8192,
  "truncation_side": "right",
  "padding_side": "right"
}
The parameters in tokenizer_config directly affect the result of encode. We simulate encode with a tiny vocabulary to show how BOS/EOS insertion and padding work:
# A tiny stand-in vocabulary (SmolLM2's real vocabulary has 49152 tokens; 7 are enough here)
mini_vocab = {"<|im_start|>": 0, "<|im_end|>": 1, "我": 2, "爱": 3, "机器": 4, "学习": 5, "。": 6}
UNK_ID = -1  # marker for text that is not in mini_vocab

def simulate_encode(text, cfg):
    """Simulate a tokenizer's encode: split into tokens, then add special tokens"""
    # Minimal tokenization: greedy longest match against mini_vocab
    # (a real tokenizer uses BPE, but the special-token handling follows the same pattern)
    max_tok_len = max(len(t) for t in mini_vocab)
    ids, i = [], 0
    while i < len(text):
        for n in range(min(max_tok_len, len(text) - i), 0, -1):
            if text[i:i + n] in mini_vocab:
                ids.append(mini_vocab[text[i:i + n]])
                i += n
                break
        else:  # nothing in the vocabulary starts here: emit UNK_ID for one character
            ids.append(UNK_ID)
            i += 1
    if cfg.get("add_bos_token"):
        ids = [mini_vocab[cfg["bos_token"]]] + ids
    if cfg.get("add_eos_token"):
        ids = ids + [mini_vocab[cfg["eos_token"]]]
    return ids

# encode results under different settings
text = "我爱机器学习"  # "I love machine learning"
cfg_with_both = {"add_bos_token": True, "add_eos_token": True, "bos_token": "<|im_start|>", "eos_token": "<|im_end|>"}
cfg_without = {"add_bos_token": False, "add_eos_token": False}
cfg_bos_only = {"add_bos_token": True, "add_eos_token": False, "bos_token": "<|im_start|>", "eos_token": "<|im_end|>"}
print(f"Text: '{text}'\n")
print(f"add_bos=True, add_eos=True: {simulate_encode(text, cfg_with_both)}")
print(f"add_bos=False, add_eos=False: {simulate_encode(text, cfg_without)}")
print(f"add_bos=True, add_eos=False: {simulate_encode(text, cfg_bos_only)}")
print("\nKey observation: BOS/EOS insertion is controlled by tokenizer_config, not decided by the model itself.")

Text: '我爱机器学习'

add_bos=True, add_eos=True: [0, 2, 3, 4, 5, 1]
add_bos=False, add_eos=False: [2, 3, 4, 5]
add_bos=True, add_eos=False: [0, 2, 3, 4, 5]

Key observation: BOS/EOS insertion is controlled by tokenizer_config, not decided by the model itself.

# Padding demo: aligning sentences of different lengths
# ("I love machine learning", "cat", "deep learning is fun"); -1 marks out-of-vocabulary characters
sentences = ["我爱机器学习", "猫", "深度学习很有意思"]
encoded = [simulate_encode(s, cfg_with_both) for s in sentences]
max_len = max(len(e) for e in encoded)
pad_id = mini_vocab[tokenizer_config["pad_token"]]  # 1, the same id as <|im_end|>
print(f"Padding demo (pad_token_id = {pad_id} = {tokenizer_config['pad_token']}):\n")
for s, ids in zip(sentences, encoded):
    pad_len = max_len - len(ids)
    padded = ids + [pad_id] * pad_len  # fill on the right with pad_token_id
    print(f"  '{s}': {ids} → padded: {padded}")
print(f"\npadding_side = '{tokenizer_config['padding_side']}' → pad tokens are appended on the right")
print(f"model_max_length = {tokenizer_config['model_max_length']} → longer inputs are cut to this length when truncation is enabled")

Padding demo (pad_token_id = 1 = <|im_end|>):

  '我爱机器学习': [0, 2, 3, 4, 5, 1] → padded: [0, 2, 3, 4, 5, 1, 1, 1, 1]
  '猫': [0, -1, 1] → padded: [0, -1, 1, 1, 1, 1, 1, 1, 1]
  '深度学习很有意思': [0, -1, -1, 5, -1, -1, -1, -1, 1] → padded: [0, -1, -1, 5, -1, -1, -1, -1, 1]

padding_side = 'right' → pad tokens are appended on the right
model_max_length = 8192 → longer inputs are cut to this length when truncation is enabled
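The truncation_side and padding_side fields can be illustrated in the same spirit. The sketch below is a simplified stand-in, not the HuggingFace implementation: pad_and_truncate is a hypothetical helper, it truncates after special tokens have been added (so right-side truncation can drop the EOS id), and it also returns an attention mask because the pad id here is the same as the EOS id and cannot be told apart by value alone.

```python
def pad_and_truncate(ids, max_length, pad_id,
                     padding_side="right", truncation_side="right"):
    """Force ids to exactly max_length; return (ids, attention_mask)."""
    # Truncation: drop tokens from the configured side when the input is too long.
    if len(ids) > max_length:
        ids = ids[:max_length] if truncation_side == "right" else ids[-max_length:]
    # Padding: fill the configured side with pad_id when the input is too short.
    pad = [pad_id] * (max_length - len(ids))
    real = [1] * len(ids)     # 1 marks a real token
    fake = [0] * len(pad)     # 0 marks padding
    if padding_side == "right":
        return ids + pad, real + fake
    return pad + ids, fake + real

ids = [0, 2, 3, 4, 5, 1]  # <|im_start|> ... <|im_end|>, 6 tokens

print(pad_and_truncate(ids, 8, pad_id=1))
# ([0, 2, 3, 4, 5, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0])
print(pad_and_truncate(ids, 8, pad_id=1, padding_side="left"))
# ([1, 1, 0, 2, 3, 4, 5, 1], [0, 0, 1, 1, 1, 1, 1, 1])
print(pad_and_truncate(ids, 4, pad_id=1))
# ([0, 2, 3, 4], [1, 1, 1, 1])
print(pad_and_truncate(ids, 4, pad_id=1, truncation_side="left"))
# ([3, 4, 5, 1], [1, 1, 1, 1])
```

In the right-padded result the final three ids are all 1, yet the mask shows that only the first of them is the real end-of-sequence token.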