[Interview Essentials] Self-Attention, Multi-Head Attention, and Cross Attention: Explanation and From-Scratch Implementation

2026-02-11

1. Scaled Dot-Product Attention

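My own way of understanding it: each token's query is used to look up the key of every token. Each entry of the resulting product matrix measures how strongly two tokens are related, that is, the size of the attention between them. After softmax, these entries serve as weights in a weighted sum over the values, which gives the final output. The more strongly a token is related, the higher its weight, and the more its value influences the final output.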

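The next step is the much-discussed question of why $QK^{\top}$ is divided by $\sqrt{d_k}$.

Assume the components $Q_i,\ K_i$ are independent, with mean $0$ and variance $1$. The dot product is $\sum\limits_{i=1}^{d_k} Q_i K_i$. Its mean is still $0$, but its variance becomes $d_k$: each product $Q_i K_i$ has variance $1$, and the variance of a sum of independent random variables is the sum of their variances. Scores of that magnitude make softmax very "peaked". When the largest score is much bigger than the second largest, its weight after softmax approaches $1$ while the other weights approach $0$, which leads to vanishing gradients or unstable training. Dividing by $\sqrt{d_k}$ rescales the scores and brings the variance back to $1$.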
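As a quick sanity check of the variance argument, the sketch below samples random query/key vectors and compares the empirical variance of the raw and the scaled dot products. The Gaussian distribution, the fixed seed, the sample count, and the helper name `dot_product_variance` are assumptions made for illustration; only the Python standard library is used.

```python
import math
import random
import statistics

def dot_product_variance(d_k, num_samples=5000, seed=0):
    """Empirical variance of q.k and of q.k / sqrt(d_k) for random q, k."""
    rng = random.Random(seed)
    raw = []
    for _ in range(num_samples):
        # Components are i.i.d. with mean 0 and variance 1 (assumed Gaussian here).
        dot = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(d_k))
        raw.append(dot)
    scaled = [s / math.sqrt(d_k) for s in raw]
    return statistics.pvariance(raw), statistics.pvariance(scaled)

for d_k in (4, 16, 64):
    var_raw, var_scaled = dot_product_variance(d_k)
    print(f"d_k={d_k:3d}  var(raw)={var_raw:7.2f}  var(scaled)={var_scaled:5.2f}")
```

Under these assumptions the raw variance should land near $d_k$ and the scaled variance near $1$, up to sampling noise.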

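After the scaling, a mask can optionally be applied, so that positions that must not be attended to are ignored (typically by setting their scores to $-\infty$ before the softmax).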
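A minimal sketch of the masking step, assuming the common convention of writing $-\infty$ into disallowed positions before softmax, with a causal (lower-triangular) mask as the example. The helper names `softmax`, `causal_mask`, and `apply_mask` and the score values are illustrative and not taken from the original post.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(n):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def apply_mask(scores, mask):
    """Replace disallowed scores with -inf so softmax gives them zero weight."""
    return [[s if ok else float("-inf") for s, ok in zip(s_row, m_row)]
            for s_row, m_row in zip(scores, mask)]

scores = [[0.5, 1.0, -0.3],
          [0.2, 0.1, 0.9],
          [1.5, -1.0, 0.4]]

# Every row keeps at least its diagonal entry, so no row is entirely -inf.
weights = [softmax(row) for row in apply_mask(scores, causal_mask(3))]
for row in weights:
    print([round(w, 3) for w in row])
```

Masked positions end up with weight exactly zero, and each row still sums to one over the positions that remain visible.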

Next, softmax converts each element of $QK^{\top}/\sqrt{d_k}$ into a real number between $0$ and $1$, with the elements of each row summing to $1$.
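The softmax step can be sketched in a few lines. The example also applies it to the same scores at two magnitudes, to show how larger scores sharpen the distribution, which is the effect the $\sqrt{d_k}$ scaling is meant to limit. The score values are made up, the factor of 8 is chosen to match $\sqrt{64}$, and `softmax` is a hand-written helper rather than a library call.

```python
import math

def softmax(row):
    """Map a list of real scores to positive weights that sum to 1."""
    m = max(row)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]                # moderate scores
large = [8.0 * s for s in small]       # same ordering, 8x the magnitude

w_small = softmax(small)
w_large = softmax(large)
print([round(w, 4) for w in w_small])
print([round(w, 4) for w in w_large])
```

With the moderate scores every position keeps a noticeable share of the weight; with the larger scores almost all of it goes to the maximum.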

Finally, the weights are multiplied by $V$ and summed to produce the final $Output$.

$\mathrm{Attention}(Q, K, V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$
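The formula translates almost line by line into code. The sketch below is a dependency-free version for a single sequence, using nested lists instead of tensors. The helper names (`matmul`, `transpose`, `softmax`, `attention`) and the toy matrices are assumptions chosen for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for a single sequence."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy example: 2 queries, 3 key/value pairs, d_k = 2, d_v = 2 (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

out, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
for row in out:
    print([round(x, 3) for x in row])
```

The first query matches the first and third keys equally well, so those two positions receive equal weight and the second key receives less.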

2. Self-Attention

When $Q,\ K,\ V$ are all derived from the same input, each by multiplying it with its own weight matrix $w_q,\ w_k,\ w_v$, the mechanism is Self-Attention.

Code implementation:

import math

import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    def __init__(self, input_dim, dim_qk, dim_v):
        super().__init__()
        # Linear projections that map the input to queries, keys, and values
        self.q = nn.Linear(input_dim, dim_qk)
        self.k = nn.Linear(input_dim, dim_qk)
        self.v = nn.Linear(input_dim, dim_v)
        # Scaling factor sqrt(d_k) applied to the raw scores
        self.scale = math.sqrt(dim_qk)

    def forward(self, x):
        # x: (batch, seq_len, input_dim)
        q = self.q(x)
        k = self.k(x)
        v = self.v(x)

        # scores: (batch, seq_len, seq_len)
        scores = torch.bmm(q, k.transpose(1, 2)) / self.scale
        weights = torch.softmax(scores, dim=-1)
        # out: (batch, seq_len, dim_v)
        out = torch.bmm(weights, v)

        return out

#--------------------------------------------------

input_dim = 128
dim_qk = 64
dim_v = 64

attn = SelfAttention(input_dim, dim_qk, dim_v)

batch_size = 2
seq_len = 10

x = torch.randn(batch_size, seq_len, input_dim)

out = attn(x)

print("input_shape: ", x.shape)
print("output_shape: ", out.shape)
print(out)

3. Multi-Head Self-Attention (Multi-Head Attention, MHA)

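Multi-head attention runs attention several times over the same $Q,\ K,\ V$, producing a different output each time, and concatenates those outputs into the final output. This lets the model jointly attend to information at different positions and in different representation subspaces. In other words, each head's output reflects relevance judged from a different angle.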
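The split-and-concatenate bookkeeping can be sketched without any tensor library. The version below handles a single sequence stored as nested lists and assumes each head takes a contiguous slice of the feature dimension. The helper names `split_heads` and `merge_heads` are hypothetical.

```python
def split_heads(x, num_heads):
    """[seq][dim] -> [num_heads][seq][head_dim], slicing the feature dimension."""
    dim = len(x[0])
    assert dim % num_heads == 0, "dim must be divisible by num_heads"
    head_dim = dim // num_heads
    return [[row[h * head_dim:(h + 1) * head_dim] for row in x]
            for h in range(num_heads)]

def merge_heads(heads):
    """[num_heads][seq][head_dim] -> [seq][dim], concatenating per position."""
    seq_len = len(heads[0])
    return [[value for head in heads for value in head[t]] for t in range(seq_len)]

# Toy input: 2 positions, 4 features, split into 2 heads of size 2.
x = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
heads = split_heads(x, num_heads=2)
print(heads)               # each head sees its own slice of every position
print(merge_heads(heads))  # concatenation restores the original layout
```

In a real layer, attention would run independently on each element of `heads` before the merge, so splitting changes which features interact inside a head but not the overall size of the output.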

Code implementation:

import math

import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, input_dim, num_heads, dim_qk, dim_v):
        super().__init__()
        # Each head gets an equal slice of the q/k and v dimensions
        assert dim_qk % num_heads == 0 and dim_v % num_heads == 0
        self.num_heads = num_heads
        self.head_dim_qk = dim_qk // num_heads
        self.head_dim_v = dim_v // num_heads

        self.q = nn.Linear(input_dim, dim_qk)
        self.k = nn.Linear(input_dim, dim_qk)
        self.v = nn.Linear(input_dim, dim_v)

        # Scale by the per-head key dimension
        self.scale = math.sqrt(self.head_dim_qk)

        # Output projection applied to the concatenated heads
        self.out = nn.Linear(dim_v, input_dim)

    def forward(self, x):
        batch, seq = x.shape[:2]

        q = self.q(x)
        k = self.k(x)
        v = self.v(x)

        # Split into heads: (batch, num_heads, seq, head_dim)
        q = q.view(batch, seq, self.num_heads, self.head_dim_qk).transpose(1, 2)
        k = k.view(batch, seq, self.num_heads, self.head_dim_qk).transpose(1, 2)
        v = v.view(batch, seq, self.num_heads, self.head_dim_v).transpose(1, 2)

        # scores: (batch, num_heads, seq, seq)
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.scale
        weights = torch.softmax(scores, dim=-1)

        out = torch.matmul(weights, v)
        # Concatenate the heads back into (batch, seq, dim_v)
        out = out.transpose(1, 2).contiguous().view(batch, seq, -1)
        out = self.out(out)

        return out

#--------------------------------------------------

# Make sure the q/k/v dimensions are divisible by the number of heads
input_dim = 512
num_heads = 8
dim_qk = 512
dim_v = 512

mha = MultiHeadAttention(input_dim, num_heads, dim_qk, dim_v)

batch_size = 2
seq_len = 10

x = torch.randn(batch_size, seq_len, input_dim)

out = mha(x)

print("input_shape: ", x.shape)
print("output_shape: ", out.shape)
print(out)

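4. Cross Attention

Cross Attention means "using one sequence to query the content of another sequence". The difference from Self-Attention is that $Q$ comes from one side and $K/V$ come from the other, whereas in Self-Attention $Q/K/V$ all come from the same input.

The decoder first turns the state at its current position into a question vector Q (Query). You can read it as: what information am I looking for right now?

The encoder turns the representation of each source token into an index vector K (Key) and a content vector V (Value). K is used to match "how similar is this to my question", while V is like the body text, the information that actually gets copied back.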
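To see how the two sequence lengths interact, here is a small sketch that skips the learned projections and feeds ready-made query, key, and value vectors straight into the attention step. The sizes, the random inputs, the fixed seed, and the helper names `cross_attention` and `rand_matrix` are all assumptions for illustration.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Queries come from the decoder side; keys and values from the encoder side."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)          # one weight per source token
        # Weighted sum of the value vectors: the content that gets copied back.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

rng = random.Random(0)                     # fixed seed, arbitrary choice
tgt_len, src_len, d_k, d_v = 2, 4, 3, 5    # made-up sizes

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

decoder_q = rand_matrix(tgt_len, d_k)
encoder_k = rand_matrix(src_len, d_k)
encoder_v = rand_matrix(src_len, d_v)

outputs, weights = cross_attention(decoder_q, encoder_k, encoder_v)
print("weights:", len(weights), "x", len(weights[0]))
print("outputs:", len(outputs), "x", len(outputs[0]))
```

The weight matrix has one row per decoder position and one column per source token, and the output has one row per decoder position, so the output length follows the query side rather than the encoder side.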

Code implementation:

import math

import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    def __init__(self, input_dim, dim_qk, dim_v):
        super().__init__()
        # Queries are projected from the decoder side, keys/values from the encoder side
        self.q = nn.Linear(input_dim, dim_qk)
        self.k = nn.Linear(input_dim, dim_qk)
        self.v = nn.Linear(input_dim, dim_v)

        self.scale = math.sqrt(dim_qk)

    def forward(self, encoder_input, decoder_input):
        # encoder_input: (batch, src_len, input_dim)
        # decoder_input: (batch, tgt_len, input_dim)
        q = self.q(decoder_input)
        k = self.k(encoder_input)
        v = self.v(encoder_input)

        # scores: (batch, tgt_len, src_len)
        scores = torch.bmm(q, k.transpose(1, 2)) / self.scale
        weights = torch.softmax(scores, dim=-1)
        # out: (batch, tgt_len, dim_v)
        out = torch.bmm(weights, v)

        return out

#--------------------------------------------------

input_dim = 128
dim_qk = 64
dim_v = 64

cross_attn = CrossAttention(input_dim, dim_qk, dim_v)

batch_size = 2
src_len = 10
tgt_len = 8

encoder_output = torch.randn(batch_size, src_len, input_dim)
decoder_input = torch.randn(batch_size, tgt_len, input_dim)

out = cross_attn(encoder_output, decoder_input)

print("encoder_output_shape: ", encoder_output.shape)
print("decoder_input_shape: ", decoder_input.shape)
print("output_shape: ", out.shape)
print(out)