Create 第六章——数值缩放.md

1 year ago · 2038d0b2a6
parent 2ebbad7d35
commit 2038d0b2a6
1 changed files with 50 additions and 0 deletions
--- a/人人都能看懂的Transformer/第六章——数值缩放.md
+++ b/人人都能看懂的Transformer/第六章——数值缩放.md
@@ -0,0 +1,50 @@
+# Chapter 6: Numerical Scaling
+
+<img src="../assets/image-20240424171227926.png" alt="Numerical scaling" style="zoom:50%;" />
+
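+### Preface
+
+Multi-head attention has now produced the matrix A. The next step is a matrix addition followed by layer normalization.
+
+The basic formula is: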
+$$
+\text{Output} = \text{LayerNorm}(\text{Input} + x)
+$$
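+`x` is the input to the previous step, and `Input` is the output of multi-head attention. As we saw earlier, the final multi-head attention output has the same dimensions as the original matrix X, so the two can be added position by position.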
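The formula above can be sketched end to end in NumPy. This is a minimal sketch, not the chapter's own implementation: the `layer_norm` helper, the `eps` value, the 3x4 shapes, and the random values are all assumptions made for illustration, and the learnable scale and shift that layer normalization usually carries are left out.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    # Hypothetical helper: normalize each row (one token) across its features
    # to zero mean and roughly unit variance. No learnable scale or shift.
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.random((3, 4))      # input to the attention step (made-up values)
Input = rng.random((3, 4))  # attention output, same shape as x (made-up values)

# Output = LayerNorm(Input + x)
Output = layer_norm(Input + x)

print(Output.shape)          # same shape as x and Input
print(Output.mean(axis=-1))  # each row's mean is approximately 0
```

Because `Input` and `x` share a shape, the sum and the normalized result keep that shape too.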
+
+
+
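+### Residual Connection
+
+Adding `Input` and `x` is a plain element-wise sum, the same kind of addition used for positional encoding: `Input[i][j] + x[i][j]`, where `i` and `j` index a position along each of the two dimensions.
+
+A simulation in code: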
+
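+~~~python
+import numpy as np
+
+# Fix the seed so the random matrices are reproducible
+np.random.seed(0)
+# Input stands in for the multi-head attention output
+Input = np.random.rand(3, 3)
+# x stands in for the input to the attention step
+x = np.random.rand(3, 3)
+# Residual connection: element-wise addition of the two matrices
+residual_output = Input + x
+
+print("Input:")
+print(Input)
+print("\nx:")
+print(x)
+print("\nResidual Output (Input + x):")
+print(residual_output)
+~~~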
+
+<img src="../assets/image-20240503154451656.png" alt="image-20240503154451656" style="zoom:50%;" />
+
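+As you can see, the result is the sum of the elements at the same position.
+
+In plain terms:
+
+A residual connection behaves a lot like a person. Psychologically, everyone relies on paths that have worked before. If memorizing formulas got you a high score on your last exam, you will memorize formulas again before the next one. Likewise, if you found that listening to music while exercising helps you keep going longer, you will put music on again the next time you work out. In the same way, the residual connection keeps what it already had, the original `x`, and adds the new attention result on top of it.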
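To make the "same position" claim concrete, the short check below compares NumPy's `+` against an explicit double loop over `i` and `j`, and then tries an addition with mismatched shapes. It is only an illustrative sketch: the matrices are random and the 2x3 shape is an arbitrary choice for the failing case.

```python
import numpy as np

rng = np.random.default_rng(0)
Input = rng.random((3, 3))
x = rng.random((3, 3))

# Explicit loop: add the elements at each position (i, j) one by one
manual = np.empty_like(Input)
for i in range(Input.shape[0]):
    for j in range(Input.shape[1]):
        manual[i][j] = Input[i][j] + x[i][j]

# The vectorized sum gives exactly the same matrix
same = np.array_equal(manual, Input + x)
print(same)

# Shapes that cannot be aligned are rejected, which is why the attention
# output must have the same dimensions as x
try:
    Input + rng.random((2, 3))
    shape_error = False
except ValueError:
    shape_error = True
print(shape_error)
```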
+
+
+
+### Layer Normalization
+