Add. GPT-2's positional encoding method

1 year ago · 91256041f9
parent 29136ec781
commit 91256041f9
3 changed files with 59 additions and 0 deletions
--- a/assets/image-20240427181354113.png
+++ b/assets/image-20240427181354113.png
--- a/assets/image-20240427181415430.png
+++ b/assets/image-20240427181415430.png
--- a/人人都能看懂的Transformer/第三章——位置编码.md
+++ b/人人都能看懂的Transformer/第三章——位置编码.md
@@ -111,6 +111,65 @@ We chose this function because we hypothesized it would allow the model to easil

 ### GPT-2's positional encoding method

+~~~python
+import tensorflow as tf
+
+class HParams:
+    def __init__(self, **kwargs):
+        self.__dict__.update(kwargs)
+
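+def positions_for(tokens, past_length):
+    # Build absolute position ids for every token, offset by the length of any cached context.
+    batch_size = tf.shape(tokens)[0]
+    nsteps = tf.shape(tokens)[1]
+    position_ids = past_length + tf.range(nsteps)
+    return tf.tile(tf.expand_dims(position_ids, 0), [batch_size, 1])  # shape [batch_size, nsteps]
+
+def position_embedding(hparams, position_ids):
+    # Learned position table with one row per position. This demo creates it on every call;
+    # a real model would create the variable once and reuse it.
+    wpe = tf.Variable(tf.random.normal([hparams.n_ctx, hparams.n_embd], stddev=0.01), name='wpe')
+    position_embeddings = tf.gather(wpe, position_ids)  # plain row lookup by index
+    return position_embeddings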
+
+# Hyperparameters for the model
+hparams = HParams(
+    n_vocab=0,
+    n_ctx=1024,
+    n_embd=768,
+    n_head=12,
+    n_layer=12,
+)
+
+input_tokens = tf.constant([[0, 1, 2, 3]], dtype=tf.int32)
+past_length = tf.constant(0)  # Assuming no past context
+
+position_ids = positions_for(input_tokens, past_length)
+position_embeddings = position_embedding(hparams, position_ids)
+print(position_embeddings)
+"""out:
+tf.Tensor(
+[[[ 0.01702908  0.00268412  0.01296544 ...  0.00706888  0.00186165
+    0.01521429]
+  [ 0.00431    -0.01150406  0.01421692 ... -0.00568195  0.00935402
+    0.01863918]
+  [-0.00091886 -0.00914316 -0.0180154  ...  0.00033014  0.00344726
+    0.01064758]
+  [ 0.00253335 -0.01882706  0.00029727 ...  0.0026667  -0.00202818
+   -0.00463023]]], shape=(1, 4, 768), dtype=float32)
+"""
+~~~
+
+<img src="../assets/image-20240427181354113.png" alt="image-20240427181354113" style="zoom:50%;" />
+
+<img src="../assets/image-20240427181415430.png" alt="image-20240427181415430" style="zoom:50%;" />
+
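+As you can see, GPT-2 simply treats the positional encoding as trainable parameters and updates them through ordinary model training. See the [source code](https://github.com/openai/gpt-2/blob/9b63575ef42771a015060c964af2c3da4cf7c8ab/src/model.py) for details. The advantages of both approaches were covered earlier, and this choice is part of why people describe GPT as "brute-force aesthetics". This is not the only place it shows up; later sections give further evidence of the same approach.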
+
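+The process can be understood as follows:
+
+1. **Initialize the position embeddings**: When the model is initialized, it creates a position embedding matrix. Each row corresponds to one position in the sequence and is a vector whose length equals the model's embedding dimension. The matrix is randomly initialized, just like other neural network weights.
+2. **Look up the position embeddings**: During the forward pass, for each position in the input sequence the model looks up the corresponding embedding vector in the position embedding matrix. This is usually a simple indexing operation rather than a matrix product or another more complex neural network computation.
+3. **Combine position and token embeddings**: The position embedding vector is added to the corresponding token embedding vector, which gives each token information about its position in the sequence. This is plain element-wise vector addition.
+4. **Train the position embeddings**: During training, backpropagation adjusts the values in the position embedding matrix to minimize the prediction error. The position embeddings are therefore optimized according to how the model performs on the training data.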
+
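The four steps above can be sketched end to end in plain Python, without TensorFlow. This is a minimal illustration under stated assumptions, not GPT-2's actual code: the toy sizes, the helper names `init_table`, `embed`, and `sgd_step`, and the squared-error loss against arbitrary targets are all made up for the example. A real model would use its own training loss and update every weight, not just the position table.

```python
import random

random.seed(0)

# Toy sizes chosen for readability (assumption); the demo above uses n_ctx=1024, n_embd=768.
N_CTX, N_EMBD, N_VOCAB = 8, 4, 10

def init_table(rows, cols, stddev=0.01):
    """Step 1: a randomly initialised embedding table, one row per index."""
    return [[random.gauss(0.0, stddev) for _ in range(cols)] for _ in range(rows)]

wte = init_table(N_VOCAB, N_EMBD)  # token embedding table
wpe = init_table(N_CTX, N_EMBD)    # position embedding table

def embed(tokens, past_length=0):
    """Steps 2 and 3: look up rows by index, then add them element-wise."""
    out = []
    for i, tok in enumerate(tokens):
        pos = past_length + i
        out.append([t + p for t, p in zip(wte[tok], wpe[pos])])
    return out

def sgd_step(tokens, targets, lr=0.1):
    """Step 4: one gradient step on a toy squared-error loss, updating only wpe.

    With h = wte[tok] + wpe[pos] and loss = sum((h - target) ** 2),
    the gradient with respect to wpe[pos] is 2 * (h - target).
    Returns the loss measured before the update.
    """
    hidden = embed(tokens)
    loss = 0.0
    for pos, (h, target) in enumerate(zip(hidden, targets)):
        for d in range(N_EMBD):
            err = h[d] - target[d]
            loss += err * err
            wpe[pos][d] -= lr * 2.0 * err
    return loss

tokens = [3, 1, 4, 1]
targets = [[0.5] * N_EMBD for _ in tokens]  # arbitrary targets, for illustration only

loss_before = sgd_step(tokens, targets)
loss_after = sgd_step(tokens, targets)
print(loss_before, loss_after)
```

Token `1` appears at positions 1 and 3: it reads the same row of `wte` but different rows of `wpe`, so the two occurrences end up with different combined vectors. The second loss is lower than the first because the position rows that were used have moved toward the targets, while the unused rows of `wpe` stay at their initial values.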


 ### 向量加法