diff --git a/assets/image-20240425205340177.png b/assets/image-20240425205340177.png
new file mode 100644
index 0000000..1aaf766
Binary files /dev/null and b/assets/image-20240425205340177.png differ
diff --git a/assets/image-20240425205432896.png b/assets/image-20240425205432896.png
new file mode 100644
index 0000000..0e933a9
Binary files /dev/null and b/assets/image-20240425205432896.png differ
diff --git a/assets/image-20240425210044737.png b/assets/image-20240425210044737.png
new file mode 100644
index 0000000..d032f1b
Binary files /dev/null and b/assets/image-20240425210044737.png differ
diff --git a/人人都能看懂的Transformer/第二章——文字向量化.md b/人人都能看懂的Transformer/第二章——文字向量化.md
new file mode 100644
index 0000000..f55c2ea
--- /dev/null
+++ b/人人都能看懂的Transformer/第二章——文字向量化.md
@@ -0,0 +1,49 @@
+# Chapter 2: Text Vectorization
+
+
+
+Before being passed to the Transformer, text first goes through a tokenizer, which splits the raw text into tokens; these tokens correspond to indices in the model's vocabulary. The indices are then assembled into an input sequence the model can process.
+
+In other words, the "LLM with me" above is converted into 4 indices. The code is as follows (GPT-2 is used for the examples below because it is open source):
+
+~~~python
+from transformers import GPT2Tokenizer, GPT2Model
+
+# Initialize the tokenizer and model
+tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
+model = GPT2Model.from_pretrained('gpt2')
+# The text to vectorize
+text = "LLM with me"
+# Tokenize and convert to indices
+inputs = tokenizer(text, return_tensors="pt")
+# Print the tokens' indices
+print(inputs)
+"""out: {'input_ids': tensor([[3069, 44, 351, 502]]), 'attention_mask': tensor([[1, 1, 1, 1]])}
+"""
+~~~
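Conceptually, the tokenizer is just a two-way mapping between text pieces and vocabulary indices. The sketch below mimics that mapping with a hand-built four-entry vocabulary; the entries and ids are illustrative, chosen to mirror the output above, not pulled from GPT-2's actual vocabulary file. (In GPT-2's byte-level BPE, a leading "Ġ" marks a token that begins with a space.)

```python
# Hypothetical mini-vocabulary; the ids mirror the GPT-2 output above.
# "Ġ" is GPT-2's marker for a token that starts with a space.
vocab = {"LL": 3069, "M": 44, "Ġwith": 351, "Ġme": 502}
id_to_token = {i: t for t, i in vocab.items()}

tokens = ["LL", "M", "Ġwith", "Ġme"]       # result of splitting "LLM with me"
input_ids = [vocab[t] for t in tokens]     # encode: token -> index
print(input_ids)                           # [3069, 44, 351, 502]

# decode: index -> token, then restore the spaces marked by "Ġ"
decoded = "".join(id_to_token[i] for i in input_ids).replace("Ġ", " ")
print(decoded)                             # LLM with me
```

A real tokenizer does the same lookup, just against a vocabulary of tens of thousands of entries.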
+
+
+
+We can see that "LLM with me" is clearly only 3 words, yet the output contains 4 indices. Let's break the text apart and look:
+
+~~~python
+text = "LLM"
+inputs = tokenizer(text, return_tensors="pt")
+print(inputs)
+
+# Look up the tokens corresponding to the indices
+print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]))
+
+"""out:
+{'input_ids': tensor([[3069, 44]]), 'attention_mask': tensor([[1, 1]])}
+['LL', 'M']
+"""
+~~~
+
+
+
+From the code above, we can see that the text "LLM" was split into two tokens, "LL" and "M".
+
+Why was it split in two? Even English has hundreds of thousands of words or more, and most of them can be built from subwords; for example, "input" can be composed of "in" and "put". If every distinct word got its own vocabulary entry, it would be enormously wasteful, and in the end every token might have to be dot-producted against hundreds of thousands of vectors. To use resources efficiently and avoid an explosion in data dimensionality, we limit the vocabulary size; GPT2Tokenizer's vocabulary, for example, contains 50257 entries. The code is as follows:
+
+
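The idea that "input" can be assembled from "in" and "put" can be sketched with a greedy longest-match segmenter over a toy vocabulary. This is illustrative only; GPT-2 actually uses byte-pair encoding, which builds tokens by merging learned character pairs, not longest-match lookup.

```python
def segment(word, vocab):
    """Greedy longest-match subword segmentation (illustrative only)."""
    pieces = []
    i = 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

toy_vocab = {"in", "put", "out"}
print(segment("input", toy_vocab))   # ['in', 'put']
print(segment("output", toy_vocab))  # ['out', 'put']
```

With a small set of reusable subwords, many surface words never need their own vocabulary entry, which is exactly why "LLM" above came out as "LL" + "M".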
\ No newline at end of file