Update. Add actual code

master
ben.guo 1 year ago
parent 5d5b429869
commit 64bd461b02

Binary image file changed (not shown), size 83 KiB.

@ -48,4 +48,17 @@ print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]))
Why was it split into two tokens? Even English has hundreds of thousands of words or more, and most words can be built from subwords; for example, "input" can be composed of "in" and "put". If every distinct word had its own vocabulary entry, it would be extremely wasteful, and each word might have to take dot products with hundreds of thousands of vectors. To improve resource utilization and avoid a dimensional explosion, we limit the vocabulary size; for example, GPT2Tokenizer has a vocabulary of 50257 tokens. The code is as follows:
<img src="../assets/image-20240425210044737.png" alt="image-20240425210044737" style="zoom:50%;" />
~~~python
from transformers import GPT2Tokenizer
# Initialize the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# Get the size of the vocabulary
vocab_size = len(tokenizer)
print(f"The vocabulary size of GPT2Tokenizer is: {vocab_size}")
"""out:
The vocabulary size of GPT2Tokenizer is: 50257
"""
~~~
<img src="../assets/image-20240426165150933.png" alt="image-20240426165150933" style="zoom:50%;" />