# Implementing llama3 from scratch
In this file, I implement llama3 from scratch, one tensor and matrix multiplication at a time.
You can run it locally: llama3-from-scratch.ipynb
Also, I load the tensors directly from the model file that meta provides for llama3, so you need to download the weights before running this file.
Here is the official link to download the weights: [click here to download the weights](https://llama.meta.com/llama-downloads/)
Mirrors:
- https://hf-mirror.com/NousResearch/Meta-Llama-3-8B
- https://gitee.com/hf-models/Meta-Llama-3-8B-Instruct/
## Tokenizer
I'm not going to implement a BPE tokenizer (but Andrej Karpathy has a really clean implementation).
Link to his implementation: [click here to check it out](https://github.com/karpathy/minbpe)
```python
%env HF_ENDPOINT = "https://hf-mirror.com"
```
env: HF_ENDPOINT="https://hf-mirror.com"
```python
%pip install blobfile -q
```
Note: you may need to restart the kernel to use updated packages.
```python
from pathlib import Path
import tiktoken
from tiktoken.load import load_tiktoken_bpe
import torch
import json
import matplotlib.pyplot as plt
tokenizer_path = "./tokenizer.model"
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|reserved_special_token_2|>",
    "<|reserved_special_token_3|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|reserved_special_token_4|>",
    "<|eot_id|>",  # end of turn
] + [f"<|reserved_special_token_{i}|>" for i in range(5, 256 - 5)]
mergeable_ranks = load_tiktoken_bpe(tokenizer_path)
tokenizer = tiktoken.Encoding(
    name=Path(tokenizer_path).name,
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=mergeable_ranks,
    special_tokens={token: len(mergeable_ranks) + i for i, token in enumerate(special_tokens)},
)
tokenizer.decode(tokenizer.encode("hello world!"))
```
'hello world!'
## Reading the model file
Normally, reading a model file depends on how the model classes are written and what the variable names inside them are.
But since we are implementing llama3 from scratch, we will read the file one tensor at a time.
You can download the model here: https://gitee.com/hf-models/Meta-Llama-3-8B-Instruct/blob/main/original/consolidated.00.pth
```python
!wget 'https://lfs.gitee.com/api/lfs/storage/projects/34266234/be52262c9289304f3e8240e0749bf257bc04264405a86cd4de38efb9068724ee?Expires=1716626632&Signature=xgDOu9JHNM6ECazR3nA4NQHwXs%2BiG%2BCtnzza6ekSuqs%3D&FileName=consolidated.00.pth'
```
--2024-05-25 16:24:15-- https://lfs.gitee.com/api/lfs/storage/projects/34266234/be52262c9289304f3e8240e0749bf257bc04264405a86cd4de38efb9068724ee?Expires=1716626632&Signature=xgDOu9JHNM6ECazR3nA4NQHwXs%2BiG%2BCtnzza6ekSuqs%3D&FileName=consolidated.00.pth
Resolving lfs.gitee.com (lfs.gitee.com)... 180.76.198.180
Connecting to lfs.gitee.com (lfs.gitee.com)|180.76.198.180|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16060617592 (15G) [application/octet-stream]
Saving to: ‘be52262c9289304f3e8240e0749bf257bc04264405a86cd4de38efb9068724ee?Expires=1716626632&Signature=xgDOu9JHNM6ECazR3nA4NQHwXs+iG+Ctnzza6ekSuqs=&FileName=consolidated.00.pth’
0% [ ] 105,193,134 453KB/s eta 11h 21m^C
On my machine the checkpoint loads in about 12 s. From here on, inference runs on the CPU only; 30 GB of RAM is enough for that, and CPU inference takes roughly 30 s per token. That is a bit slow, but our main goal is to understand how it works.
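If you are also on a CPU-only machine, a minimal variant of the load below (an optional sketch, not the original code; `mmap` needs a fairly recent PyTorch) keeps every tensor on the CPU and memory-maps the checkpoint so it is not all copied into RAM at once:
```python
# map_location="cpu" keeps all tensors on the CPU;
# mmap=True (newer PyTorch) memory-maps the file instead of reading it fully into RAM
model = torch.load("/data1/ckw/consolidated.00.pth", map_location="cpu", mmap=True)
```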
```python
model = torch.load("/data1/ckw/consolidated.00.pth")
print(json.dumps(list(model.keys())[:20], indent=4))
```
[
"tok_embeddings.weight",
"layers.0.attention.wq.weight",
"layers.0.attention.wk.weight",
"layers.0.attention.wv.weight",
"layers.0.attention.wo.weight",
"layers.0.feed_forward.w1.weight",
"layers.0.feed_forward.w3.weight",
"layers.0.feed_forward.w2.weight",
"layers.0.attention_norm.weight",
"layers.0.ffn_norm.weight",
"layers.1.attention.wq.weight",
"layers.1.attention.wk.weight",
"layers.1.attention.wv.weight",
"layers.1.attention.wo.weight",
"layers.1.feed_forward.w1.weight",
"layers.1.feed_forward.w3.weight",
"layers.1.feed_forward.w2.weight",
"layers.1.attention_norm.weight",
"layers.1.ffn_norm.weight",
"layers.2.attention.wq.weight"
]
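Besides the key names, a quick way to sanity-check what we just loaded (an extra sketch, not in the original) is to print a few tensor shapes:
```python
# the token embedding matrix is [vocab_size, dim]
print(model["tok_embeddings.weight"].shape)         # torch.Size([128256, 4096])
# the layer-0 query projection is [n_heads * head_dim, dim]
print(model["layers.0.attention.wq.weight"].shape)  # torch.Size([4096, 4096])
# the key projection is smaller because llama3 uses grouped-query attention (only 8 kv heads)
print(model["layers.0.attention.wk.weight"].shape)  # torch.Size([1024, 4096])
```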
```python
with open("./params.json", "r") as f:
    config = json.load(f)
config
```
{'dim': 4096,
'n_layers': 32,
'n_heads': 32,
'n_kv_heads': 8,
'vocab_size': 128256,
'multiple_of': 1024,
'ffn_dim_multiplier': 1.3,
'norm_eps': 1e-05,
'rope_theta': 500000.0}
## We use this config to infer details about the model, such as:
1. the model has 32 transformer layers
2. each multi-head attention block has 32 heads
3. the vocab size, and so on (a couple of derived values are sketched right after this list)
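For example, a small sketch of two values the config implies (not part of the original notebook; the feed-forward hidden-size formula follows Meta's reference Llama code):
```python
# each of the 32 attention heads works on a 4096 / 32 = 128-dimensional slice
head_dim = config["dim"] // config["n_heads"]  # 128

# hidden size of the SwiGLU feed-forward, per Meta's reference formula
hidden_dim = int(2 * (4 * config["dim"]) / 3)
hidden_dim = int(config["ffn_dim_multiplier"] * hidden_dim)
hidden_dim = config["multiple_of"] * ((hidden_dim + config["multiple_of"] - 1) // config["multiple_of"])
print(head_dim, hidden_dim)  # 128 14336

# this matches the checkpoint: w1 projects from 4096 up to 14336
print(model["layers.0.feed_forward.w1.weight"].shape)  # torch.Size([14336, 4096])
```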
```python
dim = config["dim"]
n_layers = config["n_layers"]
n_heads = config["n_heads"]
n_kv_heads = config["n_kv_heads"]
vocab_size = config["vocab_size"]
multiple_of = config["multiple_of"]
ffn_dim_multiplier = config["ffn_dim_multiplier"]
norm_eps = config["norm_eps"]
rope_theta = torch.tensor(config["rope_theta"])
```
## Converting the text to tokens
Here we use tiktoken (I think an OpenAI library) as the tokenizer.
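The code below prepends the id 128000 by hand: that is the id our tokenizer assigns to `<|begin_of_text|>`, since the 128,000 BPE ranks come first and the special tokens are numbered right after them. A quick check (an extra line, not in the original), using the tokenizer built above:
```python
# special tokens have to be explicitly allowed when encoding them from text
print(tokenizer.encode("<|begin_of_text|>", allowed_special="all"))  # [128000]
```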
```python
prompt = "the answer to the ultimate question of life, the universe, and everything is "
tokens = [128000] + tokenizer.encode(prompt)
print(tokens)
tokens = torch.tensor(tokens)
prompt_split_as_tokens = [tokenizer.decode([token.item()]) for token in tokens]
print(prompt_split_as_tokens)
```
[128000, 1820, 4320, 311, 279, 17139, 3488, 315, 2324, 11, 279, 15861, 11, 323, 4395, 374, 220]
['<|begin_of_text|>', 'the', ' answer', ' to', ' the', ' ultimate', ' question', ' of', ' life', ',', ' the', ' universe', ',', ' and', ' everything', ' is', ' ']
## Converting tokens to their embeddings
This is the only part of the codebase where I use a built-in neural network module.
Anyway, our [17x1] tokens are now [17x4096], i.e. 17 embeddings (one per token) of length 4096.
Note: keep track of the shapes; it makes everything much easier to follow.
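As an aside (a small sketch, not in the original): `nn.Embedding` is just a lookup table here, so the same result can be obtained by indexing the checkpoint's embedding matrix with the token ids directly:
```python
# row i of tok_embeddings.weight is the embedding of token id i,
# so fancy-indexing with the token ids gives the same [17x4096] tensor as nn.Embedding below
manual_embeddings = model["tok_embeddings.weight"][tokens].to(torch.bfloat16)
print(manual_embeddings.shape)  # torch.Size([17, 4096])
```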
```python
embedding_layer = torch.nn.Embedding(vocab_size, dim)
embedding_layer.weight.data.copy_(model["tok_embeddings.weight"])
token_embeddings_unnormalized = embedding_layer(tokens).to(torch.bfloat16)
token_embeddings_unnormalized.shape
```
torch.Size([17, 4096])
## We then normalize the embeddings using RMS normalization
Please note that after this step the shapes do not change; the values are just normalized.
One thing to keep in mind: we need a norm_eps (from the config) because we don't want to accidentally get an RMS of 0 and divide by 0.
Here is the formula: rms_norm(x) = x / sqrt(mean(x²) + norm_eps) * norm_weights
```python
# def rms_norm(tensor, norm_weights):
#     rms = (tensor.pow(2).mean(-1, keepdim=True) + norm_eps)**0.5
#     return tensor * (norm_weights / rms)
def rms_norm(tensor, norm_weights):
    return (tensor * torch.rsqrt(tensor.pow(2).mean(-1, keepdim=True) + norm_eps)) * norm_weights
```
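As a quick sanity check (an extra cell, not in the original): with the norm weights set to all ones, every row of the output should have a root-mean-square of roughly 1:
```python
# each row of the normalized tensor should have RMS ≈ 1 when the weights are all ones
x = torch.randn(4, 4096)
print(rms_norm(x, torch.ones(4096)).pow(2).mean(-1).sqrt())  # ≈ tensor([1., 1., 1., 1.])
```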
# Building the first layer of the transformer
### Normalization
You will see me access layer.0 from the model dict (that is the first layer).
Anyway, after normalizing, the shape is still [17x4096], the same as the embeddings, just normalized.
```python
token_embeddings = rms_norm(token_embeddings_unnormalized, model["layers.0.attention_norm.weight"])
token_embeddings.shape