[AI Fun] From "Focus" to "All-Round Ace": The Attention Mechanism and the Transformer Revolution, 5-18 Pinnacle Edition
Hi, fellow AI explorers!
Remember how teachers always said "pay attention in class, no daydreaming"? It turns out AI needs to learn to "pay attention" too! Today we unpack the most important breakthrough in modern AI: the attention mechanism, and how it gave rise to superstar models like ChatGPT and BERT.
🤔 The RNN "Memory Crisis"
A translator's nightmare
Imagine you need to translate this Chinese sentence into English:
"虽然今天天气不太好,但是我还是决定出门买一些新鲜的苹果和香蕉。" ("Although the weather isn't great today, I still decided to go out and buy some fresh apples and bananas.")
How a traditional RNN handles it:
# The RNN's "amnesia" in action (illustrative pseudocode)
encoder_states = []
state = None
for word in chinese_sentence:
    state = rnn_encode(word, state)
    encoder_states.append(state)
# Problem: by the end, the state can barely remember the "虽然" ("although") at the start
final_state = encoder_states[-1]  # only the most recent information survives
english_translation = decode(final_state)  # translation quality suffers
It's like a forgetful interpreter who, by the time they reach the end of the sentence, has already forgotten how it began!
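To see the bottleneck concretely, here is a minimal sketch (toy shapes, using PyTorch's nn.GRU) in which the entire input sequence has to squeeze through the encoder's single final hidden state:
import torch

# Toy encoder: 10 "words", each a random 32-dim embedding
embeddings = torch.randn(1, 10, 32)         # [batch, seq_len, emb_dim]
encoder = torch.nn.GRU(input_size=32, hidden_size=64, batch_first=True)
outputs, final_state = encoder(embeddings)  # outputs: [1, 10, 64], final_state: [1, 1, 64]
# A classic seq2seq decoder only ever sees final_state: 10 words squeezed into 64 numbers.
# Whatever the decoder needs to know about the start of the sentence must survive that squeeze.
print(outputs.shape, final_state.shape)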
An even bigger problem: treating every word the same
# every word gets equal treatment
sentence = "我不喜欢吃苹果"  # "I don't like eating apples"
weights = [0.2, 0.2, 0.2, 0.2, 0.2]  # identical weight for each word
# but the negation "不" ("not") clearly deserves more attention!
ideal_weights = [0.1, 0.4, 0.1, 0.2, 0.2]  # "不" should carry a higher weight
💡 The Attention Mechanism: AI's "Spotlight"
What is attention?
[Insert image 1: encoder-decoder-attention.png]
![Encoder-decoder attention](https://cdn.learnku.com/uploads/images/202510/25/46135/R63txnkPFr.png!large)
Figure: encoder-decoder attention, showing how the decoder attends to different encoder states
class AttentionMechanism:
    def __init__(self):
        self.spotlight = "AI's focusing system"
    def focus_on(self, input_words, current_output):
        """
        Decide which input words to focus on, given the word currently being generated.
        (Illustrative pseudocode: calculate_relevance, softmax and weighted_sum are left abstract.)
        """
        attention_scores = []
        for input_word in input_words:
            # compute a relevance score
            score = self.calculate_relevance(input_word, current_output)
            attention_scores.append(score)
        # normalize the scores
        attention_weights = softmax(attention_scores)
        # weighted combination of the input information
        context = weighted_sum(input_words, attention_weights)
        return context, attention_weights
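A minimal runnable version of the same idea, using simple dot-product scores in place of the abstract calculate_relevance (toy tensor shapes assumed):
import torch

def dot_product_attention(query, keys, values):
    """query: [d], keys/values: [seq_len, d] -> context: [d], weights: [seq_len]"""
    scores = keys @ query                   # relevance of each input word to the query
    weights = torch.softmax(scores, dim=0)  # normalize into a probability distribution
    context = weights @ values              # weighted combination of the inputs
    return context, weights

# Toy example: 6 input "words" with 16-dim states, one decoder query
keys = values = torch.randn(6, 16)
query = torch.randn(16)
context, weights = dot_product_attention(query, keys, values)
print(weights)  # which input positions the decoder is "looking at"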
The magic of the attention matrix
[Insert image 2: bahdanau-fig3.png]
![English-French attention matrix](https://cdn.learnku.com/uploads/images/202510/25/46135/nMELisej07.png!large)
Figure: a real attention matrix, showing which words align with which in English-French translation
# attention weights for an English-to-French example
sentence_en = ["The", "cat", "sat", "on", "the", "mat"]
sentence_fr = ["Le", "chat", "était", "assis", "sur", "le", "tapis"]
attention_matrix = {
    "Le": {"The": 0.9, "cat": 0.1, ...},    # "Le" mostly attends to "The"
    "chat": {"The": 0.1, "cat": 0.8, ...},  # "chat" mostly attends to "cat"
    "était": {"sat": 0.7, ...},             # "était" mostly attends to "sat"
    # ... the remaining alignments
}
This matrix tells us which English words to focus on when producing each French word!
🚀 The Transformer: Attention in Its Ultimate Form
Why do we need the Transformer?
The three pain points of RNNs:
- Sequential processing: words must be handled one at a time, so nothing can run in parallel
- Long-range dependencies: relationships between distant words are hard to capture
- Slow training: serial processing keeps training inefficient
The Transformer's answer:
# RNN: serial processing
for i, word in enumerate(sentence):
    hidden[i] = rnn_cell(word, hidden[i-1])  # must wait for the previous step to finish
# Transformer: parallel processing
all_embeddings = embedding_layer(sentence)  # every word is processed at once
attention_output = multi_head_attention(all_embeddings)  # attention computed in parallel
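To make the parallel path concrete, here is a small sketch using PyTorch's built-in torch.nn.MultiheadAttention (the layer sizes are arbitrary):
import torch

embeddings = torch.randn(2, 20, 128)  # [batch, seq_len, d_model]: all 20 positions at once
attn = torch.nn.MultiheadAttention(embed_dim=128, num_heads=8, batch_first=True)
# one call attends over every pair of positions simultaneously, with no left-to-right loop
output, weights = attn(embeddings, embeddings, embeddings)
print(output.shape, weights.shape)  # torch.Size([2, 20, 128]) torch.Size([2, 20, 20])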
Positional encoding: giving each word a "seat number"
[Insert image 3: pos-embedding.png]
![Positional encoding](https://cdn.learnku.com/uploads/images/202510/25/46135/zV3OXey2pY.png!large)
Figure: positional encoding, showing how position information is folded into the word embeddings
import torch
import math
def positional_encoding(seq_len, d_model):
    """
    Generate sinusoidal positional encodings
    seq_len: sequence length
    d_model: embedding dimension
    """
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1).float()
    # build the encodings from sin and cos waves at different frequencies
    div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                         -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even embedding dimensions use sin
    pe[:, 1::2] = torch.cos(position * div_term)  # odd embedding dimensions use cos
    return pe
# Example: 5 words, each with a 128-dimensional embedding
pos_encoding = positional_encoding(5, 128)
print(f"Positional encoding shape: {pos_encoding.shape}")
class PositionalEmbedding(torch.nn.Module):
def __init__(self, vocab_size, d_model, max_len=512):
super().__init__()
self.token_embedding = torch.nn.Embedding(vocab_size, d_model)
self.pos_encoding = positional_encoding(max_len, d_model)
def forward(self, x):
seq_len = x.size(1)
        # token embedding + positional encoding
token_emb = self.token_embedding(x)
pos_emb = self.pos_encoding[:seq_len, :].unsqueeze(0)
return token_emb + pos_emb
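A quick sanity check of the layer above (toy vocabulary and random token ids):
# Toy check: one sentence of 5 token ids from a 1,000-word vocabulary
embed = PositionalEmbedding(vocab_size=1000, d_model=128)
tokens = torch.randint(0, 1000, (1, 5))  # [batch_size, seq_len]
print(embed(tokens).shape)               # torch.Size([1, 5, 128]): word meaning + position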
🎯 Multi-Head Attention: AI's "Multiple Personalities"
Self-attention
[Insert image 4: CoreferenceResolution.png]
![Coreference resolution with self-attention](https://cdn.learnku.com/uploads/images/202510/25/46135/lkc5rFBfNF.png!large)
Figure: how self-attention handles coreference resolution, working out what the pronoun "it" refers to
class SelfAttention(torch.nn.Module):
def __init__(self, d_model):
super().__init__()
self.d_model = d_model
        self.W_q = torch.nn.Linear(d_model, d_model)  # Query projection
        self.W_k = torch.nn.Linear(d_model, d_model)  # Key projection
        self.W_v = torch.nn.Linear(d_model, d_model)  # Value projection
def forward(self, x):
# x: [batch_size, seq_len, d_model]
        Q = self.W_q(x)  # Query: what am I looking for?
        K = self.W_k(x)  # Key: what can be looked up?
        V = self.W_v(x)  # Value: the actual content
        # compute the attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_model)
        # softmax turns the scores into attention weights
        attention_weights = torch.softmax(scores, dim=-1)
        # a weighted sum gives the output
output = torch.matmul(attention_weights, V)
return output, attention_weights
# Example: resolving a pronoun within a sentence
sentence = ["The", "cat", "sat", "because", "it", "was", "tired"]
# self-attention lets "it" attend to "cat" and pick up the coreference
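Running the layer on random embeddings shows the shapes involved (a sketch; a real model would feed learned token embeddings):
x = torch.randn(1, 7, 64)  # 7 tokens ("The cat sat because it was tired"), 64-dim each
self_attn = SelfAttention(d_model=64)
output, attention_weights = self_attn(x)
print(output.shape)             # torch.Size([1, 7, 64])
print(attention_weights[0, 4])  # the row for "it": how strongly it attends to every word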
Multi-head attention: understanding from every angle
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # each head has its own projection weights
        self.heads = torch.nn.ModuleList([
            SelfAttention(self.d_k) for _ in range(num_heads)
        ])
        self.output_linear = torch.nn.Linear(d_model, d_model)
    def forward(self, x):
        batch_size, seq_len, d_model = x.size()
        # give each head its own d_k-dimensional slice of the input
        head_outputs = []
        for i, head in enumerate(self.heads):
            head_input = x[:, :, i * self.d_k:(i + 1) * self.d_k]
            head_output, _ = head(head_input)
            head_outputs.append(head_output)
        # concatenate the outputs of all heads
        multi_head_output = torch.cat(head_outputs, dim=-1)
        # final linear projection
        return self.output_linear(multi_head_output)
# 8 attention heads, each picking up a different kind of linguistic pattern
# head 1: syntactic relations (subject-verb-object)
# head 2: semantic relations (synonyms, antonyms)
# head 3: long-range dependencies
# head 4: coreference resolution
# ...
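Wiring it up (toy sizes; d_model must be divisible by num_heads):
mha = MultiHeadAttention(d_model=64, num_heads=8)
x = torch.randn(1, 7, 64)  # [batch_size, seq_len, d_model]
print(mha(x).shape)        # torch.Size([1, 7, 64])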
[Insert image 5: transformer-animated-explanation.gif]
Figure: animated Transformer walkthrough, showing how attention is computed in parallel
🔄 The Complete Transformer Architecture
class TransformerBlock(torch.nn.Module):
def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
super().__init__()
self.multi_head_attention = MultiHeadAttention(d_model, num_heads)
self.feed_forward = torch.nn.Sequential(
torch.nn.Linear(d_model, d_ff),
torch.nn.ReLU(),
torch.nn.Linear(d_ff, d_model)
)
self.norm1 = torch.nn.LayerNorm(d_model)
self.norm2 = torch.nn.LayerNorm(d_model)
self.dropout = torch.nn.Dropout(dropout)
def forward(self, x):
        # multi-head attention + residual connection + layer normalization
attn_output = self.multi_head_attention(x)
x = self.norm1(x + self.dropout(attn_output))
        # feed-forward network + residual connection + layer normalization
ff_output = self.feed_forward(x)
x = self.norm2(x + self.dropout(ff_output))
return x
class TransformerModel(torch.nn.Module):
def __init__(self, vocab_size, d_model, num_heads, num_layers, num_classes):
super().__init__()
self.embedding = PositionalEmbedding(vocab_size, d_model)
self.transformer_blocks = torch.nn.ModuleList([
TransformerBlock(d_model, num_heads, d_model * 4)
for _ in range(num_layers)
])
self.classifier = torch.nn.Linear(d_model, num_classes)
def forward(self, x):
        # token + positional embedding
x = self.embedding(x)
        # pass through the stack of Transformer blocks
for transformer in self.transformer_blocks:
x = transformer(x)
        # global average pooling, then classification
x = x.mean(dim=1) # [batch_size, d_model]
return self.classifier(x)
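A forward pass with random token ids confirms that the pieces fit together (toy configuration; 4 output classes as in the news task below):
model = TransformerModel(vocab_size=1000, d_model=64, num_heads=8, num_layers=2, num_classes=4)
tokens = torch.randint(0, 1000, (2, 20))  # [batch_size, seq_len]
print(model(tokens).shape)                # torch.Size([2, 4]): one score per class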
🤖 BERT: The Pretrained Language Prodigy
[Insert image 6: jalammarBERT-language-modeling-masked-lm.png]
![BERT masked language modeling](https://cdn.learnku.com/uploads/images/202510/25/46135/jNSwjn1Esa.png!large)
Figure: BERT's masked language modeling, showing how the model predicts masked-out tokens
How BERT is trained
class BERTPretraining:
    def __init__(self):
        self.model = "a very large Transformer"
        self.training_data = "Wikipedia + BooksCorpus"
    def masked_language_modeling(self, sentence):
        """
        Masked language modeling: randomly mask about 15% of the tokens and make the model guess them
        """
        # original sentence: "I like to eat apples"
        masked_sentence = "I like to eat [MASK]"
        # goal: predict that [MASK] is "apples"
        return self.predict_masked_word(masked_sentence)
    def next_sentence_prediction(self, sentence_a, sentence_b):
        """
        Next sentence prediction: decide whether two sentences are consecutive
        """
        # Sentence A: "The weather is lovely today"
        # Sentence B: "I decided to go for a walk" → consecutive (IsNext)
        # Sentence B: "Cats love to sleep" → not consecutive (NotNext)
        return self.predict_relationship(sentence_a, sentence_b)
# through these two tasks, BERT learns the deep regularities of language
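You can watch masked language modeling in action with a few lines (this downloads the public bert-base-uncased weights from the Hugging Face Hub, so it needs network access on first run):
from transformers import pipeline

# ask BERT to fill in the blank, which is exactly the masked-LM task
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for guess in fill_mask("I like to eat [MASK]."):
    print(f"{guess['token_str']:>10}  {guess['score']:.3f}")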
Using a pretrained BERT
from transformers import BertTokenizer, BertForSequenceClassification
import torch
# load the pretrained model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=4  # 4 news categories
)
def classify_with_bert(text):
    """Classify a piece of text with BERT"""
    # tokenize and encode
inputs = tokenizer(
text,
return_tensors='pt',
padding=True,
truncation=True,
max_length=512
)
    # forward pass
with torch.no_grad():
outputs = model(**inputs)
logits = outputs.logits
predicted_class = torch.argmax(logits, dim=1)
return predicted_class.item()
# fine-tuning
def fine_tune_bert(train_loader, epochs=3):
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for batch in train_loader:
            inputs, labels = batch
            # forward pass
            outputs = model(**inputs, labels=labels)
            loss = outputs.loss
            # backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        avg_loss = total_loss / len(train_loader)
        print(f'Epoch {epoch + 1}: Loss = {avg_loss:.4f}')
# To kick off fine-tuning, build a train_loader that yields (inputs_dict, labels) batches; a minimal sketch follows.
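For example, a tiny in-memory "loader" (two made-up headlines; in practice you would wrap a real dataset in a torch DataLoader) is enough to exercise the loop above:
sample_texts = ["Stock market surges to a record high", "Team wins the championship final"]
sample_labels = torch.tensor([2, 1])  # Business, Sports
enc = tokenizer(sample_texts, padding=True, truncation=True, max_length=128, return_tensors='pt')
train_loader = [({'input_ids': enc['input_ids'],
                  'attention_mask': enc['attention_mask']}, sample_labels)]  # one tiny "batch"
fine_tune_bert(train_loader)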
🎯 Hands-On: News Classification, Upgraded
class NewsClassifierWithBERT:
def __init__(self):
self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
self.model = BertForSequenceClassification.from_pretrained(
'bert-base-uncased',
num_labels=4
)
self.classes = ['World', 'Sports', 'Business', 'Sci/Tech']
def preprocess_data(self, texts, labels, max_length=128):
"""数据预处理"""
encoded = self.tokenizer(
texts,
padding=True,
truncation=True,
max_length=max_length,
return_tensors='pt'
)
return {
'input_ids': encoded['input_ids'],
'attention_mask': encoded['attention_mask'],
'labels': torch.tensor(labels)
}
def train(self, train_texts, train_labels, val_texts, val_labels):
"""训练模型"""
# 预处理数据
train_data = self.preprocess_data(train_texts, train_labels)
val_data = self.preprocess_data(val_texts, val_labels)
        # build the data loader
train_dataset = torch.utils.data.TensorDataset(
train_data['input_ids'],
train_data['attention_mask'],
train_data['labels']
)
train_loader = torch.utils.data.DataLoader(
train_dataset, batch_size=16, shuffle=True
)
        # optimizer setup
optimizer = torch.optim.Adam(self.model.parameters(), lr=2e-5)
        # training loop
self.model.train()
for epoch in range(3):
total_loss = 0
for batch in train_loader:
input_ids, attention_mask, labels = batch
optimizer.zero_grad()
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
labels=labels
)
loss = outputs.loss
loss.backward()
optimizer.step()
total_loss += loss.item()
avg_loss = total_loss / len(train_loader)
print(f'Epoch {epoch + 1}: Loss = {avg_loss:.4f}')
def predict(self, text):
"""预测单个文本"""
inputs = self.tokenizer(
text,
return_tensors='pt',
padding=True,
truncation=True,
max_length=128
)
self.model.eval()
with torch.no_grad():
outputs = self.model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1)
return {
'class': self.classes[predicted_class.item()],
'confidence': predictions.max().item(),
'probabilities': {
cls: prob.item()
for cls, prob in zip(self.classes, predictions[0])
}
}
# usage example
classifier = NewsClassifierWithBERT()
# training data (a tiny toy sample)
train_texts = ["Stock market surges...", "Football championship..."]
train_labels = [2, 1]  # Business, Sports
val_texts = ["New AI chip unveiled..."]  # toy validation split (assumed; unused by the simplified loop above)
val_labels = [3]  # Sci/Tech
classifier.train(train_texts, train_labels, val_texts, val_labels)
# prediction
result = classifier.predict("Apple Inc. reports quarterly earnings...")
print(f"Predicted class: {result['class']}")
print(f"Confidence: {result['confidence']:.4f}")
🔧 Deployment and Optimization Tips
Python environment setup
# configured to match your usual setup
import pymysql
pymysql.install_as_MySQLdb()
# Transformer-related packages
"""
torch>=1.9.0
transformers>=4.0.0
tokenizers>=0.10.0
datasets>=1.0.0
"""
Optimized deployment on an Alibaba Cloud server
# on your Linux server (BT Panel + Apache)
pip install torch transformers accelerate
# model quantization / optimization tooling
pip install optimum
# model optimization and deployment
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from optimum.onnxruntime import ORTModelForSequenceClassification
# export the model to ONNX to reduce memory use and speed up inference
def optimize_model_for_deployment(model_name):
    # convert to ONNX format
model = ORTModelForSequenceClassification.from_pretrained(
model_name,
from_transformers=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
    # build the inference pipeline
classifier = pipeline(
"text-classification",
model=model,
tokenizer=tokenizer
)
return classifier
# serving API
from flask import Flask, request, jsonify
app = Flask(__name__)
classifier = optimize_model_for_deployment('your-fine-tuned-model')
@app.route('/classify', methods=['POST'])
def classify_text():
data = request.json
text = data.get('text', '')
result = classifier(text)
return jsonify({
'text': text,
'prediction': result,
'model': 'BERT-based classifier'
})
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000)
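Once the service is running, any client can call it over HTTP; for example from Python with the requests library (assuming the server above is listening on localhost:5000):
import requests

resp = requests.post(
    "http://localhost:5000/classify",
    json={"text": "Apple Inc. reports quarterly earnings..."},
    timeout=30,
)
print(resp.json())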
Golang batch-processing script
// Batch processing of texts and export of the results
package main
import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
	"time"
)
type ClassificationResult struct {
	Text       string  `json:"text"`
	Class      string  `json:"class"`
	Confidence float64 `json:"confidence"`
	Timestamp  string  `json:"timestamp"`
}
// callClassificationAPI posts one text to the Flask /classify service above; it assumes the
// service runs on localhost:5000 and that "prediction" holds the pipeline's {label, score} list.
func callClassificationAPI(text string) ClassificationResult {
	payload, _ := json.Marshal(map[string]string{"text": text})
	resp, err := http.Post("http://localhost:5000/classify", "application/json", bytes.NewBuffer(payload))
	if err != nil {
		log.Fatalf("classify request failed: %v", err)
	}
	defer resp.Body.Close()
	var parsed struct {
		Prediction []struct {
			Label string  `json:"label"`
			Score float64 `json:"score"`
		} `json:"prediction"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&parsed); err != nil || len(parsed.Prediction) == 0 {
		log.Fatalf("unexpected response for %q: %v", text, err)
	}
	return ClassificationResult{
		Text:       text,
		Class:      parsed.Prediction[0].Label,
		Confidence: parsed.Prediction[0].Score,
		Timestamp:  time.Now().Format(time.RFC3339),
	}
}
func batchClassification(texts []string) []ClassificationResult {
	// call the classification service for each text in turn
	var results []ClassificationResult
	for _, text := range texts {
		result := callClassificationAPI(text)
		results = append(results, result)
	}
	return results
}
func exportResults(results []ClassificationResult, filename string) error {
	file, err := os.Create(filename)
	if err != nil {
		return err
	}
	defer file.Close()
	encoder := json.NewEncoder(file)
	for _, result := range results {
		if err := encoder.Encode(result); err != nil {
			return err
		}
	}
	return nil
}
func main() {
	texts := []string{"Stock market surges...", "Football championship..."}
	results := batchClassification(texts)
	if err := exportResults(results, "results.jsonl"); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("exported %d results\n", len(results))
}
🌟 The Transformer Application Ecosystem
1. The modern AI star family
# the Transformer family tree
transformer_family = {
    "BERT": "the understanding specialist: classification and question answering",
    "GPT": "the generative talent: writing and dialogue",
    "T5": "the all-rounder: everything as text-to-text",
    "RoBERTa": "a beefed-up BERT",
    "DistilBERT": "a lightweight BERT",
    "ELECTRA": "a more efficient pretraining method",
    "DeBERTa": "BERT with improved (disentangled) attention"
}
for model, description in transformer_family.items():
print(f"{model}: {description}")
2. Multi-task applications
from transformers import pipeline
# text classification
classifier = pipeline("text-classification")
# named entity recognition
ner = pipeline("ner")
# question answering
qa = pipeline("question-answering")
# text summarization
summarizer = pipeline("summarization")
# sentiment analysis
sentiment = pipeline("sentiment-analysis")
# machine translation
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")
# one-stop analysis example
def comprehensive_analysis(text):
return {
'classification': classifier(text),
'entities': ner(text),
'sentiment': sentiment(text),
'summary': summarizer(text, max_length=50)
}
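For example (each pipeline downloads a default model the first time it is used):
article = "Apple Inc. reported quarterly earnings that beat expectations on strong iPhone sales."
report = comprehensive_analysis(article)
print(report['sentiment'])  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]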
🎉 Today's Takeaways
- The attention mechanism: teaches AI to "focus" and fixes the RNN memory problem
- The Transformer architecture: parallel processing, positional encoding, multi-head attention
- The BERT revolution: the pretrain-then-fine-tune transfer learning paradigm
- Hands-on practice: from news classification to multi-task NLP
- Deployment: model quantization, API serving, batch processing
🔮 Series Wrap-Up
From bag-of-words to the Transformer, we have watched NLP evolve end to end:
- 5-1 Basics: Bag-of-Words - AI's "digitization" awakening
- 5-14 Intermediate: Word Embeddings - giving words "meaning"
- 5-15 Advanced: Language Modeling - teaching AI to grasp "context"
- 5-16 Professional: RNN/LSTM - AI's "memory system"
- 5-17 Creative: Generative Networks - AI's turn at "literary creation"
- 5-18 Pinnacle: The Transformer - AI's "all-purpose brain"
From simple word counts to systems that can understand, write, and translate: that is the remarkable transformation deep learning has brought to NLP!
If this was useful, remember to like and share! And if you want to dig into what powers ChatGPT and GPT-4, you now hold the core of the tech stack!
#AILearning #Transformer #BERT #AttentionMechanism #DeepLearning #NLP #ChatGPT