[AI Fun] From Borrowed to Homemade: A Language-Modeling Playbook for Teaching Computers to "Speak Human" (5-15, Advanced)

wangchunbo's personal blog / created 4 months ago / updated 4 months ago

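AI Summary

This article explains why and how to train a domain-specific language model from scratch. It focuses on how two unsupervised models, CBoW and Skip-Gram, work and how to implement them, covering data preparation, model construction, and training. It also offers model-selection advice, optimization techniques, and deployment notes to help readers build custom word embeddings.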

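Hi, AI explorers!

Last time we learned to use a Word2Vec model that someone else had trained, much like using an off-the-shelf "translation dictionary". But what if you could train your own dedicated "AI language teacher"? Today we look at how to train a language model from scratch!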

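🤔 Why train your own?

Isn't someone else's Word2Vec good enough?

Imagine:

You are building a medical AI, but the Word2Vec model was trained on news text
Your data is full of specialist terminology that leaves the pretrained model baffled
"CT scan" and "MRI" may look barely related in a general-purpose model

That is why we need domain-specific language models!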

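The magic of unsupervised learning

Best of all, training a language model requires no manual labeling!

# No labeled data like this is needed:
labeled_data = [
    ("This is medical news", "medical"),
    ("Stocks surged", "finance")
]

# A large amount of raw text is enough:
raw_text = """
The patient's symptoms include fever and cough...
The surgical plan uses a minimally invasive technique...
The CT scan shows a shadow in the lungs...
"""

The model learns the regularities of the language from the text itself, because the words it has to predict are taken from that same text.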

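🎯 Three schools of language modeling

1. N-Gram: look at what came before, guess what comes next

# Example: a 3-gram language model
# Given "I like", guess the next word
context = ["I", "like"]
candidates = ["eating", "playing", "watching", "listening"]
# The model is trained to predict a probability distribution over the candidates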
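
To make the 3-gram idea concrete, here is a minimal count-based sketch. It is only an illustration: the tiny corpus is made up, and `train_trigram` and `next_word_probs` are hypothetical helper names, not part of any library.

```python
from collections import Counter, defaultdict

def train_trigram(sentences):
    """Count how often each word follows each two-word context."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            counts[(a, b)][c] += 1
    return counts

def next_word_probs(counts, context):
    """Return P(next word | context) as relative frequencies."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}

# Made-up toy corpus, already tokenized
corpus = [
    ["I", "like", "eating", "apples"],
    ["I", "like", "eating", "pears"],
    ["I", "like", "watching", "films"],
]
trigram_counts = train_trigram(corpus)
probs = next_word_probs(trigram_counts, ["I", "like"])
print(probs)
```

A context that never appeared in the corpus yields an empty dictionary, which is exactly the sparsity problem that pushes us toward learned embeddings.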

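2. CBoW: look at the surroundings, guess the middle

# Continuous Bag of Words (CBoW)
# Example sentence: "I like [?] sweet apples"
# Task: guess the middle word "eating" from its context

context_words = ["I", "like", "sweet", "apples"]
target_word = "eating"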

3. Skip-Gram: look at the middle, guess the surroundings


# The opposite of CBoW
# Given the center word "eating",
# predict the surrounding words: ["I", "like", "sweet", "apples"]

center_word = "eating"
context_words = ["I", "like", "sweet", "apples"]

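🛠️ Hands-on code: the CBoW model

Data preparation: building training samples

def to_cbow(sentence, window_size=2):
    """
    Convert a sentence into CBoW training samples.

    Simplification: each sample is a single [context_word, target_word]
    pair rather than the whole context window.
    """
    samples = []
    for i, target_word in enumerate(sentence):
        # Collect the context words inside the window
        for j in range(max(0, i - window_size),
                       min(i + window_size + 1, len(sentence))):
            if i != j:  # skip the target word itself
                samples.append([sentence[j], target_word])
    return samples

# Example
sentence = ['I', 'like', 'eating', 'sweet', 'apples']
cbow_samples = to_cbow(sentence, window_size=1)
print(cbow_samples)
# [['like', 'I'], ['I', 'like'], ['eating', 'like'], ...]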
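
The function above flattens every window into separate word pairs. If you would rather keep each window together, which is closer to the "look at the surroundings, guess the middle" picture, a sketch could look like this. `to_cbow_windows` is a hypothetical helper name introduced here for illustration.

```python
def to_cbow_windows(sentence, window_size=2):
    """Return one (context_words, target_word) pair per position."""
    samples = []
    for i, target_word in enumerate(sentence):
        left = sentence[max(0, i - window_size):i]       # words before the target
        right = sentence[i + 1:i + window_size + 1]      # words after the target
        samples.append((left + right, target_word))
    return samples

sentence = ['I', 'like', 'eating', 'sweet', 'apples']
windows = to_cbow_windows(sentence, window_size=2)
for context, target in windows:
    print(context, '->', target)
```

Note that windows at the start and end of the sentence are shorter, so a model that consumes them in batches would need padding or a pooling step that tolerates variable lengths.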

Implementing the CBoW model in PyTorch

import torch
import torch.nn as nn

class CBoWModel(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # This is the word-embedding layer we want to train!
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):
        # x: IDs of the context words, shape (batch,)
        embeds = self.embedding(x)  # look up vectors, shape (batch, embed_dim)
        # Score every vocabulary word as the target, shape (batch, vocab_size)
        out = self.linear(embeds)
        return out

# Create the model
vocab_size = 5000
embed_dim = 100
model = CBoWModel(vocab_size, embed_dim)
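
The model above sees one context word per sample. A common CBoW variant averages the embeddings of the whole window before predicting the target. Here is a minimal sketch of that idea, assuming every window has the same number of context words; `AveragingCBoW` is a hypothetical class name and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

class AveragingCBoW(nn.Module):
    """CBoW variant that averages the embeddings of all context words."""

    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(embed_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) integer tensor
        embeds = self.embedding(context_ids)   # (batch, context_size, embed_dim)
        pooled = embeds.mean(dim=1)            # (batch, embed_dim)
        return self.linear(pooled)             # (batch, vocab_size)

avg_model = AveragingCBoW(vocab_size=50, embed_dim=8)
# Two windows of four context-word IDs each (arbitrary IDs)
context_ids = torch.tensor([[1, 2, 4, 5], [7, 8, 10, 11]])
logits = avg_model(context_ids)
print(logits.shape)
```

The output has one row of vocabulary scores per window, so it can be fed to `nn.CrossEntropyLoss` in the same way as the single-word model.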

Training loop

import torch.optim as optim

def train_cbow(model, train_data, epochs=10):
    """Train on (context_id, target_id) pairs, one pair per step."""
    optimizer = optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        total_loss = 0
        for context_word, target_word in train_data:
            # Forward pass
            context_tensor = torch.tensor([context_word])
            target_tensor = torch.tensor([target_word])

            output = model(context_tensor)
            loss = criterion(output, target_tensor)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f'Epoch {epoch+1}, Loss: {total_loss/len(train_data):.4f}')

# The model expects integer IDs, so map the words of the toy sentence to IDs first
word_to_id = {word: idx for idx, word in enumerate(sorted(set(sentence)))}
id_samples = [(word_to_id[c], word_to_id[t]) for c, t in cbow_samples]

# Train the model
train_cbow(model, id_samples)

🔄 Skip-Gram: CBoW in reverse

Skip-Gram is the mirror image of CBoW:

def to_skipgram(sentence, window_size=2):
    """
    Convert a sentence into Skip-Gram training samples.
    """
    samples = []
    for i, center_word in enumerate(sentence):
        # Predict every context word inside the window
        for j in range(max(0, i - window_size),
                       min(i + window_size + 1, len(sentence))):
            if i != j:
                samples.append([center_word, sentence[j]])
    return samples

# CBoW vs Skip-Gram
# Each Skip-Gram pair is the matching CBoW pair with its two elements swapped
sentence = ['I', 'like', 'eating', 'apples']
print("CBoW samples:", to_cbow(sentence, 1))
print("Skip-Gram samples:", to_skipgram(sentence, 1))

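Choosing between them

# CBoW: trains faster and tends to suit frequent words
# Skip-Gram: slower, but tends to represent rare words better

def choose_model(large_dataset, limited_time, rare_words_matter):
    """Rule of thumb for picking an architecture."""
    if large_dataset and limited_time:
        return "CBoW"
    if rare_words_matter:
        return "Skip-Gram"
    return None  # no strong preference: try both and compare

model_type = choose_model(large_dataset=True, limited_time=True, rare_words_matter=False)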

🧪 Full walkthrough: training on news data

import collections
import torchtext

def prepare_training_data():
    # Load the AG News dataset and materialize it as a list,
    # because we iterate over it more than once
    train_dataset, _ = torchtext.datasets.AG_NEWS(root='./data')
    train_dataset = list(train_dataset)

    # Build the vocabulary
    tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
    counter = collections.Counter()

    for label, text in train_dataset:
        tokens = tokenizer(text)
        counter.update(tokens)

    # Keep only the 5000 most common words
    # (torchtext.vocab.vocab expects an ordered token -> frequency mapping)
    vocab = torchtext.vocab.vocab(collections.OrderedDict(counter.most_common(5000)))

    return train_dataset, vocab, tokenizer

def encode_text(text, vocab, tokenizer):
    """Convert text into a sequence of word IDs, dropping out-of-vocabulary words."""
    tokens = tokenizer(text)
    return [vocab[token] for token in tokens if token in vocab]

# Prepare the data
train_dataset, vocab, tokenizer = prepare_training_data()

# Generate CBoW training samples
X, Y = [], []
for i, (label, text) in enumerate(train_dataset):
    if i >= 1000:  # cap the amount of data to save time
        break

    encoded = encode_text(text, vocab, tokenizer)
    cbow_samples = to_cbow(encoded, window_size=2)

    for context, target in cbow_samples:
        X.append(context)
        Y.append(target)

print(f"Generated {len(X)} training samples")
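
If you want to see what the vocabulary step is doing without any library, the following standard-library sketch builds a frequency-capped vocabulary and encodes text with it. Unlike the code above, which drops unknown words, this version maps them to a reserved ID 0; that choice, the `<unk>` token, and the helper names `build_vocab` and `encode` are assumptions made for illustration.

```python
from collections import Counter

def build_vocab(token_lists, max_size, unk_token="<unk>"):
    """Map the most frequent tokens to IDs; ID 0 is reserved for unknown words."""
    counter = Counter(token for tokens in token_lists for token in tokens)
    itos = [unk_token] + [tok for tok, _ in counter.most_common(max_size - 1)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def encode(tokens, stoi):
    """Convert tokens to IDs, sending out-of-vocabulary tokens to ID 0."""
    return [stoi.get(tok, 0) for tok in tokens]

# Made-up toy documents, already tokenized
docs = [
    "the patient has a fever".split(),
    "the patient has a cough".split(),
    "the scan shows a shadow".split(),
]
stoi, itos = build_vocab(docs, max_size=5)
encoded = encode("the patient has pneumonia".split(), stoi)
print(itos)
print(encoded)
```

Keeping an unknown-word ID preserves sentence length and word positions, which matters once the context window is part of the training signal.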

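🔍 Checking the training result

def find_similar_words(word, model, vocab, top_k=5):
    """Find the words whose embeddings are closest to the given word."""
    if word not in vocab:
        return f"Word '{word}' is not in the vocabulary"

    # Embedding vector of the query word
    word_id = vocab[word]
    all_vecs = model.embedding.weight.detach()  # (vocab_size, embed_dim)
    target_vec = all_vecs[word_id]              # (embed_dim,)

    # Cosine similarity against every word, computed along the embedding dimension
    similarities = torch.cosine_similarity(target_vec.unsqueeze(0), all_vecs, dim=1)

    # Take the most similar words
    _, indices = similarities.topk(top_k + 1)  # +1 because the word itself is included

    # Drop the query word itself
    itos = vocab.get_itos()
    return [itos[idx] for idx in indices.tolist() if idx != word_id][:top_k]

# Try it out. The results are only meaningful once the model has been trained
# on the news samples, e.g. train_cbow(model, list(zip(X, Y)))
print("Words similar to 'china':", find_similar_words('china', model, vocab))
print("Words similar to 'sports':", find_similar_words('sports', model, vocab))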
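
To see what the similarity search computes, here is a small worked example using only the standard library. The three-dimensional vectors are invented by hand for illustration and do not come from a trained model; `cosine_similarity` and `most_similar` are hypothetical helper names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, embeddings, top_k=2):
    """Rank every other word by cosine similarity to the query word."""
    query = embeddings[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in embeddings.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hand-made toy vectors: two "medical" words and two "finance" words
toy_embeddings = {
    "ct":     [0.9, 0.1, 0.0],
    "mri":    [0.8, 0.2, 0.1],
    "stock":  [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
}
neighbours = most_similar("ct", toy_embeddings)
print(neighbours)
```

Words that point in nearly the same direction score close to 1, while unrelated words score close to 0, regardless of vector length.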

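🚀 Deployment and optimization tips

Python environment setup

# Configure to taste
# (only needed if your project talks to MySQL through the MySQLdb interface)
import pymysql
pymysql.install_as_MySQLdb()

# Recommended package versions
"""
torch>=1.9.0
torchtext>=0.10.0
numpy>=1.21.0
"""

Deploying on an Alibaba Cloud server

# On a Linux server managed with BT Panel (宝塔面板)
pip install torch torchtext gensim

# Configure Apache as a reverse proxy
# and expose the model as a service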

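Go data-processing script

// Go script for large-scale text preprocessing
package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// cleanText is a placeholder (an assumption): it only collapses whitespace. Add your own rules.
func cleanText(text string) string { return strings.Join(strings.Fields(text), " ") }

func preprocessText(filename string) error {
    file, err := os.Open(filename)
    if err != nil {
        return err
    }
    defer file.Close()
    // Assumed output path: the input name plus ".clean"
    out, err := os.Create(filename + ".clean")
    if err != nil {
        return err
    }
    defer out.Close()

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        text := strings.ToLower(scanner.Text())
        // Clean and normalize the text
        processedText := cleanText(text)
        // Write it to the processed file
        fmt.Fprintln(out, processedText)
    }
    return scanner.Err()
}

func main() {
    if err := preprocessText("corpus.txt"); err != nil { // "corpus.txt" is a placeholder path
        fmt.Fprintln(os.Stderr, err)
    }
}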

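🎯 Advanced techniques

1. Negative sampling

# A full softmax over the vocabulary is expensive, so use negative sampling instead
class CBoWWithNegativeSampling(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.output_embedding = nn.Embedding(vocab_size, embed_dim)

    def forward(self, context, target, negative_samples):
        # context, target: (batch,); negative_samples: (batch, num_neg)
        context_vec = self.embedding(context)               # (batch, embed_dim)
        target_vec = self.output_embedding(target)          # (batch, embed_dim)
        neg_vecs = self.output_embedding(negative_samples)  # (batch, num_neg, embed_dim)

        # Dot-product scores for the positive pair and for each negative sample
        pos_score = torch.sum(context_vec * target_vec, dim=1)                 # (batch,)
        neg_scores = torch.bmm(neg_vecs, context_vec.unsqueeze(2)).squeeze(2)  # (batch, num_neg)

        return pos_score, neg_scores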
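
The class above returns raw scores, which leaves two questions open: where the negative samples come from, and how the scores become a loss. The sketch below answers both with plain Python numbers so the arithmetic is easy to follow. It assumes the usual word2vec recipe of drawing negatives in proportion to word frequency raised to the power 0.75 and a loss of -log σ(pos) - Σ log σ(-neg); the helper names and the toy token list are made up for illustration.

```python
import math
import random
from collections import Counter

def make_negative_sampler(tokens, power=0.75, seed=0):
    """Return a function that draws words with probability ~ count ** power."""
    counts = Counter(tokens)
    words = list(counts)
    weights = [counts[w] ** power for w in words]
    rng = random.Random(seed)

    def sample(k, exclude):
        # Assumes the vocabulary contains at least one word other than `exclude`
        picked = []
        while len(picked) < k:
            word = rng.choices(words, weights=weights, k=1)[0]
            if word != exclude:
                picked.append(word)
        return picked

    return sample

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def negative_sampling_loss(pos_score, neg_scores):
    """Reward a high score for the true pair and low scores for sampled pairs."""
    return -log_sigmoid(pos_score) - sum(log_sigmoid(-s) for s in neg_scores)

tokens = "the cat sat on the mat the end".split()
sample = make_negative_sampler(tokens)
negatives = sample(3, exclude="cat")
print(negatives)
print(negative_sampling_loss(2.0, [-1.0, -0.5, 0.3]))
```

Each training step now touches only the target word and a handful of sampled words instead of the whole vocabulary, which is where the speed-up comes from.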

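2. Hierarchical softmax

# An optimization for large vocabularies
# Words are arranged in a tree, so scoring one word costs roughly O(log V) instead of O(V)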
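
One common way to build that tree is a Huffman tree over word frequencies, so that frequent words sit near the root and get short paths. The sketch below only builds the tree and prints each word's path; the per-node classifiers that a real hierarchical softmax trains are left out. The frequencies are invented and `huffman_codes` is a hypothetical helper name.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(frequencies):
    """Build a Huffman tree and return each word's path as a string of 0s and 1s."""
    tiebreak = count()  # unique counter so heap entries never compare nodes directly
    heap = [(freq, next(tiebreak), word) for word, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        freq_a, _, left = heapq.heappop(heap)    # two least frequent subtrees
        freq_b, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (freq_a + freq_b, next(tiebreak), (left, right)))

    codes = {}

    def walk(node, path):
        if isinstance(node, tuple):      # internal node: (left, right)
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:                            # leaf: a word
            codes[node] = path

    walk(heap[0][2], "")
    return codes

# Made-up word frequencies
freqs = Counter({"the": 50, "patient": 20, "scan": 15, "fever": 10, "stenosis": 5})
codes = huffman_codes(freqs)
for word, code in sorted(codes.items(), key=lambda item: len(item[1])):
    print(f"{word:10s} {code}")
```

The length of a word's code is the number of binary decisions needed to reach it, so predicting a frequent word takes far fewer operations than a full pass over the vocabulary.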

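🎉 Today's takeaways

Language modeling: the art of learning language regularities without labels
CBoW: predict the center word from its context
Skip-Gram: predict the context from the center word
Self-trained embeddings: a custom solution for a specific domain
Optimization techniques: negative sampling, hierarchical softmax, and more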

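🔮 Coming up next

Next time we enter the world of recurrent neural networks (RNNs)!

Want to know how to give AI a "memory" so it can follow context across long texts? From the simple RNN to LSTM and on to the modern Transformer: stay tuned for 5-16, the RNN installment!

If you found this useful, please like and share! Questions are welcome in the comments.

#AILearning #LanguageModeling #Word2Vec #CBoW #SkipGram #DeepLearning

This work is licensed under a CC license; reposts must credit the author and link to this article.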

