python 用whisper做一个 video 处理字幕工具 ffmpeg

wangchunbo 的个人博客 / 32 / 2 / 创建于 2个月前 / 更新于 2个月前

AI摘要

本文分享了一个本地化视频字幕生成与文本提取的技术方案。作者针对学习视频无字幕、在线服务不佳的问题，提出使用开源Whisper模型进行语音识别，生成SRT字幕文件，并通过脚本解析为纯文本TXT。文章详细说明了FFmpeg环境配置、Whisper模型调用、SRT转TXT的代码实现，并提供了支持GPU加速的批量处理脚本，属于技术性的【知识分享】。

仓库地址

gitee.com/bobobobbb/py_whisper_ffm...

背景

嗯，拿到很多video，学习的，但是没字幕啊。直接听，效率没那么高，
转笔记麻烦。

win11 实时字幕还得挂着，线上一些网站提供的服务太垃圾了。

索性自己整一个吧。

整体思路

用 whisper（本地开源模型）对视频做语音识别，生成字幕文件（srt 或 vtt）。
再写一个简单解析脚本，把字幕里的文本部分抽取出来，按顺序写入 txt。

整个流程全本地，不依赖云服务。

配环境

python 做一个 video 处理字幕工具

下在ffmpeg

注意很多博主说去镜像站下https://www.gyan.dev/ffmpeg/builds/，实际上慢的要死。

python 做一个 video 处理字幕工具

这里建议大家直接去github 下，起码自己还能加速一下。

github.com/GyanD/codexffmpeg/relea...
python 做一个 video 处理字幕工具

python 做一个 video 处理字幕工具

速度提上来了。

python 做一个 video 处理字幕工具

下载完成后，把压缩包解压到一个固定目录，比如：

D:\apps\ffmpeg

解压后一般是这种结构：

D:\apps\ffmpeg
    └─ ffmpeg-6.x-full_build
        ├─ bin
        │   ├─ ffmpeg.exe
        │   ├─ ffprobe.exe
        │   └─ ...
        └─ ...

记住 bin 目录路径，例如：

D:\apps\ffmpeg\ffmpeg-6.x-full_build\bin

二、把 ffmpeg 加到系统 PATH

在桌面右键“此电脑” → 属性 → 左边“高级系统设置”
点“环境变量(N)…”
在“系统变量”里找到 Path → 选中 → 点击“编辑”

python 做一个 video 处理字幕工具

点击“新建”，把刚才的 bin 路径粘进去，例如：
```
D:\apps\ffmpeg\ffmpeg-6.x-full_build\bin
```

python 做一个 video 处理字幕工具

一路确定关闭所有窗口。

注意：改完 PATH 之后，要重新打开一个新的 PowerShell / CMD 窗口，旧窗口里看不到更新。

三、验证 ffmpeg 是否可用

新开一个终端（不要用之前那个窗口）：

ffmpeg -version

如果能看到版本信息，就说明 PATH 配好了。

代码

代码：视频生成 srt 字幕

import whisper
import os

def video_to_srt(video_path, model_size="small", language="zh"):
    """
    :param video_path: 视频文件路径
    :param model_size: 模型大小 tiny/base/small/medium/large
    :param language: 语言，中文 zh，英文 en，自动检测用 None
    """
    # 加载模型（首次会自动下载）
    model = whisper.load_model(model_size)

    # 识别
    result = model.transcribe(video_path, language=language)

    # 输出 srt 文件路径
    base, _ = os.path.splitext(video_path)
    srt_path = base + ".srt"

    # 写入 srt
    with open(srt_path, "w", encoding="utf-8") as f:
        counter = 1
        for seg in result["segments"]:
            start = seg["start"]
            end = seg["end"]
            text = seg["text"].strip()

            def format_time(t):
                h = int(t // 3600)
                m = int((t % 3600) // 60)
                s = int(t % 60)
                ms = int((t - int(t)) * 1000)
                return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

            f.write(f"{counter}\n")
            f.write(f"{format_time(start)} --> {format_time(end)}\n")
            f.write(text + "\n\n")
            counter += 1

    print("SRT 生成完成：", srt_path)
    return srt_path

if __name__ == "__main__":
    video_file = r"D:\video\example.mp4"  # 改成你自己的路径
    video_to_srt(video_file, model_size="small", language="zh")

说明：

model_size：
- 显卡好可以用 medium，效果更好。
- 机器一般就用 small/base。
language：
- 中文视频用 "zh"；
- 英文视频用 "en"；
- 不确定就 language=None 让它自动检测。

把 srt 字幕解析为 txt（纯文本）

1
00:00:00,000 --> 00:00:02,000
这是第一句字幕

2
00:00:02,000 --> 00:00:04,000
这是第二句字幕

代码：srt 转 txt

import os
import re

def srt_to_txt(srt_path):
    base, _ = os.path.splitext(srt_path)
    txt_path = base + ".txt"

    lines = []
    with open(srt_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # 跳过空行、数字行、时间轴行
            if not line:
                continue
            if line.isdigit():
                continue
            if re.match(r"\d{2}:\d{2}:\d{2},\d{3} -->", line):
                continue
            # 其它认为是字幕正文
            lines.append(line)

    # 合并写入 txt，可按你喜好处理：直接一行一条、或合并成一段
    with open(txt_path, "w", encoding="utf-8") as f:
        for l in lines:
            f.write(l + "\n")

    print("TXT 生成完成：", txt_path)
    return txt_path

if __name__ == "__main__":
    srt_file = r"D:\video\example.srt"  # 改成你的 srt 路径
    srt_to_txt(srt_file)

整合为一个脚本

import os
import re
import whisper

def video_to_srt(video_path, model_size="small", language="zh"):
    model = whisper.load_model(model_size)
    result = model.transcribe(video_path, language=language)

    base, _ = os.path.splitext(video_path)
    srt_path = base + ".srt"

    with open(srt_path, "w", encoding="utf-8") as f:
        counter = 1
        for seg in result["segments"]:
            start = seg["start"]
            end = seg["end"]
            text = seg["text"].strip()

            def format_time(t):
                h = int(t // 3600)
                m = int((t % 3600) // 60)
                s = int(t % 60)
                ms = int((t - int(t)) * 1000)
                return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

            f.write(f"{counter}\n")
            f.write(f"{format_time(start)} --> {format_time(end)}\n")
            f.write(text + "\n\n")
            counter += 1

    return srt_path

def srt_to_txt(srt_path):
    base, _ = os.path.splitext(srt_path)
    txt_path = base + ".txt"

    lines = []
    with open(srt_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.isdigit():
                continue
            if re.match(r"\d{2}:\d{2}:\d{2},\d{3} -->", line):
                continue
            lines.append(line)

    with open(txt_path, "w", encoding="utf-8") as f:
        for l in lines:
            f.write(l + "\n")

    return txt_path

if __name__ == "__main__":
    video_file = r"D:\video\example.mp4"  # 改成你的视频路径
    srt_file = video_to_srt(video_file, model_size="small", language="zh")
    txt_file = srt_to_txt(srt_file)
    print("全部完成：", txt_file)

执行结果。

也就2分钟得视频，怎么这么慢。。

你们可以根据自己电脑配置进行修改代码配置的哈。

python 做一个 video 处理字幕工具

成果

python 做一个 video 处理字幕工具

进阶版本代码

我的测试机笔记本是3080显卡，cpu是11th Gen Intel® Core™ i7-11800H @ 2.30GHz (2.30 GHz)，内存32.0 GB

扫描一个目录下所有视频文件
用 GPU（3080）+ whisper medium 模型识别
生成 .srt 和 .txt 两种文件
预设笔记本性能参数（GPU、batch、fp16 等）

依赖确认

虚拟环境里先装好：

pip install --upgrade pip
pip install openai-whisper torch

如果你本地还没装 GPU 版 torch，可以直接试一下，whisper 会自动尝试用 GPU；
如果报错说没有 CUDA，再按官方指引装对应 CUDA 版本的 torch。

批量处理脚本：`batch_video_to_txt.py`

import os
import re
import torch
import whisper
from typing import List

# -------------------- 配置区域 --------------------

# 要批量处理的视频目录（改成你的目录）
VIDEO_DIR = r"D:\work\videos"

# 支持的视频扩展名
VIDEO_EXTS = {".mp4", ".mkv", ".avi", ".mov", ".flv", ".wmv", ".m4v"}

# whisper 模型大小：tiny / base / small / medium / large
# 你 3080 + 32G，建议直接用 medium，效果和速度比较均衡
WHISPER_MODEL_SIZE = "medium"

# 语言：中文 "zh"，英文 "en"，None 自动检测
LANGUAGE = None  # 如果大部分是中文，也可以写成 "zh"

# 是否强制使用 GPU（如果有）
USE_GPU = True

# -------------------- 工具函数 --------------------

def list_videos(root: str) -> List[str]:
    """递归扫描目录下所有视频文件"""
    files = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext in VIDEO_EXTS:
                files.append(os.path.join(dirpath, name))
    return files

def format_timestamp(t: float) -> str:
    """秒 -> SRT 时间格式 00:00:00,000"""
    h = int(t // 3600)
    m = int((t % 3600) // 60)
    s = int(t % 60)
    ms = int((t - int(t)) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def save_srt(result: dict, srt_path: str):
    """whisper 识别结果保存为 .srt"""
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], start=1):
            start = seg["start"]
            end = seg["end"]
            text = seg["text"].strip()

            f.write(f"{i}\n")
            f.write(f"{format_timestamp(start)} --> {format_timestamp(end)}\n")
            f.write(text + "\n\n")

def srt_to_txt(srt_path: str) -> str:
    """从 .srt 提取纯文本，保存为 .txt"""
    base, _ = os.path.splitext(srt_path)
    txt_path = base + ".txt"

    lines = []
    with open(srt_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.isdigit():
                continue
            if re.match(r"\d{2}:\d{2}:\d{2},\d{3} -->", line):
                continue
            lines.append(line)

    with open(txt_path, "w", encoding="utf-8") as f:
        for l in lines:
            f.write(l + "\n")

    return txt_path

# -------------------- 主逻辑 --------------------

def main():
    # 检查 CUDA
    if USE_GPU and torch.cuda.is_available():
        device = "cuda"
    else:
        device = "cpu"

    print(f"使用设备: {device}")
    print(f"加载 whisper 模型: {WHISPER_MODEL_SIZE}")

    # 加载模型（放外面，只加载一次，批量用）
    model = whisper.load_model(WHISPER_MODEL_SIZE, device=device)

    # 性能参数：你 3080，可以开 fp16 + 合理的 beam_size
    # （如果要自定义可以改这里）
    common_kwargs = {
        "fp16": device == "cuda",  # GPU 用 fp16，CPU 用 fp32
        "beam_size": 5,
        "best_of": 5,
    }

    if LANGUAGE:
        common_kwargs["language"] = LANGUAGE

    # 扫描所有视频
    videos = list_videos(VIDEO_DIR)
    if not videos:
        print(f"目录下没有找到视频文件: {VIDEO_DIR}")
        return

    print(f"在 {VIDEO_DIR} 发现 {len(videos)} 个视频文件")

    for idx, video_path in enumerate(videos, start=1):
        base, ext = os.path.splitext(video_path)
        srt_path = base + ".srt"
        txt_path = base + ".txt"

        print(f"\n[{idx}/{len(videos)}] 处理: {video_path}")

        # 如果已经有 txt，默认跳过（避免重复算）
        if os.path.exists(txt_path):
            print(f"已存在 txt，跳过: {txt_path}")
            continue

        # 如果已经有 srt，就直接转 txt
        if os.path.exists(srt_path):
            print(f"已存在 srt，直接生成 txt: {srt_path}")
            srt_to_txt(srt_path)
            print(f"完成 txt: {txt_path}")
            continue

        # 识别
        print("开始识别（whisper.transcribe）...")
        result = model.transcribe(video_path, **common_kwargs)

        # 保存 srt
        save_srt(result, srt_path)
        print(f"完成 srt: {srt_path}")

        # srt -> txt
        txt_path = srt_to_txt(srt_path)
        print(f"完成 txt: {txt_path}")

    print("\n全部视频处理完毕。")

if __name__ == "__main__":
    main()

使用方式

把脚本保存为：batch_video_to_txt.py
把你要处理的视频都放到一个目录，例如：D:\work\videos
根据实际情况修改脚本顶部配置：

VIDEO_DIR = r"D:\work\videos"
WHISPER_MODEL_SIZE = "medium"   # 也可以试试 "large"
LANGUAGE = None                 # 全中文可以写成 "zh"

在虚拟环境中运行：

cd D:\work\py\video_to_txt
.\venv\Scripts\activate

python batch_video_to_txt.py

处理逻辑：

每个视频生成：xxx.srt + xxx.txt
如果已存在 .txt，默认跳过该视频；
如果有 .srt 没 .txt，只做 srt→txt，不重复识别。

本作品采用《CC 协议》，转载必须注明作者和本文链接

• 15年技术深耕：理论扎实 + 实战丰富，教学经验让复杂技术变简单 • 8年企业历练：不仅懂技术，更懂业务落地与项目实操 • 全栈服务力：技术培训 | 软件定制开发 | AI智能化升级关注「上海PHP自学中心」获取实战干货

python 用whisper做一个 video 处理字幕工具 ffmpeg

仓库地址

背景

整体思路

配环境

下在ffmpeg

二、把 ffmpeg 加到系统 PATH

三、验证 ffmpeg 是否可用

代码

代码：视频生成 srt 字幕

把 srt 字幕解析为 txt（纯文本）

代码：srt 转 txt

整合为一个脚本

执行结果。

成果

进阶版本代码

依赖确认

批量处理脚本：`batch_video_to_txt.py`

使用方式

推荐文章：

社区赞助商

关于 LearnKu

资源推荐

服务提供商

其他信息

python 用whisper做一个 video 处理字幕工具 ffmpeg

仓库地址

背景

整体思路

配环境

下在ffmpeg

二、把 ffmpeg 加到系统 PATH

三、验证 ffmpeg 是否可用

代码

代码：视频生成 srt 字幕

把 srt 字幕解析为 txt（纯文本）

代码：srt 转 txt

整合为一个脚本

执行结果。

成果

进阶版本代码

依赖确认

批量处理脚本：batch_video_to_txt.py

使用方式

推荐文章：

社区赞助商

关于 LearnKu

资源推荐

服务提供商

其他信息

请登录

批量处理脚本：`batch_video_to_txt.py`