Downloading Bilibili Subtitles with cURL Functions
Preface
Bilibili test video URL:
https://m.bilibili.com/video/BV1Dk4y1q781
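I have put the code on GitHub. It uses the ThinkPHP 6 (tp6) framework, and README.md explains the workflow in detail. After downloading the code, open http://your-server/index/index?bvid=BV1Dk4y1q781&start=0&end=29 to see the result. The $header request headers that the code passes to cURL were copied from Chrome, as shown in the figure: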
Step 1: Fetch the HTML of the video page
/**
 * Fetch the video data: cid, page, part, aid
 *
 * @param string $bvid BV ID of the video
 * @return array
 * @throws \Exception
 */
public function getPageData($bvid)
{
    $url = "https://www.bilibili.com/video/" . $bvid;
    // HTTP request headers (copied from Chrome)
    $header = [
        "authority: www.bilibili.com",
        "cache-control: max-age=0",
        'sec-ch-ua: "Chromium";v="86", "\"Not\\A;Brand";v="99", "Google Chrome";v="86"',
        'sec-ch-ua-mobile: ?0',
        'upgrade-insecure-requests: 1',
        'user-agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36',
        'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site: same-origin',
        'sec-fetch-mode: navigate',
        'sec-fetch-user: ?1',
        'sec-fetch-dest: document',
        'accept-language: zh-CN,zh;q=0.9,en;q=0.8',
        'cookie: finger=1295565314; bsource=search_google; _uuid=B789D864-3818-3B60-C60C-40572299324578222infoc; buvid3=43FE4003-58BE-4DFF-9EC9-D2673FBE9672138377infoc; CURRENT_FNVAL=80; blackside_state=1; sid=j0093qgz; finger=1295565314; PVID=3; rpdid=|(umR~Yuk~)k0J\'uY|RRYu)~Y'
    ];
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the HTML instead of printing it
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header); // set the HTTP request headers
    curl_setopt($ch, CURLOPT_ENCODING, ''); // accept and decode compressed responses (avoids garbled output)
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0); // skip certificate verification (insecure; for local testing only)
    curl_setopt($ch, CURLOPT_SSL_VERIFYSTATUS, 0); // no OCSP status check (this is already cURL's default)
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $content = curl_exec($ch);
    curl_close($ch);
    // // save the page HTML to a file
    // $fp = fopen(app()->getRootPath().'view/pages.html','w');
    // fwrite($fp,$content);
    // fclose($fp);
    // extract the video data
    $pagesData = $this->logic->getPagesJsonData($content);
    if ($pagesData['code'] == 500) throw new \Exception($pagesData['msg']);
    return $pagesData['data'];
}
Uncomment the block that saves the page HTML to a file and run the program; the content returned by cURL can then be inspected in <project root>/view/pages.html. Searching pages.html for <script>window.__INITIAL_STATE__ shows that the data we want to match with regular expressions sits inside this <script> tag. getPagesJsonData() uses regular expressions to extract the video's aid and pages (each entry holds part, cid, and page). The code is as follows:
/**
 * Extract the data of every sub-video (part) from the page HTML
 *
 * @param string $content the page HTML (the content saved as <project root>/view/pages.html)
 * @return array
 */
public function getPagesJsonData(string $content)
{
    if (empty($content)) return ['code' => 500, 'msg' => 'Nothing to parse'];
    $pagesData = []; // holds the video data
    // match the aid
    preg_match('/={"aid":(\d+).*/', $content, $matchAid);
    if (empty($matchAid)) return ['code' => 500, 'msg' => 'No aid matched'];
    $pagesData['aid'] = $matchAid[1];
    // match the pages JSON array
    preg_match('/videoData.*pages\":(.*),\"subtitle\":/', $content, $matchPages);
    if (empty($matchPages)) return ['code' => 500, 'msg' => 'No pages matched'];
    $jsonToArray = json_decode($matchPages[1], true);
    if (!is_array($jsonToArray)) return ['code' => 500, 'msg' => 'Could not decode the pages JSON'];
    // copy the fields we need
    foreach ($jsonToArray as $k => $v) {
        $pagesData[$k]['cid'] = $v['cid'];
        $pagesData[$k]['page'] = $v['page'];
        $pagesData[$k]['part'] = $v['part'];
    }
    return ['code' => 200, 'data' => $pagesData];
}
Printing the $pagesData array gives the following result:
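To experiment with the extraction logic without requesting anything from Bilibili, the same two regular expressions can be run against a small hand-written fragment. The sketch below is only an illustration: the function name extractPages and the sample HTML (including its aid and cid values) are made up, and the real page embeds a much larger window.__INITIAL_STATE__ object whose layout may change at any time.

```php
<?php
// Hypothetical stand-alone version of the extraction done in getPagesJsonData().
function extractPages(string $html): array
{
    // aid: the first `={"aid":<digits>` occurrence in the page
    if (!preg_match('/={"aid":(\d+)/', $html, $m)) {
        throw new RuntimeException('aid not found');
    }
    $result = ['aid' => (int) $m[1], 'pages' => []];

    // pages: the JSON array between `"pages":` and `,"subtitle":`
    if (!preg_match('/videoData.*"pages":(.*),"subtitle":/', $html, $m)) {
        throw new RuntimeException('pages not found');
    }
    $pages = json_decode($m[1], true);
    if (!is_array($pages)) {
        throw new RuntimeException('pages is not valid JSON');
    }

    // Keep only the three fields the later steps need.
    foreach ($pages as $p) {
        $result['pages'][] = [
            'cid'  => $p['cid'],
            'page' => $p['page'],
            'part' => $p['part'],
        ];
    }
    return $result;
}

// Made-up sample shaped like the relevant part of the page (values are not real).
$sampleHtml = '<script>window.__INITIAL_STATE__={"aid":111,"videoData":{"bvid":"BV1Dk4y1q781",'
    . '"pages":[{"cid":201,"page":1,"part":"Intro"},{"cid":202,"page":2,"part":"Lesson 1"}],'
    . '"subtitle":{"list":[]}}};</script>';

$data = extractPages($sampleHtml);
echo $data['aid'], ' ', count($data['pages']), ' ', $data['pages'][1]['part'], PHP_EOL;
```

Unlike getPagesJsonData(), this sketch keeps the parts under a separate pages key instead of mixing them with aid in one array; that is a design choice of the example, not something the original code does.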
Step 2: Get subtitle_url
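Request the API with the cid, aid, and bvid values obtained in the previous step, then use a regular expression to match the value of subtitle_url, which is the link to the subtitle JSON file.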
/**
 * Get the links to the video's CC subtitle JSON files
 *
 * @param int    $cid  cid of the video part (from pages in step 1)
 * @param int    $aid  AV ID
 * @param string $bvid BV ID
 * @return array
 * @throws \Exception
 */
public function getSubtitleUrl(int $cid, int $aid, string $bvid)
{
    $url = 'https://api.bilibili.com/x/player.so?id=' . urlencode('cid:') . $cid . '&aid=' . $aid . "&bvid=" . $bvid;
    $header = [
        "authority: www.bilibili.com",
        'sec-ch-ua: "Chromium";v="86", "\"Not\\A;Brand";v="99", "Google Chrome";v="86"',
        'user-agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36',
        'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'origin: https://www.bilibili.com',
        'sec-fetch-site: same-origin',
        'sec-fetch-mode: cors',
        'sec-fetch-dest: empty',
        'referer: https://www.bilibili.com/',
        'accept-language: zh-CN,zh;q=0.9,en;q=0.8',
        'cookie: finger=1295565314; bsource=search_google; _uuid=B789D864-3818-3B60-C60C-40572299324578222infoc; buvid3=43FE4003-58BE-4DFF-9EC9-D2673FBE9672138377infoc; CURRENT_FNVAL=80; blackside_state=1; sid=j0093qgz; finger=1295565314; PVID=3; rpdid=|(umR~Yuk~)k0J\'uY|RRYu)~Y'
    ];
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0); // skip certificate verification (insecure; for local testing only)
    curl_setopt($ch, CURLOPT_SSL_VERIFYSTATUS, 0); // no OCSP status check (this is already cURL's default)
    $content = curl_exec($ch);
    curl_close($ch);
    // // save the subtitle response to a file
    // $fp = fopen(app()->getRootPath().'view/subtitle.html','w');
    // fwrite($fp,$content);
    // fclose($fp);
    $subtitleData = $this->logic->getSubtitleData($content);
    if ($subtitleData['code'] == 500) throw new \Exception($subtitleData['msg']);
    return $subtitleData['data'];
}
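The query string in getSubtitleUrl() is assembled by hand, with urlencode() applied only to the cid: prefix. As a possible alternative, PHP's http_build_query() can build and encode the whole query in one call. The helper name buildPlayerUrl below is hypothetical, and the cid and aid values are placeholders.

```php
<?php
// Hypothetical helper: builds the same URL as the concatenation in getSubtitleUrl().
function buildPlayerUrl(int $cid, int $aid, string $bvid): string
{
    $query = http_build_query([
        'id'   => 'cid:' . $cid, // the ':' is percent-encoded as %3A
        'aid'  => $aid,
        'bvid' => $bvid,
    ]);
    return 'https://api.bilibili.com/x/player.so?' . $query;
}

// The cid and aid values here are placeholders, not real IDs.
$playerUrl = buildPlayerUrl(201, 111, 'BV1Dk4y1q781');
echo $playerUrl, PHP_EOL;
```

One difference from the original concatenation is that http_build_query() also encodes the bvid value, which only matters if it ever contains characters that are not URL-safe.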
Uncomment the block that saves the subtitle response to a file and run the program; the content returned by cURL can then be inspected in <project root>/view/subtitle.html. In subtitle.html, the data we want to match with regular expressions sits inside the <subtitle></subtitle> tag, and the value of subtitle_url is the address of the subtitle JSON file.
{
    "allow_submit": false,
    "lan": "",
    "lan_doc": "",
    "subtitles": [
        {
            "id": 31982954292445190,
            "lan": "en-US",
            "lan_doc": "英语(美国)",
            "is_lock": false,
            "author_mid": 483301783,
            "subtitle_url": "//i0.hdslb.com/bfs/subtitle/1cc78982172c6892257eb955d3feef80f2d1560c.json"
        },
        {
            "id": 31982964274364420,
            "lan": "zh-CN",
            "lan_doc": "中文(中国)",
            "is_lock": false,
            "author_mid": 483301783,
            "subtitle_url": "//i0.hdslb.com/bfs/subtitle/fc03562711af08687398775a4423dc71028e1203.json"
        }
    ]
}
getSubtitleData() uses a regular expression to extract subtitle_url and lan. The code is as follows:
/**
 * Get the JSON links of the Chinese and English CC subtitles
 *
 * @param string $content the response body (the content saved as <project root>/view/subtitle.html)
 * @return array
 */
public function getSubtitleData(string $content)
{
    if (empty($content)) return ['code' => 500, 'msg' => 'Nothing to parse'];
    $subtitleData = []; // holds the JSON links of the CC subtitles
    // match the subtitles array
    preg_match('/subtitles":(.*?)}<\/subtitle>/', $content, $matchSubtitle);
    if (empty($matchSubtitle)) return ['code' => 500, 'msg' => 'No subtitle matched'];
    $jsonToArray = json_decode($matchSubtitle[1], true);
    if (!is_array($jsonToArray)) return ['code' => 500, 'msg' => 'Could not decode the subtitles JSON'];
    // map each language code to its subtitle_url
    foreach ($jsonToArray as $k => $v) {
        $subtitleData[$v['lan']] = $v['subtitle_url'];
    }
    return ['code' => 200, 'data' => $subtitleData];
}
Printing the $subtitleData array gives the following result:
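The mapping from language code to subtitle link can also be tried offline. The following sketch is an assumption-laden variant rather than the project's code: the function name extractSubtitleUrls is made up, it decodes the whole JSON object inside the <subtitle> tag instead of matching only the subtitles array, and it adds the https: scheme right away. The sample string is trimmed from the response shown above.

```php
<?php
// Hypothetical stand-alone variant of getSubtitleData(): language code => full URL.
function extractSubtitleUrls(string $response): array
{
    // Capture everything inside <subtitle>...</subtitle>; the s flag lets . match newlines.
    if (!preg_match('/<subtitle>(.*?)<\/subtitle>/s', $response, $m)) {
        throw new RuntimeException('no <subtitle> element found');
    }
    $info = json_decode($m[1], true);
    if (!is_array($info) || !isset($info['subtitles'])) {
        throw new RuntimeException('subtitle JSON could not be decoded');
    }

    $urls = [];
    foreach ($info['subtitles'] as $item) {
        // subtitle_url is protocol-relative ("//i0.hdslb.com/..."), so prefix a scheme.
        $urls[$item['lan']] = 'https:' . $item['subtitle_url'];
    }
    return $urls;
}

// Sample trimmed from the response shown above (only the fields used here are kept).
$sample = '<subtitle>{"allow_submit":false,"lan":"","subtitles":['
    . '{"id":31982954292445190,"lan":"en-US","subtitle_url":"//i0.hdslb.com/bfs/subtitle/1cc78982172c6892257eb955d3feef80f2d1560c.json"},'
    . '{"id":31982964274364420,"lan":"zh-CN","subtitle_url":"//i0.hdslb.com/bfs/subtitle/fc03562711af08687398775a4423dc71028e1203.json"}'
    . ']}</subtitle>';

$urls = extractSubtitleUrls($sample);
foreach ($urls as $lan => $url) {
    echo $lan, ' => ', $url, PHP_EOL;
}
```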
Step 3: Fetch the subtitle JSON file
/**
 * Read the subtitle text from the JSON file and write it to a file
 *
 * @param string $jsonUrl  link to the subtitle JSON file
 * @param string $part     subtitle title
 * @param int    $preTitle prefix for the subtitle title
 * @throws \Exception
 */
public function getSubtitleData($jsonUrl, $part, $preTitle)
{
    $url = 'https:' . $jsonUrl;
    $header = [
        'sec-ch-ua: "Chromium";v="86", "\"Not\\A;Brand";v="99", "Google Chrome";v="86"',
        'Accept: application/json, text/javascript, */*; q=0.01',
        'Referer: https://www.bilibili.com/',
        'sec-ch-ua-mobile: ?0',
        'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36',
    ];
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0); // skip certificate verification (insecure; for local testing only)
    curl_setopt($ch, CURLOPT_SSL_VERIFYSTATUS, 0); // no OCSP status check (this is already cURL's default)
    $content = curl_exec($ch);
    curl_close($ch);
    // write the subtitles to a file
    $fileData = $this->logic->writeSubtitleToFile($content, $part, $this->start, $this->end, $preTitle);
    if ($fileData['code'] == 500) throw new \Exception($fileData['msg']);
    return $fileData['data'];
}
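The three request methods above repeat almost the same cURL setup. One way to reduce the duplication is a small shared helper. The sketch below is not part of the project: the names curlGetOptions and httpGet and the 15-second timeout are assumptions, TLS verification is deliberately left at cURL's defaults (enabled), and a failed transfer raises an exception instead of passing false on to the parser.

```php
<?php
// Hypothetical helper: the cURL options shared by the three requests.
function curlGetOptions(string $url, array $headers): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_HTTPHEADER     => $headers, // e.g. the $header arrays used above
        CURLOPT_ENCODING       => '',       // accept and decode compressed responses
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_TIMEOUT        => 15,       // seconds; an arbitrary choice for this sketch
    ];
}

// Hypothetical helper: performs the GET request and returns the response body.
function httpGet(string $url, array $headers = []): string
{
    $ch = curl_init();
    curl_setopt_array($ch, curlGetOptions($url, $headers));

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_exec() returns false on failure; surface the reason to the caller.
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }

    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        throw new RuntimeException("Unexpected HTTP status $status for $url");
    }
    return $body;
}

// Usage (performs a real network request, so it is left commented out):
// $html = httpGet('https://www.bilibili.com/video/BV1Dk4y1q781', $header);
```

With a helper like this, each of the three methods would only need to build its URL and header list and then hand the returned string to the corresponding logic method.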
Requesting the address of the subtitle JSON file returns content like the following:
{
    "font_size": 0.4,
    "font_color": "#FFFFFF",
    "background_alpha": 0.5,
    "background_color": "#9C27B0",
    "Stroke": "none",
    "body": [
        {
            "from": 0,
            "to": 3.41,
            "location": 2,
            "content": "Today we have something a little different." // the subtitle text
        },
        {
            "from": 3.41,
            "to": 7.41,
            "location": 2,
            "content": "Dr. Jane Goodall is going to tell you a story." // the subtitle text
        },
        // remaining entries omitted
    ]
}
From the JSON returned above, take the content value of each subtitle entry, concatenate the values, and write the result to a file. The code is as follows:
/**
 * Write the subtitles to a file
 *
 * @param string $content  subtitle JSON data
 * @param string $part     subtitle title
 * @param int    $start    p value of the first video to crawl
 * @param int    $end      p value of the last video to crawl
 * @param int    $preTitle prefix for the subtitle title
 * @return array
 */
public function writeSubtitleToFile(string $content, string $part, int $start, int $end, int $preTitle)
{
    $start += 1;
    $end += 1;
    $jsonToArray = json_decode($content, true);
    // the subtitles are in body
    $bodyData = $jsonToArray['body'] ?? [];
    if (empty($bodyData)) return ['code' => 500, 'msg' => 'No subtitles found'];
    // concatenate the subtitles
    $ccString = "\n\n\n\nP$preTitle. " . $part . "\n";
    foreach ($bodyData as $k => $v) {
        $ccString .= $v['content'];
    }
    // write to the file
    $ccString = str_replace('  ', ' ', $ccString); // collapse double spaces
    $filePath = app()->getRootPath() . "view/files/ClosedCaption_P$start-P$end.txt";
    $fp = fopen($filePath, 'a');
    if ($fp === false) return ['code' => 500, 'msg' => 'Failed to open the subtitle file'];
    $fileData = fwrite($fp, $ccString);
    fclose($fp);
    if ($fileData === false) return ['code' => 500, 'msg' => 'Failed to write the subtitles to the file'];
    return ['code' => 200, 'data' => $fileData];
}
The generated files are placed under <project root>/view/files/, as shown in the figure:
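The concatenation step can be checked on its own with the two sample entries from the JSON above. This is a minimal sketch under stated assumptions: the function name subtitleBodyToText is hypothetical, and unlike writeSubtitleToFile(), which appends each content value directly, it joins the captions with a single space so that adjacent sentences do not run together.

```php
<?php
// Hypothetical helper: turns the subtitle JSON into one block of plain text.
function subtitleBodyToText(string $json): string
{
    $data = json_decode($json, true);
    $body = is_array($data) ? ($data['body'] ?? []) : [];

    // Take the "content" field of every entry, in order.
    $captions = array_column($body, 'content');

    // Join with single spaces so adjacent captions stay separated.
    return implode(' ', array_map('trim', $captions));
}

// The two entries shown in the sample response above.
$json = '{"body":['
    . '{"from":0,"to":3.41,"location":2,"content":"Today we have something a little different."},'
    . '{"from":3.41,"to":7.41,"location":2,"content":"Dr. Jane Goodall is going to tell you a story."}'
    . ']}';

$text = subtitleBodyToText($json);
echo $text, PHP_EOL;
```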
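Conclusion
Due to time constraints, the code has not been tested extensively; if you find a problem, please raise it in the comments.
This work is licensed under a CC license; reposts must credit the author and link to this article.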