My First Python Crawler: Scraping Novels from 147

xingkong12138's personal blog / 97 / 1 / Created 3 years ago / Updated 3 years ago

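I recently started learning Python, so I wrote a Python crawler that scrapes novels from the 147 novel site.

Feel free to visit my blog: My Blog

I have only just started learning Python, so if you spot any shortcomings, please point them out.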

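Analyzing the 147 page structure

Open the page in Google Chrome and press F12 to open the developer tools.

Each entry in the chapter list is wrapped in <dd></dd>.

The chapter title is wrapped in an <h1> tag inside <div class="bookname"></div>.

The chapter text is wrapped in <p> tags inside <div id="content"></div>.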
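To make the three observations above concrete, here is a minimal offline sketch that runs the same kind of regular expressions against small hand-written HTML fragments. The fragments, the paths in them, and the helper names `parse_chapter_links` and `parse_chapter` are assumptions made up for illustration; the real pages may contain extra attributes or whitespace that these patterns would need to tolerate.

```python
import re

# Hypothetical fragments that mimic the structure described above;
# they are not copied from the real site.
SAMPLE_INDEX = """
<dl>
<dd><a href="/book/1/1.html">Chapter 1</a></dd>
<dd><a href="/book/1/2.html">Chapter 2</a></dd>
</dl>
"""

SAMPLE_CHAPTER = """
<div class="bookname"><h1>Chapter 1</h1></div>
<div id="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
"""


def parse_chapter_links(html):
    """Return a list of (href, title) tuples, one per <dd> entry."""
    pattern = r'<dd>\s*<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>\s*</dd>'
    return re.findall(pattern, html, re.S)


def parse_chapter(html):
    """Return (title, paragraphs) for a chapter page."""
    title_match = re.search(r'<div class="bookname">.*?<h1>(.*?)</h1>', html, re.S)
    content_match = re.search(r'<div id="content">(.*?)</div>', html, re.S)
    title = title_match.group(1).strip() if title_match else ""
    paragraphs = []
    if content_match:
        paragraphs = re.findall(r'<p>(.*?)</p>', content_match.group(1), re.S)
    return title, paragraphs


links = parse_chapter_links(SAMPLE_INDEX)
title, paragraphs = parse_chapter(SAMPLE_CHAPTER)
print(links)
print(title, paragraphs)
```

Using `re.search` and checking for `None` avoids the `IndexError` that indexing an empty `re.findall` result with `[0]` would raise when a page does not match the expected layout.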

Without further ado, here is the code.

# Scrape a novel from the 147 novel site
# -*- coding: utf-8 -*-
import requests
import re
import random
import time

# Fetch one chapter page and return its title and text
def GetChapterContent(url):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
               'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3'}
    # Cookies copied from a browser session, passed once via the cookies argument
    cookies = dict(Hm_lpvt_f9e74ced1e1a12f9e31d3af8376b6d63="1588758919",Hm_lvt_f9e74ced1e1a12f9e31d3af8376b6d63="1588752082")
    # A timeout keeps a stalled connection from hanging the script
    res = requests.get(url, headers=headers, cookies=cookies, timeout=10)
    res.raise_for_status()
    res.encoding = 'utf-8'
    content_html = res.text
    # Extract the title from the <h1> inside <div class="bookname">
    title_div = re.findall(r'<div class="bookname">([\s\S]*?)</div>', content_html)[0]
    title = re.findall(r'<h1>(.*?)</h1>', title_div, re.S)[0]
    # Extract the paragraphs from the <p> tags inside <div id="content">
    content_div = re.findall(r'<div id="content">([\s\S]*?)</div>', content_html)[0]
    contents = re.findall(r'<p>(.*?)</p>', content_div, re.S)
    # Join the title and the paragraphs, one per line
    content = "\n".join([title] + contents) + "\n"

    # Return the assembled chapter text
    return content





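# Fetch the book's index page and return a list of [chapter URL, chapter title]
def GetChapterList(url):
    res = requests.get(url, timeout=10)
    res.encoding = 'utf-8'
    chapter_html = res.text

    # Extract the chapter list block
    chapter_list_div = re.findall(r'<dl>([\s\S]*?)</dl>', chapter_html)[0]

    # Extract each chapter entry, then its link and title
    chapter_list_dd = re.findall(r'<dd>(.*?)</dd>', chapter_list_div, re.S)
    chapter_url_info = []
    for info in chapter_list_dd:
        matches = re.findall(r'href="(.*?)">(.*?)<', info)
        if not matches:
            continue  # Skip <dd> entries that contain no link
        chapter_list_info = matches[0]
        # The hrefs are site-relative, so prepend the site root
        chapter_url = "http://www.147xs.org" + chapter_list_info[0]
        chapter_url_info.append([chapter_url, chapter_list_info[1]])
    return chapter_url_info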

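url = "http://www.147xs.org/book/13794/"

chapter_urls = GetChapterList(url)

for chapter_url, chapter_title in chapter_urls:
    # Fetch the chapter and append it to the file; on any error, report it and move on
    try:
        content = GetChapterContent(chapter_url)
        with open("./xiaoshuo.txt", "a", encoding="UTF-8") as f:
            f.write(content)
        print("Chapter: {} scraped successfully".format(chapter_title))
    except Exception:
        print("Chapter: {} failed".format(chapter_title))
    time.sleep(random.random())  # Pause for a random interval in [0, 1) seconds

print("Scraping finished")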
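Network requests can fail intermittently, so one possible refinement is to retry a failed fetch a few times before giving up on a chapter. The following is only a sketch: `fetch_with_retry` and its parameters are hypothetical names introduced here, not part of the original script, and the fake `flaky_fetch` function stands in for a real request so the example runs without network access.

```python
import random
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url); on an exception, wait and try again up to `retries` times.

    The wait grows with each attempt and includes a random component,
    in the same spirit as the random pause used between chapters.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                sleep(base_delay * attempt + random.random())
    raise last_error


# Demo with a fake fetch function (an assumption for illustration)
# that fails twice and then succeeds.
attempts = []


def flaky_fetch(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise RuntimeError("temporary failure")
    return "content of " + url


# sleep is replaced with a no-op so the demo finishes immediately
result = fetch_with_retry(flaky_fetch, "chapter-1", sleep=lambda seconds: None)
print(result, "after", len(attempts), "attempts")
```

In the crawler, the call would look like `fetch_with_retry(GetChapterContent, chapter_url)`, leaving the rest of the loop unchanged.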

This work is licensed under a CC license; reposts must credit the author and link to this article.
