My first Python crawler: scraping novels from 147xs

xingkong12138's personal blog / created 5 years ago / updated 5 years ago

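I recently started learning Python, so I wrote a Python crawler that scrapes novels from the 147xs site.

You are welcome to visit my blog.

I am new to Python, so please point out anything that could be improved.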

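Analyzing the 147xs page structure

Open the site in Google Chrome and press F12 to open the developer tools.

Each entry in the chapter list is wrapped in <dd></dd>.

The chapter title is wrapped in an h1 tag inside <div class="bookname"></div>.

The chapter text is wrapped in p tags inside <div id="content"></div>.

Enough talk, here is the code.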
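To make the structure above concrete, here is a minimal sketch that runs the same kind of regular expressions against two small hand-written HTML snippets. The snippets, the helper names (extract_title, extract_paragraphs, extract_chapter_links) and the chapter titles are made up for illustration; the real 147xs markup may differ in detail.

```python
import re

# Hypothetical chapter page that mirrors the structure described above.
SAMPLE_CHAPTER_HTML = """
<div class="bookname">
  <h1>Chapter 1 The Beginning</h1>
</div>
<div id="content">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""

# Hypothetical chapter index: one <dd> per chapter inside a <dl>.
SAMPLE_INDEX_HTML = """
<dl>
<dd><a href="/book/13794/1.html">Chapter 1</a></dd>
<dd><a href="/book/13794/2.html">Chapter 2</a></dd>
</dl>
"""


def extract_title(html):
    """Return the text of the <h1> inside <div class="bookname">, or None."""
    block = re.search(r'<div class="bookname">(.*?)</div>', html, re.S)
    if block is None:
        return None
    h1 = re.search(r'<h1>(.*?)</h1>', block.group(1), re.S)
    return h1.group(1).strip() if h1 else None


def extract_paragraphs(html):
    """Return the text of every <p> inside <div id="content">."""
    block = re.search(r'<div id="content">(.*?)</div>', html, re.S)
    if block is None:
        return []
    return [p.strip() for p in re.findall(r'<p>(.*?)</p>', block.group(1), re.S)]


def extract_chapter_links(html):
    """Return (href, title) pairs, one per <dd> entry that contains a link."""
    pattern = r'<dd>\s*<a[^>]*href="(.*?)"[^>]*>(.*?)</a>\s*</dd>'
    return re.findall(pattern, html, re.S)


print(extract_title(SAMPLE_CHAPTER_HTML))
print(extract_paragraphs(SAMPLE_CHAPTER_HTML))
print(extract_chapter_links(SAMPLE_INDEX_HTML))
```

Returning None or an empty list when a block is missing avoids the IndexError that indexing an empty findall result with [0] would raise.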

# Scrape a novel from the 147xs website
# -*- coding: utf-8 -*-
import requests
import re
import random
import time

# Fetch one chapter page and return its title and text as a single string
def GetChapterContent(url):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
               'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3'}
    # Cookies copied from a browser session; passed once here instead of
    # also being duplicated in a 'Cookie' header
    cookies = dict(Hm_lpvt_f9e74ced1e1a12f9e31d3af8376b6d63="1588758919",Hm_lvt_f9e74ced1e1a12f9e31d3af8376b6d63="1588752082")
    # A timeout keeps a stalled connection from hanging the whole run
    res = requests.get(url, headers=headers, cookies=cookies, timeout=10)
    res.raise_for_status()
    res.encoding = 'utf-8'
    content_html = res.text
    # Extract the title: the <h1> inside <div class="bookname">
    title_div = re.findall(r'<div class="bookname">([\s\S]*?)</div>', content_html)[0]
    title = re.findall(r'<h1>(.*?)</h1>', title_div, re.S)[0]
    # Extract the body: every <p> inside <div id="content">
    content_div = re.findall(r'<div id="content">([\s\S]*?)</div>', content_html)[0]
    contents = re.findall(r'<p>(.*?)</p>', content_div, re.S)
    # Join the title and the paragraphs, one per line
    content = title + "\n"
    for i in contents:
        content += i + "\n"

    # Return the assembled chapter text
    return content





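# Fetch the book's index page and return a list of [chapter_url, chapter_title]
def GetChapterList(url):
    res = requests.get(url, timeout=10)
    res.encoding = 'utf-8'
    chapter_html = res.text

    # Extract the chapter list block
    chapter_list_div = re.findall(r'<dl>([\s\S]*?)</dl>',chapter_html)[0]

    # Extract each chapter entry together with its link
    chapter_list_dd = re.findall(r'<dd>(.*?)</dd>',chapter_list_div)
    chapter_url_info = []
    for info in chapter_list_dd:
        matches = re.findall(r'href="(.*?)">(.*?)<',info)
        if not matches:
            # Skip <dd> entries that do not contain a link
            continue
        chapter_list_info = matches[0]
        # The hrefs are site-relative, so prepend the site root
        chapter_url = "http://www.147xs.org" + chapter_list_info[0]
        chapter_url_info.append([chapter_url,chapter_list_info[1]])
    return chapter_url_info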

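url = "http://www.147xs.org/book/13794/"

chapter_urls = GetChapterList(url)

for chapter_url, chapter_title in chapter_urls:
    try:
        # Fetch inside the try block so one bad chapter does not stop the run
        content = GetChapterContent(chapter_url)
        # Append the chapter to the output file
        with open("./xiaoshuo.txt", "a", encoding="UTF-8") as f:
            f.write(content)
        print("Chapter {}: scraped successfully".format(chapter_title))
    except Exception as e:
        print("Chapter {}: failed ({})".format(chapter_title, e))
    time.sleep(random.random())  # pause for a random interval in [0, 1) seconds

print("Scraping finished")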
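The loop above reports a failed chapter and moves on. One possible extension, not part of the original script, is to retry a failed fetch a few times with a growing delay. The sketch below assumes a hypothetical helper named fetch_with_retry; the fetch function and the sleep function are passed in as arguments so the helper can be tried without touching the network.

```python
import random
import time


def fetch_with_retry(fetch, url, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url); if it raises, wait and try again, up to `attempts` calls.

    `fetch` is any callable taking a URL (for example GetChapterContent).
    The last exception is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff plus a little random jitter
                sleep(base_delay * (2 ** attempt) + random.random() * base_delay)
    raise last_error
```

In the main loop, content = fetch_with_retry(GetChapterContent, chapter_url) would be one way to use it; the delay values are arbitrary starting points, not tuned for this site.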

This work is released under a CC license; reposts must credit the author and link to this article.


Comments: 1

fd5556

I think Selenium is also a great toolkit, and it is less likely to be flagged by the site as a malicious attack.

Commented 5 years ago
