Scraping a novel site: the code raises no errors, but the downloaded file is empty. Could someone experienced please point out what's wrong?

```python
import requests
from lxml import etree
url = "https://www.doupo321.com/yijianduzun/"  # 小说网址 斗破小说网
re = requests.get(url)  # 访问小说网站,发送一个get请求
re.encoding = "utf-8"
html = etree.HTML(re.text)
urs = html.xpath("/html/body/div[1]/div[2]/div[1]/div[3]/div[2]/ul//@href")
shu_name = html.xpath(
    "/html/body/div[1]/div[2]/div[1]/div[1]/div[2]/h1/text()")[0]
Y = 0
print(f"{shu_name}开始下载,共{len(urs)}章")
for i in urs:
    urls1 = url + i
    re1 = requests.get(urls1)  # re1: the chapter page
    re1.encoding = "utf-8"
    html1 = etree.HTML(re1.text)
    内容 = html1.xpath(
          "/html/body/div[1]/div[1]/div[4]//text()")
    neir = ''
    for x in 内容:
        neir = neir + str(x) + "\n"   # str(x) converts x to a string; "\n" separates each line when printing multiple lines
    with open(shu_name + ".txt", "a", encoding="utf-8") as f:  # append the content to "[book name].txt"
        f.write(neir)
    Y = Y + 1
    print(f"第{Y}章下载完成")
    if Y == 10:  
        exit()
```

After running the program, the downloaded copy of 一剑独尊 is empty. print(urs) and print(shu_name) both return values, but print(内容) is empty, so I strongly suspect the absolute path in `内容 = html1.xpath("/html/body/div[1]/div[1]/div[4]//text()")` is wrong. Looking at the novel page's HTML, could someone who understands absolute XPath paths please point me in the right direction?
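
One way to narrow this down before blaming the XPath is to print what each chapter request actually fetches. Below is a minimal, self-contained check that builds the chapter URLs the same way as the script above; the `[:3]` slice is only there to keep the output short:

```python
import requests
from lxml import etree

url = "https://www.doupo321.com/yijianduzun/"
re = requests.get(url)
re.encoding = "utf-8"
html = etree.HTML(re.text)
urs = html.xpath("/html/body/div[1]/div[2]/div[1]/div[3]/div[2]/ul//@href")

# Build the first few chapter URLs exactly like the loop above and print
# the URL, HTTP status and response size, to see whether each chapter
# page is really what we expect.
for i in urs[:3]:
    urls1 = url + i
    re1 = requests.get(urls1)
    print(urls1, re1.status_code, len(re1.text))
```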

Jason990420
Best Answer

The URL built for each chapter is wrong; revised as follows.

```python
url_base = "https://www.doupo321.com"
urls1 = url_base + i
```
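
An equivalent, slightly more robust option is `urllib.parse.urljoin`, which resolves both absolute and relative hrefs against the index URL; a small sketch (the `1.html` filename is only a placeholder, not a real chapter name):

```python
from urllib.parse import urljoin

url = "https://www.doupo321.com/yijianduzun/"
# An absolute href ("/yijianduzun/1.html") and a relative one ("1.html")
# both resolve to the same chapter URL, so no manual prefixing is needed.
print(urljoin(url, "/yijianduzun/1.html"))  # https://www.doupo321.com/yijianduzun/1.html
print(urljoin(url, "1.html"))               # https://www.doupo321.com/yijianduzun/1.html
```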


Demo Code

```python
import requests
from lxml import etree


url = "https://www.doupo321.com/yijianduzun/"  # 小说网址 斗破小说网
url_base = "https://www.doupo321.com"

re = requests.get(url)  # send a GET request to the novel site
re.encoding = "utf-8"
html = etree.HTML(re.text)
urs = html.xpath("/html/body/div[1]/div[2]/div[1]/div[3]/div[2]/ul//@href")
shu_name = html.xpath("/html/body/div[1]/div[2]/div[1]/div[1]/div[2]/h1/text()")[0]
Y = 0
print(f"{shu_name}开始下载,共{len(urs)}章")


for i in urs:
    urls1 = url_base + i
    re1 = requests.get(urls1)  # re1: the chapter page
    re1.encoding = "utf-8"
    html1 = etree.HTML(re1.text)
    内容 = html1.xpath("/html/body/div[1]/div[1]/div[4]//text()")
    neir = ''
    for x in 内容:
        neir = neir + str(x) + "\n"   # str(x) converts x to a string; "\n" separates each line when printing multiple lines
    """
    with open(shu_name + ".txt", "a", encoding="utf-8") as f:  # append the content to "[book name].txt"
        f.write(neir)
    """
    print(neir)
    Y = Y + 1
    print(f"第{Y}章下载完成")
    if Y == 1:
        exit()
```

Output:

```text
一剑独尊开始下载,共2842章
上一章
返回目录
下一章

zj_wap2();

笨蛋只需一秒记住斗破小说网,
www.doupo321.com
,如果被/浏览器/转码,阅读体验极差请退出/转码/阅读。

 青城,叶家,祖祠。
...
 大长老冷声道:“这是我们众长老一致的决定。”

第1章下载完成
```
Replies: 10

I'd suggest not hitting the URLs directly with plain requests; novel sites usually have anti-scraping measures. It's better to crawl with a framework like Playwright, which happens to have a Python version.
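
For reference, a minimal Playwright sketch that fetches the same index page through a real browser engine (it needs `pip install playwright` plus `playwright install chromium` first; the `//h1` check at the end is only an illustration, and selectors would still have to match the site's actual markup):

```python
from playwright.sync_api import sync_playwright
from lxml import etree

url = "https://www.doupo321.com/yijianduzun/"  # same index page as above

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(url)
    html_text = page.content()  # fully rendered HTML, after any JavaScript has run
    browser.close()

# The rendered HTML can be parsed with lxml exactly as before.
html = etree.HTML(html_text)
print(html.xpath("//h1/text()"))
```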

```python
url = "https://www.doupo321.com/yijianduzun"
urls1 = url + i.replace("/yijianduzun", "")
内容 = html1.xpath('/html/body/div[1]/div/div[4]/p/text()')
```
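
Put in context, those three changes slot into the original loop roughly like this (only the first chapter is fetched; the `/p/text()` path is taken from the lines above and would still need to match the page's real markup, and the `strip()` at the end is an added assumption for tidier output):

```python
import requests
from lxml import etree

url = "https://www.doupo321.com/yijianduzun"
re = requests.get(url)
re.encoding = "utf-8"
html = etree.HTML(re.text)
urs = html.xpath("/html/body/div[1]/div[2]/div[1]/div[3]/div[2]/ul//@href")

# Hrefs already start with "/yijianduzun", so strip that prefix before
# appending to the book URL, then keep only the <p> text so navigation
# links and script residue are skipped.
i = urs[0]
urls1 = url + i.replace("/yijianduzun", "")
re1 = requests.get(urls1)
re1.encoding = "utf-8"
html1 = etree.HTML(re1.text)
内容 = html1.xpath('/html/body/div[1]/div/div[4]/p/text()')
print("\n".join(x.strip() for x in 内容))
```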

Max_SiChuan

I've just started learning Python scraping, so I typed this out following the thread and then optimized it a bit.

```python
# Import the modules we need
import os                         # working with files and directories
import requests                   # sending HTTP requests
from lxml import etree            # parsing HTML and XML documents
from pypinyin import lazy_pinyin  # a Python library that converts Chinese characters to pinyin

shu_name = input("请输入小说名:")
# Convert the Chinese novel title to pinyin and join it into one string
shu_pinyin = ''.join(lazy_pinyin(shu_name))
print(shu_pinyin)
# Set the target novel's URL and the base URL
url_base = "https://www.doupo321.com/"
url = url_base + shu_pinyin + "/"
# Fetch the novel's table-of-contents page with requests and parse it with XPath
response = requests.get(url)  # send a GET request for the index page
response.encoding = "utf-8"   # set the encoding to utf-8
html = etree.HTML(response.text)  # parse the page into an lxml.etree._Element object

# Get every chapter link from the table of contents, plus the novel title
urs = html.xpath("/html/body/div[1]/div[2]/div[1]/div[3]/div[2]/ul//@href")
shu_name = html.xpath("/html/body/div[1]/div[2]/div[1]/div[1]/div[2]/h1/text()")[0]

# Initialize the counter Y and announce the download with the total chapter count
Y = 0
print(f"{shu_name}开始下载,共{len(urs)}章")

# Create a folder to store the novel
if not os.path.exists(shu_name):
    os.mkdir(shu_name)

# Loop over every chapter and download its text into its own file
for i in urs:
    urls1 = url_base + i  # build the full chapter URL
    response1 = requests.get(urls1)  # GET each chapter page with requests
    response1.encoding = "utf-8"  # set the encoding to utf-8
    html1 = etree.HTML(response1.text)  # parse the chapter page into an etree object

    # Extract the chapter text with XPath
    content = html1.xpath("/html/body/div[1]/div[1]/div[4]//text()")
    neir = ''
    for x in content:
        neir = neir + str(x) + "\n"

    # Write each chapter's text to its own txt file
    with open(os.path.join(shu_name, f"{Y + 1}.txt"), "w", encoding="utf-8") as f:
        f.write(neir)

    # Increment the counter and report the finished chapter
    Y = Y + 1
    print(f"第{Y}章下载完成")

# Announce that the whole novel has finished downloading
print(f"{shu_name}下载完成!")
```
