How do I scrape a novel with Python?

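How can I scrape the full text of this novel? I don't fully understand how to do it, and in particular I'm not sure how the "for i" loop should work. The novel's URL is www.favzoom.com/wushibuxiu/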
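import time
from urllib.parse import urljoin

import requests
from lxml import etree

# requests needs a full URL including the scheme.
# https is assumed here because the original call passed verify=False.
url = 'https://www.favzoom.com/index/wushibuxiu/'
head = {
    'Referer': url,
    # The header name is "User-Agent", not "users-agent".
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.39',
}
response = requests.get(url, headers=head, verify=False)

print(response.text)

html = etree.HTML(response.text)

# xpath() returns a list of elements; [0] takes the first one and .text reads its text.
novel_name = html.xpath('/html/body/div[1]/div/div[2]/div/h1')[0].text

print(novel_name)

# Collect the href of every chapter link in the directory block.
# Assumption: the chapters are <a> elements inside this div.
novel_directory = html.xpath('/html/body/div[2]/div[1]//a/@href')

print(novel_directory)

for href in novel_directory:
    # Each href points at one chapter page (e.g. /wushibuxiu/143863.html);
    # urljoin turns it into an absolute URL.
    com_url = urljoin(url, href)

    # print(com_url)

    response2 = requests.get(com_url, headers=head, verify=False)
    html2 = etree.HTML(response2.text)
    novel_chapter = html2.xpath('//*[@id="ss-reader-main"]/div[2]/h1')[0].text.strip()

    # print(novel_chapter)

    # //text() returns the text nodes inside the element, which can be joined as strings.
    novel_content = '\n'.join(t.strip() for t in html2.xpath('//*[@id="article"]//text()') if t.strip())

    # print(novel_content)

    # 'w' clears the file's previous content on every write; 'a' appends without overwriting.
    # The with block closes the file on exit, so no explicit close() is needed.
    with open(r"D:\浏览器下载\小说" + novel_chapter + ".txt", "w", encoding="utf-8") as file:
        file.write(novel_chapter + '\n' + novel_content + '\n')
    print("Downloaded: " + novel_chapter)

    # Requests sent too quickly tend to fail, so pause between chapters.
    time.sleep(5)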

Replies: 2

Please at least format the code.

1 month ago

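I took a quick look. You don't need a nested loop over the table of contents to fetch the content. The rough idea: start from the first chapter's page, and after fetching its content, check for a next-page link. If there is one, fetch that page's content; if there isn't, the whole novel has been scraped. That way a single while loop does the job.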
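A minimal sketch of that single-loop idea, under stated assumptions. The function names (parse_chapter, crawl_by_next_link), the `fetch` and `parse` parameters, and above all the XPath selectors are illustrative only; the real page structure of the site has not been checked, so the selectors will need adjusting.

```python
import time
from urllib.parse import urljoin


def parse_chapter(page_html, page_url):
    """Return (title, body, next_url) for one chapter page.

    The selectors below are assumptions, not taken from the real site.
    """
    from lxml import etree  # third-party; same parser the question uses

    doc = etree.HTML(page_html)
    title = "".join(doc.xpath("//h1//text()")).strip()
    lines = (t.strip() for t in doc.xpath('//*[@id="article"]//text()'))
    body = "\n".join(line for line in lines if line)
    # Hypothetical: the next-page link is an <a> whose text contains "下一".
    hrefs = doc.xpath('//a[contains(., "下一")]/@href')
    next_url = urljoin(page_url, hrefs[0]) if hrefs else None
    return title, body, next_url


def crawl_by_next_link(start_url, fetch, parse=parse_chapter,
                       delay=0.0, max_pages=5000):
    """Follow next-page links from start_url until there are none left.

    fetch(url) must return the page HTML as a string.
    """
    chapters = []
    seen = set()  # guards against a "next" link that loops back
    url = start_url
    while url and url not in seen and len(chapters) < max_pages:
        seen.add(url)
        title, body, next_url = parse(fetch(url), url)
        chapters.append((title, body))
        url = next_url
        if url and delay:
            time.sleep(delay)  # be polite to the server between requests
    return chapters
```

With requests, `fetch` could be something like `lambda u: requests.get(u, headers=head, timeout=10).text`, and each returned (title, body) pair can then be written to its own file. Some sites make the last chapter's "next" link point back at the directory page; if that is the case here, the loop needs an extra stop condition for that URL.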


1 month ago
