Thread pool scrapes the wrong number of pages?

from concurrent.futures import ThreadPoolExecutor
import csv
from bs4 import BeautifulSoup
import requests


f = open('data.csv',mode='w')
csvwriter = csv.writer(f)

def download_one_page(url):
    resp = requests.get(url)
    res = BeautifulSoup(resp.text, 'html.parser')
    div = res.find('div', class_="conter_con")
    ulss = div.find_all('ul')
    for ull in ulss:
        title = ull.find('p', class_="title").text
        img_url = ull.find('img').get('src')
        data = []
        data.append(title)
        data.append(img_url)
        csvwriter.writerow(data)
    print(url, 'page done')


if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=10) as t:
        for i in range(1,16):
            t.submit(download_one_page,f"http://www.xinfadi.com.cn/newsCenter.html?current={i}")

print('ok')

Could someone explain why I can only scrape 4 pages of data?

Replies: 4

It looks like the other pages failed to scrape. Check each task's result to see whether it actually succeeded.
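One way to do that check (a hedged sketch, not the original code; `maybe_fail` is a made-up stand-in for the page-download function): keep the `Future` objects and inspect `exception()`, which returns `None` when the task finished without raising. Note that an exception raised inside a submitted task is silently stored on the `Future` and never shown unless you ask for it.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def maybe_fail(page):
    # stand-in for a page download; raises for some pages to simulate failure
    if page % 3 == 0:
        raise ValueError(f'page {page} failed')
    return page

failed, succeeded = [], []
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(maybe_fail, i): i for i in range(1, 7)}
    for fut in as_completed(futures):
        page = futures[fut]
        if fut.exception() is None:   # None means the task did not raise
            succeeded.append(page)
        else:
            failed.append(page)

print(sorted(succeeded), sorted(failed))   # [1, 2, 4, 5] [3, 6]
```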

3 weeks ago
Tangqy (OP) 3 weeks ago
Jason990420

The submit() method does not block while the task is executing, it returns immediately with a Future object that provides a handle on the task.

IMO, the script ends before all your tasks are done!

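A minimal sketch of that behavior (`slow_square` is just a made-up task, not part of the thread above): `submit()` hands back a `Future` right away, and it is `result()` that blocks until the task finishes.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_square(x):
    # stand-in for a slow task such as downloading a page
    time.sleep(0.1)
    return x * x

with ThreadPoolExecutor(max_workers=2) as pool:
    fut = pool.submit(slow_square, 4)   # returns immediately with a Future
    print(fut.done())                   # very likely False at this point
    result = fut.result()               # blocks until the task completes
print(result)                           # 16
```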
Not sure if it works well, but try the following code:

from concurrent.futures import ThreadPoolExecutor
import csv
from bs4 import BeautifulSoup
import requests

def download_one_page(i):
    url = f"http://www.xinfadi.com.cn/newsCenter.html?current={i}"
    resp = requests.get(url)
    res = BeautifulSoup(resp.text, 'html.parser')
    div = res.find('div', class_="conter_con")
    if div:
        ulss = div.find_all('ul')
        for ull in ulss:
            title = ull.find('p', class_="title").text
            img_url = ull.find('img').get('src')
            csvwriter.writerow([title, img_url])
        print(url, 'page done')
    else:
        print(url, 'No conter_con class found !')

if __name__ == '__main__':
    f = open('data.csv', mode='w', newline='')   # newline='' avoids blank lines on Windows
    csvwriter = csv.writer(f)
    with ThreadPoolExecutor(max_workers=5) as t:
        all_task = [t.submit(download_one_page, i) for i in range(1, 16)]
        results = [fut.result() for fut in all_task]    # wait until all tasks are done
    f.close()
print('ok')
http://www.xinfadi.com.cn/newsCenter.html?current=3 No conter_con class found !
http://www.xinfadi.com.cn/newsCenter.html?current=5 page done
http://www.xinfadi.com.cn/newsCenter.html?current=1 page done
http://www.xinfadi.com.cn/newsCenter.html?current=4 page done
http://www.xinfadi.com.cn/newsCenter.html?current=2 page done
http://www.xinfadi.com.cn/newsCenter.html?current=7 page done
http://www.xinfadi.com.cn/newsCenter.html?current=8 page done
http://www.xinfadi.com.cn/newsCenter.html?current=9 page done
http://www.xinfadi.com.cn/newsCenter.html?current=10 page done
http://www.xinfadi.com.cn/newsCenter.html?current=11 page done
http://www.xinfadi.com.cn/newsCenter.html?current=13 page done
http://www.xinfadi.com.cn/newsCenter.html?current=14 page done
http://www.xinfadi.com.cn/newsCenter.html?current=12 page done
http://www.xinfadi.com.cn/newsCenter.html?current=15 page done
http://www.xinfadi.com.cn/newsCenter.html?current=6 page done
ok
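Two other points worth noting about the original script (a sketch under assumptions, not anyone's code above; `demo.csv` and `write_row` are made-up names): the module-level file was never closed, so buffered rows can be lost, and `csv.writer` is not documented as thread-safe, so concurrent `writerow` calls are safer behind a lock.

```python
import csv
import threading
from concurrent.futures import ThreadPoolExecutor

lock = threading.Lock()

def write_row(writer, row):
    with lock:                 # serialize writes from worker threads
        writer.writerow(row)

with open('demo.csv', 'w', newline='') as f:   # newline='' avoids blank lines on Windows
    writer = csv.writer(f)
    with ThreadPoolExecutor(max_workers=5) as pool:
        for i in range(1, 16):
            pool.submit(write_row, writer, [i, f'row-{i}'])
        # leaving the inner with-block waits for all submitted tasks
# leaving the outer with-block closes the file, flushing buffered rows

with open('demo.csv', newline='') as f:
    row_count = sum(1 for _ in csv.reader(f))
print(row_count)   # 15
```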
3 weeks ago
Tangqy (OP) 3 weeks ago
