Thread pool scrapes the wrong number of pages?

from concurrent.futures import ThreadPoolExecutor
import csv
from bs4 import BeautifulSoup
import requests


f = open('data.csv',mode='w')
csvwriter = csv.writer(f)

def download_one_page(url):
    resp = requests.get(url)
    res = BeautifulSoup(resp.text, 'html.parser')
    div = res.find('div', class_="conter_con")
    ulss = div.find_all('ul')
    for ull in ulss:
        title = ull.find('p', class_="title").text
        img_url = ull.find('img').get('src')
        data = []
        data.append(title)
        data.append(img_url)
        csvwriter.writerow(data)
    print(url, 'page done')


if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=10) as t:
        for i in range(1,16):
            t.submit(download_one_page,f"http://www.xinfadi.com.cn/newsCenter.html?current={i}")

print('ok')

Could someone explain why I can only scrape 4 pages of data?

Replies: 4

It looks like the other pages failed to scrape. Check each task's result to see whether it actually succeeded.
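One way to do that check (a hedged sketch, not the original code; `maybe_fail` is a made-up stand-in for the page-download function): keep the `Future` objects and inspect `exception()`, which returns `None` when the task finished without raising. Note that an exception raised inside a submitted task is silently stored on the `Future` and never shown unless you ask for it.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def maybe_fail(page):
    # stand-in for a page download; raises for some pages to simulate failure
    if page % 3 == 0:
        raise ValueError(f'page {page} failed')
    return page

failed, succeeded = [], []
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(maybe_fail, i): i for i in range(1, 7)}
    for fut in as_completed(futures):
        page = futures[fut]
        if fut.exception() is None:   # None means the task did not raise
            succeeded.append(page)
        else:
            failed.append(page)

print(sorted(succeeded), sorted(failed))   # [1, 2, 4, 5] [3, 6]
```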

3 weeks ago
Tangqy (OP) 3 weeks ago
Jason990420

The submit() method does not block while the task is executing, it returns immediately with a Future object that provides a handle on the task.

IMO, the script ends before all your tasks are done!

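A minimal sketch of that behavior (`slow_square` is just a made-up task, not part of the thread above): `submit()` hands back a `Future` right away, and it is `result()` that blocks until the task finishes.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_square(x):
    # stand-in for a slow task such as downloading a page
    time.sleep(0.1)
    return x * x

with ThreadPoolExecutor(max_workers=2) as pool:
    fut = pool.submit(slow_square, 4)   # returns immediately with a Future
    print(fut.done())                   # very likely False at this point
    result = fut.result()               # blocks until the task completes
print(result)                           # 16
```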
Not sure if it works well, but try the following code:

from concurrent.futures import ThreadPoolExecutor
import csv
from bs4 import BeautifulSoup
import requests

def download_one_page(i):
    url = f"http://www.xinfadi.com.cn/newsCenter.html?current={i}"
    resp = requests.get(url)
    res = BeautifulSoup(resp.text, 'html.parser')
    div = res.find('div', class_="conter_con")
    if div:
        ulss = div.find_all('ul')
        for ull in ulss:
            title = ull.find('p', class_="title").text
            img_url = ull.find('img').get('src')
            csvwriter.writerow([title, img_url])
        print(url, 'page done')
    else:
        print(url, 'No conter_con class found !')

if __name__ == '__main__':
    f = open('data.csv', mode='w', newline='')   # newline='' avoids blank lines on Windows
    csvwriter = csv.writer(f)
    with ThreadPoolExecutor(max_workers=5) as t:
        all_task = [t.submit(download_one_page, i) for i in range(1, 16)]
        results = [fut.result() for fut in all_task]    # wait until all tasks are done
    f.close()
print('ok')
http://www.xinfadi.com.cn/newsCenter.html?current=3 No conter_con class found !
http://www.xinfadi.com.cn/newsCenter.html?current=5 page done
http://www.xinfadi.com.cn/newsCenter.html?current=1 page done
http://www.xinfadi.com.cn/newsCenter.html?current=4 page done
http://www.xinfadi.com.cn/newsCenter.html?current=2 page done
http://www.xinfadi.com.cn/newsCenter.html?current=7 page done
http://www.xinfadi.com.cn/newsCenter.html?current=8 page done
http://www.xinfadi.com.cn/newsCenter.html?current=9 page done
http://www.xinfadi.com.cn/newsCenter.html?current=10 page done
http://www.xinfadi.com.cn/newsCenter.html?current=11 page done
http://www.xinfadi.com.cn/newsCenter.html?current=13 page done
http://www.xinfadi.com.cn/newsCenter.html?current=14 page done
http://www.xinfadi.com.cn/newsCenter.html?current=12 page done
http://www.xinfadi.com.cn/newsCenter.html?current=15 page done
http://www.xinfadi.com.cn/newsCenter.html?current=6 page done
ok
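Two other points worth noting about the original script (a sketch under assumptions, not anyone's code above; `demo.csv` and `write_row` are made-up names): the module-level file was never closed, so buffered rows can be lost, and `csv.writer` is not documented as thread-safe, so concurrent `writerow` calls are safer behind a lock.

```python
import csv
import threading
from concurrent.futures import ThreadPoolExecutor

lock = threading.Lock()

def write_row(writer, row):
    with lock:                 # serialize writes from worker threads
        writer.writerow(row)

with open('demo.csv', 'w', newline='') as f:   # newline='' avoids blank lines on Windows
    writer = csv.writer(f)
    with ThreadPoolExecutor(max_workers=5) as pool:
        for i in range(1, 16):
            pool.submit(write_row, writer, [i, f'row-{i}'])
        # leaving the inner with-block waits for all submitted tasks
# leaving the outer with-block closes the file, flushing buffered rows

with open('demo.csv', newline='') as f:
    row_count = sum(1 for _ in csv.reader(f))
print(row_count)   # 15
```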
3 weeks ago
Tangqy (OP) 3 weeks ago
