A question about batch downloading

I've just started learning Python and ran into a question about batch downloading. A local file a.txt contains the following:

http://www.aaa.com/1.txt
http://www.aaa.com/2.txt
http://www.aaa.com/3.txt
http://www.aaa.com/4.txt
http://www.aaa.com/5.txt
...
http://www.aaa.com/999.txt
http://www.aaa.com/1000.txt

I can already do the batch download with my own code, but with a single thread and a single process it is very slow (the files are small, so most of the time is spent waiting on I/O). If I want to speed up the downloads rather than fetch the files one by one in a queue, what approach would work best?

Jason990420
Best answer

Something like this

import time
import threading
import requests

def download(url, index):
    response = requests.get(url)
    # print(f'{index:0>2d}: {url} downloaded.')

urls = ['https://learnku.com/' for i in range(100)]  # 100 copies of the same test URL

now = time.time()

threads = []
for i, url in enumerate(urls):
    thread = threading.Thread(target=download, args=(url, i), daemon=True)
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

seconds = time.time() - now

print(f'All URLs downloaded in {seconds:.2f} seconds')

Note:

  • The HTTP 429 Too Many Requests response status code indicates the user has sent too many requests in a given amount of time ("rate limiting"). A Retry-After header might be included in this response indicating how long to wait before making a new request.
  • A failed download should be retried; a minimal retry sketch follows below.
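
A minimal sketch of that retry idea, assuming requests is available; the function name download_with_retry, the max_retries count, the timeout, and the backoff values are illustrative choices, not part of the answer above:

import time
import requests

def download_with_retry(url, max_retries=3):
    """Fetch url, retrying on failure and honoring Retry-After on HTTP 429."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            time.sleep(2 ** attempt)  # simple exponential backoff on network errors
            continue
        if response.status_code == 429:
            # Wait as long as the server asks (assumes Retry-After is given in seconds)
            wait = int(response.headers.get('Retry-After', 2 ** attempt))
            time.sleep(wait)
            continue
        if response.ok:
            return response.content
    return None  # give up after max_retries attempts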
1 year ago
Replies: 4
  • Put the URLs into a queue and consume them with multiple processes, each process still downloading single-threaded;
  • Download with multiple threads;
  • If you are using curl, you can issue parallel requests with libcurl multi, which amounts to much the same thing as multithreading;

...the general idea is roughly along those lines (a thread-pool sketch of it follows below)
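
A minimal sketch of the thread-pool variant of those ideas, assuming the URLs sit in a.txt as in the question and that each file is saved under the last part of its URL; the worker count of 20 is an arbitrary choice:

from concurrent.futures import ThreadPoolExecutor
import requests

def download(url):
    # Name the local file after the last path component, e.g. .../1.txt -> 1.txt
    filename = url.rsplit('/', 1)[-1]
    response = requests.get(url, timeout=10)
    with open(filename, 'wb') as f:
        f.write(response.content)

with open('a.txt', encoding='utf-8') as f:
    urls = [line.strip() for line in f if line.strip()]

# A bounded pool of worker threads downloads the URLs concurrently
with ThreadPoolExecutor(max_workers=20) as pool:
    pool.map(download, urls)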

1 year ago

I still don't quite follow. Could you turn that into code for me? Many thanks!

1 year ago

You can use coroutines; they are a bit more lightweight.

Using coroutines means switching the synchronous operations to asynchronous ones, using aiohttp and aiofiles:

import aiohttp
import asyncio
import aiofiles


async def download(url, index):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            content = await resp.text()
            # do something with the response, e.g. save each URL to its own file
            # (writing everything to one shared file would just overwrite it)
            async with aiofiles.open(f'{index}.txt', mode='w', encoding='utf-8') as f:
                await f.write(content)


async def main():
    urls = ['url1', 'url2', 'url3']
    tasks = []
    for index, url in enumerate(urls):
        tasks.append(asyncio.create_task(download(url, index)))
    await asyncio.wait(tasks)

if __name__ == '__main__':
    asyncio.run(main())
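
Given the rate-limiting caveat in the best answer, firing every request at once may trigger HTTP 429; one way to cap the coroutine version is asyncio.Semaphore. A minimal sketch of that idea, with an arbitrary limit of 20 and a placeholder URL list:

import asyncio
import aiohttp


async def bounded_download(session, semaphore, url):
    # The semaphore caps how many requests are in flight at the same time
    async with semaphore:
        async with session.get(url) as resp:
            return await resp.text()


async def main(urls, limit=20):
    semaphore = asyncio.Semaphore(limit)
    async with aiohttp.ClientSession() as session:
        tasks = [bounded_download(session, semaphore, url) for url in urls]
        return await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(main(['https://learnku.com/'] * 5))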
1 year ago
