Slow CSV read + write times, help finding the problem

I have a CSV file with 32,000 rows of data, each row with 8 columns.
I need to read the 6th column of every row from the CSV file and process it, and it runs correctly.

When I first tested with 1,000 rows, execution was very fast, but when I ran it against the full 32,000-row file it became extremely slow. I don't really understand where the problem is and would appreciate any advice.

The code is as follows:

from datetime import date,datetime
import csv
import codecs
import time
import re
import sys
import os
import jieba
from itertools import repeat
sys.setrecursionlimit(100000000)
input_file = 'Data.csv'
output_file = 'All.csv'
with open(input_file, newline='', encoding='utf-8') as csvfile:
    total_line = len(csvfile.readlines())-1
for_loop = total_line + 1
print(for_loop)
with open(output_file, 'a', newline='', encoding='utf-8') as csvfile:
    csvfile.write('回應作者\n')
for post in range(1,for_loop):
    with open(input_file, newline='', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile)
        column = [row[5] for row in reader]
        for i, rows in enumerate(column):
            if i == post:
                string = rows
    #print(string)
    string = re.sub('@.*?@', ' ', string, 1)
    flag = 1
    print('Post: ',post)
    while(flag):
        if (string.find('@!@') != -1):
            index = string.find('@!@')
            output_string = string[1:index]
            #print(output_string)
            with open(output_file, 'a', newline='', encoding='utf-8') as csvfile:
                csvfile.write(output_string+'\n')
            string = re.sub(' .*?@', '', string, 1)
            string = re.sub('!.*?@', ' ', string, 1)
        else:
            if(string != ''):
                string = re.sub(' ', '', string, 1)
                output_string = string
                #print(output_string)
                with open(output_file, 'a', newline='', encoding='utf-8') as csvfile:
                    csvfile.write(output_string+'\n')
                string = ''
                if(string == ''):
                    flag = 0
            else:
                print("not found")
                flag = 0
Jason990420
Best answer

There is far too much repetition here: with 32,000 rows, you re-read the entire file 32,000 times, and loop over the whole column 32,000 times as well. It would be strange if it were not slow!

for post in range(1,for_loop):
    with open(input_file, newline='', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile)
        column = [row[5] for row in reader]
        for i, rows in enumerate(column):
            if i == post:
                string = rows

Slightly modified; I don't have Data.csv, so this is untested.

import re
import csv

input_file = 'Data.csv'
output_file = 'All.csv'

with open(input_file, newline='', encoding='utf-8') as csvfile:
    lines = csv.reader(csvfile)
    columns = [line[5] for line in lines]

buffer = '回應作者\n'

for string in columns:
    string = re.sub('@.*?@', ' ', string, 1)
    while True:
        if '@!@' in string:
            index = string.find('@!@')
            buffer += string[1:index] + '\n'
            string = re.sub(' .*?@', '', string, 1)
            string = re.sub('!.*?@', ' ', string, 1)
        elif string:
            string = re.sub(' ', '', string, 1)
            buffer += string + '\n'
            break
        else:
            print("not found")
            break

with open(output_file, 'wt', encoding='utf-8') as csvfile:
    csvfile.write(buffer)
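One further point worth noting (a sketch under the assumption that the extracted strings may themselves contain commas or quotes): All.csv is written with plain `write()` calls, so any value containing a comma, quote, or newline would no longer be a valid CSV field. The standard `csv.writer` quotes such fields automatically. The file name `All_demo.csv` and the sample values below are hypothetical:

```python
import csv

# Hypothetical sample values standing in for the extracted column data.
rows = ['plain text', 'contains, a comma', 'contains "quotes"']

with open('All_demo.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['回應作者'])     # header row
    for value in rows:
        writer.writerow([value])      # csv.writer quotes fields as needed
```

Reading the file back with `csv.reader` then recovers each value intact, commas and all.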
3 years ago

Replies: 2
pardon110

You didn't provide the target data to be matched or the desired result, but it's safe to say the regexes have plenty of room for improvement. A few suggestions:
1. Don't read and write files repeatedly; consider stream processing, or reading in chunks (e.g. a given range of rows at a time).
2. Make the regexes as precise as possible, and likewise don't rebuild the regex objects over and over.
3. If the order of the extracted column data doesn't matter, you can process it with concurrent tasks and buffered, segmented writes.
Basic idea: dispatch tasks, read in batches, process asynchronously, write through a buffer.

3 years ago
