Slow CSV read + write times, help finding the problem

I have a CSV file with 32,000 rows of data, each row with 8 columns.
I need to read the 6th column of every row from the CSV file and process it, and it runs correctly.

When I first tested with 1,000 rows, execution was very fast, but when I ran it against the full 32,000-row file it became extremely slow. I don't really understand where the problem is and would appreciate any advice.

The code is as follows:

from datetime import date,datetime
import csv
import codecs
import time
import re
import sys
import os
import jieba
from itertools import repeat
sys.setrecursionlimit(100000000)
input_file = 'Data.csv'
output_file = 'All.csv'
with open(input_file, newline='', encoding='utf-8') as csvfile:
    total_line = len(csvfile.readlines())-1
for_loop = total_line + 1
print(for_loop)
with open(output_file, 'a', newline='', encoding='utf-8') as csvfile:
    csvfile.write('回應作者\n')
for post in range(1,for_loop):
    with open(input_file, newline='', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile)
        column = [row[5] for row in reader]
        for i, rows in enumerate(column):
            if i == post:
                string = rows
    #print(string)
    string = re.sub('@.*?@', ' ', string, 1)
    flag = 1
    print('Post: ',post)
    while(flag):
        if (string.find('@!@') != -1):
            index = string.find('@!@')
            output_string = string[1:index]
            #print(output_string)
            with open(output_file, 'a', newline='', encoding='utf-8') as csvfile:
                csvfile.write(output_string+'\n')
            string = re.sub(' .*?@', '', string, 1)
            string = re.sub('!.*?@', ' ', string, 1)
        else:
            if(string != ''):
                string = re.sub(' ', '', string, 1)
                output_string = string
                #print(output_string)
                with open(output_file, 'a', newline='', encoding='utf-8') as csvfile:
                    csvfile.write(output_string+'\n')
                string = ''
                if(string == ''):
                    flag = 0
            else:
                print("not found")
                flag = 0
Jason990420
Best answer

There is far too much repetition here: with 32,000 rows, you re-read the entire file 32,000 times, and loop over the whole column 32,000 times as well. It would be strange if it were not slow!

for post in range(1,for_loop):
    with open(input_file, newline='', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile)
        column = [row[5] for row in reader]
        for i, rows in enumerate(column):
            if i == post:
                string = rows

Slightly modified; I don't have Data.csv, so this is untested.

import re
import csv

input_file = 'Data.csv'
output_file = 'All.csv'

with open(input_file, newline='', encoding='utf-8') as csvfile:
    lines = csv.reader(csvfile)
    columns = [line[5] for line in lines]

buffer = '回應作者\n'

for string in columns:
    string = re.sub('@.*?@', ' ', string, 1)
    while True:
        if '@!@' in string:
            index = string.find('@!@')
            buffer += string[1:index] + '\n'
            string = re.sub(' .*?@', '', string, 1)
            string = re.sub('!.*?@', ' ', string, 1)
        elif string:
            string = re.sub(' ', '', string, 1)
            buffer += string + '\n'
            break
        else:
            print("not found")
            break

with open(output_file, 'wt', encoding='utf-8') as csvfile:
    csvfile.write(buffer)
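One further point worth noting (a sketch under the assumption that the extracted strings may themselves contain commas or quotes): All.csv is written with plain `write()` calls, so any value containing a comma, quote, or newline would no longer be a valid CSV field. The standard `csv.writer` quotes such fields automatically. The file name `All_demo.csv` and the sample values below are hypothetical:

```python
import csv

# Hypothetical sample values standing in for the extracted column data.
rows = ['plain text', 'contains, a comma', 'contains "quotes"']

with open('All_demo.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['回應作者'])     # header row
    for value in rows:
        writer.writerow([value])      # csv.writer quotes fields as needed
```

Reading the file back with `csv.reader` then recovers each value intact, commas and all.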
3 years ago

Replies: 2
pardon110

You didn't provide the target data to be matched or the desired result, but it's safe to say the regexes have plenty of room for improvement. A few suggestions:
1. Don't read and write files repeatedly; consider stream processing, or reading in chunks (e.g. a given range of rows at a time).
2. Make the regexes as precise as possible, and likewise don't rebuild the regex objects over and over.
3. If the order of the extracted column data doesn't matter, you can process it with concurrent tasks and buffered, segmented writes.
Basic idea: dispatch tasks, read in batches, process asynchronously, write through a buffer.

3 years ago
