两层 for 嵌套的循环，算法优化一下？

两层for循环代码，复杂度为n^2，大家又想到什么优化的算法啊吗？
另外还有一个二维list生成式，

UBdata = [[eval(list(user_s.loc[user_s['user'] == u,'history'])[0])['book'].count(int(b)) if b in eval(list(user_s.loc[user_s['user'] == u,'history'])[0])['book']  else 0 for b in books] for u in users]
# 其实我个人不是很喜欢一行式、不是很方便阅读，下面注释代码为嵌套for的详细式子
# for u in users:
#     print('用户',u,'------------------\n')
#     history = eval(list(user_s.loc[user_s['user'] == u,'history'])[0])
#     print('books',history['book'])
#     for b in books:
#         print(b,history['book'].count(int(b)))

以上代码需求，纵轴index为users，横轴cloumns为books，从用户历史记录中找到用户看过的书，并计数。

另外，之前发这个贴子想解决的哪个三层循环，哪个代码是我脑子秀逗了，写错了。这是更改后的代码，正确的

TBdata = np.zeros((len(books),len(tags)))
for b_i,b_v in enumerate(book_tag['book']):
    t_v=book_tag['given_tag'][b_i]
    TBdata[books.index(b_v),tags.index(t_v)] +=1

代码如上，books是一个装有n万个书籍id的list，tags是所有标签（千把个）的id的list。
book_tag = {‘book’:[1,2,3,4,4,3,2,3,4],’given_tag’:[5,6,7,2,3,4,4,3,2]}
这是一个book_tag例子，格式如上，实际数据很大，而且book和given_tag中有很多重复的。
意义是，book_tag['book'][0]和book_tag['given_tag'][0]是相对应的一组（总共有n万组）

代码的目的是生成TBdata# TBdata: index=books,columns=tags，然后每个点TBdata[b_i,t_i]的数据如果出现一次，则增加一次。

单纯想降低复杂度。

如果我还没解释清楚请留个言，靴靴

整体源代码是想尝试矩阵分解
源代码：


#!/usr/bin/python
# -*- coding: utf-8 -*-
#__author__ : stray_camel
#pip_source : https://mirrors.aliyun.com/pypi/simple
import sys,os
import pandas as pd
import numpy as np

# fix error : _csv.Error: field larger than field limit (131072)
import csv
maxInt = sys.maxsize
while True:
    # decrease the maxInt value by factor 10 
    # as long as the OverflowError occurs.
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)


class CF_CB():
    '''
    Collaborative Filtering + Content-based Filtering
    '''
    def __init__(self):
        pass

    def Matrix_factorization(self):
        book_tag = {'book':[],'given_tag':[]}
        books = []
        tags = []
        user_s = pd.read_csv(os.path.dirname(__file__)+'/data/users.csv', nrows=5, sep='\t', engine='python', dtype={'history':dict})
        for _ in user_s['history']:
            try:
                books.extend(set(_['book']))
                tags.extend(set(_['given_tag']))
                book_tag.append(_)
            except TypeError as e:
                books.extend(set(eval(_)['book']))
                tags.extend(set(eval(_)['given_tag']))
                book_tag['book']+=eval(_)['book']
                book_tag['given_tag']+=eval(_)['given_tag']
        books.sort()
        books = list(set(books))
        tags.sort()
        tags = list(set(tags))
        users = list(user_s['user'])

        # UBdata: index=users,columns=books
        UBdata = [[eval(list(user_s.loc[user_s['user'] == u,'history'])[0])['book'].count(int(b)) if b in eval(list(user_s.loc[user_s['user'] == u,'history'])[0])['book']  else 0 for b in books] for u in users]
        user_books = pd.DataFrame(UBdata,index=users,columns=books)

        # UTdata: index=users,columns=tags
        UTdata = [[eval(list(user_s.loc[user_s['user'] == u,'history'])[0])['given_tag'].count(int(t)) if t in eval(list(user_s.loc[user_s['user'] == u,'history'])[0])['given_tag']  else 0 for t in tags] for u in users]
        users_tags = pd.DataFrame(UTdata,index=users,columns=tags)

        # TBdata: index=books,columns=tags
        TBdata = np.zeros((len(books),len(tags)))
        for b_i,b_v in enumerate(book_tag['book']):
            t_v=book_tag['given_tag'][b_i]
            TBdata[books.index(b_v),tags.index(t_v)] +=1
        TBdata = pd.DataFrame(TBdata)
        # print(TBdata)
        print(UBdata.values.dot(TBdata.values))
        print('矩阵分解~~')


if __name__ == "__main__":
    test = CF_CB()
    test.Matrix_factorization()

文章！！首发于我的博客Stray_Camel(＾Ｕ＾)ノ~ＹＯ。

娃哈哈店长

见习助教 494 声望

当学习养成习惯，我的博客https://boywithacoin.cn/

1 人点赞

推荐文章：

更多推荐...

置顶

[进度 100.00%] Python Masonite 4.0 中文翻译召集（Python 中的类 Laravel 框架） 15 / 19 |

公告

Python Masonite 框架中文翻译召集（Python 中的类 Laravel 框架） 24 / 25 |

博客

收集了一些各大网站 python 的登陆方式,希望对学习 python 的小白，和想写爬虫的你们有所帮助,,本项目用于研究和分享各大网站的模拟登陆方式 17 / 5 |

翻译

Python 3.7 的一些新特性 10 / 2 |

链接

快速掌握一个语言最常用的 50% 11 / 1 |

翻译

使用 Python 一步步搭建自己的区块链 22 / 1 |

Jason990420

1.9k 声望 / 個人 @ 個人

最佳答案

很多时候, 资料的表示方式或结构, 会让你的代码写起来很简单, 或很复杂, 甚至很困难. 比如

TBdata[books.index(b_v),tags.index(t_v)] +=1

这代表books或tags中的id是除了没有重复外, 其他没有任何规律.

如果book及tag的id都从0开始编号, 而且连续:

如果books或tags中只有id, 那根本就不需要books及tags这两个列表, 只要记录book总数及tag总数, . 那动作就变简单, 也快多了, 至少不用books.index, tags.index . 如下:
```
value = tuple(zip(*list(book_tag.values())))
unique, counts = np.unique(value, return_counts=True, axis=0)
rows, cols = zip(*unique)
TBdata[rows, cols] = counts
```

如果你有一串book_tags, 直接用list, 而不是dictionary

book_tags = [[[1,2,3,4,4,3,2,3,4],[5,6,7,2,3,4,4,3,2]],
           [[1,2,3,4,4,3,2,3,4],[4,2,5,6,3,2,4,1,5]]]
value = []
for book_tag in book_tags:
  value += list(zip(*book_tag))
unique, counts = np.unique(value, return_counts=True, axis=0)
rows, cols = zip(*unique)
TBdata[rows, cols] = counts

5年前评论

讨论数量: 5

Jason990420

1.9k 声望 / 個人 @ 個人

這段看起來不太对…
For example:

# Code added for testing
books = [1,2,3,4]
tags = [1,2,3,4,5,6,7]
book_tag = {'book':[1,2,3,4,4,3,2,3,4],'given_tag':[5,6,7,2,3,4,4,3,2]}

# Your code
TBdata = np.zeros((len(books),len(tags)))
for b_i,b in enumerate(books):
    for t_i,t in enumerate(tags):
        """
        1. TBdata +1 if 'book' and 'given_tag' are the same in book_tag
        2. There's no any difference for different books and different tags
        3. It means all the content in TBdata will be the same, all 1 here.
        """
        for i,v in enumerate(book_tag['book']):
            if book_tag['given_tag'][i] == v: 
                TBdata[b_i,t_i] +=1
print(TBdata)

result:
[[1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1.]]

5年前评论

娃哈哈店长（楼主）

嗯我发现了有点丢脸，错代码还贴出来，哈哈哈，已经改了，，我发现你是社区最活跃的人哈哈

Jason990420 （作者）

@娃哈哈店长第二段真看不懂, 最好的方式是能有一段能跑的代码, 这样最直接.

娃哈哈店长（楼主）

@Jason990420 第二段代码代码需求，纵轴index为users，横轴cloumns为books，从用户历史记录中找到用户看过的书，并计数。，第二段是正确的，就是多层for看到觉得代码好水

gps99

18 声望

没太看懂楼主的应用需求。。
检索匹配的话，如果一次通吃过余繁琐，试试生成一两次中间文件呢？或分字段压入数据库，再规则搜回来，，
几万条量应该很快

要不吃内存，要不吃时间。总要吃一个

5年前评论

娃哈哈店长（楼主）

要不吃内存，要不吃时间。总要吃一个你理解的不错，就是我感觉我想法太直接了，中间文件常用方法咯，但是我很烦写代码的时候一直嵌套for，所以想和大家套讨论讨论

娃哈哈店长（楼主）

需求改了，你说的方法，一直都在用。但是有时候处理某个数据的时候，写了太多的for，就想问问自己，可不可以少写点if for？换个方式之类的。:)

Jason990420

1.9k 声望 / 個人 @ 個人

很多时候, 资料的表示方式或结构, 会让你的代码写起来很简单, 或很复杂, 甚至很困难. 比如

TBdata[books.index(b_v),tags.index(t_v)] +=1

这代表books或tags中的id是除了没有重复外, 其他没有任何规律.

如果book及tag的id都从0开始编号, 而且连续:

如果books或tags中只有id, 那根本就不需要books及tags这两个列表, 只要记录book总数及tag总数, . 那动作就变简单, 也快多了, 至少不用books.index, tags.index . 如下:
```
value = tuple(zip(*list(book_tag.values())))
unique, counts = np.unique(value, return_counts=True, axis=0)
rows, cols = zip(*unique)
TBdata[rows, cols] = counts
```

如果你有一串book_tags, 直接用list, 而不是dictionary

book_tags = [[[1,2,3,4,4,3,2,3,4],[5,6,7,2,3,4,4,3,2]],
           [[1,2,3,4,4,3,2,3,4],[4,2,5,6,3,2,4,1,5]]]
value = []
for book_tag in book_tags:
  value += list(zip(*book_tag))
unique, counts = np.unique(value, return_counts=True, axis=0)
rows, cols = zip(*unique)
TBdata[rows, cols] = counts

5年前评论

Galois

版主 1.4k 声望 / Coder @ Galois

:+1:

5年前评论

Tang_Yuzhao

0 声望

想减少for循环嵌套楼主可以看看itertools.product(),

3年前评论

讨论应以学习和精进为目的。请勿发布不友善或者负能量的内容，与人为善，比聪明更重要！

帮助

两层 for 嵌套的循环，算法优化一下？

推荐文章：

社区赞助商

关于 LearnKu

资源推荐

服务提供商

其他信息

两层 for 嵌套的循环，算法优化一下？

推荐文章：

社区赞助商

关于 LearnKu

资源推荐

服务提供商

其他信息

请登录