MovieLens movie recommender system: the data

GitHub project source code: click here 🤺

I'm studying someone else's GitHub project here to pick up some knowledge.

This project uses a text convolutional neural network and the MovieLens dataset to build a movie recommender.

Install TensorFlow 2.1 and Python 3.5 first, then check that TensorFlow imports correctly:

>>> import tensorflow as tf
2020-02-09 16:26:07.555300: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
>>> print(tf.__version__)
2.1.0
>>>

MovieLens data

This project uses the MovieLens 1M dataset.

It contains about 1 million ratings from roughly 6,000 users on nearly 4,000 movies.
The dataset is split into three files:

User data: users.dat
Movie data: movies.dat
Rating data: ratings.dat

User data

Let's look at a few rows:

import pandas as pd
users_title = ['UserID', 'Gender', 'Age', 'OccupationID', 'Zip-code']
users = pd.read_csv('./ml-1m/users.dat', sep='::', header=None, names=users_title, engine = 'python')
print(users.head())

   UserID Gender  Age  OccupationID Zip-code
0       1      F    1            10    48067
1       2      M   56            16    70072
2       3      M   25            15    55117
3       4      M   45             7    02460
4       5      M   25            20    55455

The fields are user ID, gender, age, occupation ID, and zip code.

Each line in the file has the format: UserID::Gender::Age::Occupation::Zip-code
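Because the separator '::' is two characters long, pandas interprets it as a regular expression, and only the Python parsing engine supports that, which is why the loading code passes engine='python'. Here is a minimal sketch on two lines copied from the preview above. The in-memory sample and the dtype argument are my own additions for illustration, not part of the original project:

```python
import io
import pandas as pd

# Two sample lines in the users.dat format, taken from the preview above.
sample = "1::F::1::10::48067\n4::M::45::7::02460\n"

users_title = ['UserID', 'Gender', 'Age', 'OccupationID', 'Zip-code']

# A separator longer than one character is treated as a regular expression,
# so the Python engine is requested explicitly.
# dtype is an assumption added here: it keeps the zip code as text so that
# the leading zero in "02460" survives in this tiny all-numeric sample.
sample_users = pd.read_csv(io.StringIO(sample), sep='::', header=None,
                           names=users_title, engine='python',
                           dtype={'Zip-code': str})
print(sample_users)

# The same split done by hand on a single line.
fields = sample.splitlines()[0].split('::')
print(fields)
```

Reading the zip code as text is a design choice: it is an identifier rather than a quantity, so nothing is lost by not parsing it as a number.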

Gender is denoted by a “M” for male and “F” for female

Age is chosen from the following ranges:

    *  1:  "Under 18"
    * 18:  "18-24"
    * 25:  "25-34"
    * 35:  "35-44"
    * 45:  "45-49"
    * 50:  "50-55"
    * 56:  "56+"

Occupation is chosen from the following choices:

    *  0:  "other" or not specified
    *  1:  "academic/educator"
    *  2:  "artist"
    *  3:  "clerical/admin"
    *  4:  "college/grad student"
    *  5:  "customer service"
    *  6:  "doctor/health care"
    *  7:  "executive/managerial"
    *  8:  "farmer"
    *  9:  "homemaker"
    * 10:  "K-12 student"
    * 11:  "lawyer"
    * 12:  "programmer"
    * 13:  "retired"
    * 14:  "sales/marketing"
    * 15:  "scientist"
    * 16:  "self-employed"
    * 17:  "technician/engineer"
    * 18:  "tradesman/craftsman"
    * 19:  "unemployed"
    * 20:  "writer"
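Gender and Age are categorical codes rather than quantities, so before they go into a model they are usually mapped to small contiguous integers. The sketch below shows one way to do that on the five users from the preview. The particular encoding (F to 0, M to 1, age codes to 0..6) and the names gender_index, age_index and sample_users are assumptions for illustration and may differ from what the original project does. OccupationID already runs from 0 to 20, so it is left alone:

```python
import pandas as pd

# First five users from the preview above, rebuilt by hand for illustration.
sample_users = pd.DataFrame({
    'UserID': [1, 2, 3, 4, 5],
    'Gender': ['F', 'M', 'M', 'M', 'M'],
    'Age': [1, 56, 25, 45, 25],
    'OccupationID': [10, 16, 15, 7, 20],
})

# Age codes and the ranges they stand for, as listed above.
age_labels = {1: 'Under 18', 18: '18-24', 25: '25-34', 35: '35-44',
              45: '45-49', 50: '50-55', 56: '56+'}

# Assumed encoding: F -> 0, M -> 1, and age codes -> 0..6 in ascending order.
gender_index = {'F': 0, 'M': 1}
age_index = {code: i for i, code in enumerate(sorted(age_labels))}

sample_users['GenderIndex'] = sample_users['Gender'].map(gender_index)
sample_users['AgeIndex'] = sample_users['Age'].map(age_index)
sample_users['AgeGroup'] = sample_users['Age'].map(age_labels)
print(sample_users)
```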

Movie data

movies_title = ['MovieID', 'Title', 'Genres']
movies = pd.read_csv('./ml-1m/movies.dat', sep='::', header=None, names=movies_title, engine = 'python')
print(movies.head())

   MovieID                               Title                        Genres
0        1                    Toy Story (1995)   Animation|Children's|Comedy
1        2                      Jumanji (1995)  Adventure|Children's|Fantasy
2        3             Grumpier Old Men (1995)                Comedy|Romance
3        4            Waiting to Exhale (1995)                  Comedy|Drama
4        5  Father of the Bride Part II (1995)                        Comedy

Each line in the movie file has the format: MovieID::Title::Genres

Titles are identical to titles provided by the IMDB (including
year of release)
Genres are pipe-separated and are selected from the following genres:
- Action
- Adventure
- Animation
- Children's
- Comedy
- Crime
- Documentary
- Drama
- Fantasy
- Film-Noir
- Horror
- Musical
- Mystery
- Romance
- Sci-Fi
- Thriller
- War
- Western
Some MovieIDs do not correspond to a movie due to accidental duplicate
entries and/or test entries
Movies are mostly entered by hand, so errors and inconsistencies may exist
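Since a movie can carry several pipe-separated genres, the Genres column has to be split before it can be used as a feature. The following is a minimal sketch using the five movies from the preview. The multi-hot layout is one common choice and an assumption here, not necessarily the encoding the original project uses:

```python
import pandas as pd

# First five movies from the preview above, rebuilt by hand for illustration.
sample_movies = pd.DataFrame({
    'MovieID': [1, 2, 3, 4, 5],
    'Genres': ["Animation|Children's|Comedy", "Adventure|Children's|Fantasy",
               'Comedy|Romance', 'Comedy|Drama', 'Comedy'],
})

# Split each pipe-separated string into a list of genre names.
genre_lists = sample_movies['Genres'].str.split('|')

# Multi-hot encoding: one 0/1 column per genre that occurs in the sample.
genre_flags = sample_movies['Genres'].str.get_dummies(sep='|')
print(genre_flags)
```

On the full dataset the same call would produce one column per genre in the list above, provided every genre actually occurs in movies.dat.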

MovieID is a categorical field, Title is text, and Genres is also a categorical field.
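Because Title is free text that always ends with the release year, a text model would typically want the year separated from the words. Below is a rough sketch of one possible preprocessing step. The regular expression, the lowercasing and the column names Name, Year and Words are my assumptions, not something taken from the original project:

```python
import pandas as pd

# A few titles from the preview above.
sample_movies = pd.DataFrame({
    'Title': ['Toy Story (1995)', 'Jumanji (1995)',
              'Father of the Bride Part II (1995)'],
})

# Capture everything before the trailing "(YYYY)" as the name,
# and the four digits inside the parentheses as the year.
parts = sample_movies['Title'].str.extract(
    r'^(?P<Name>.*?)\s*\((?P<Year>\d{4})\)$')

sample_movies['Name'] = parts['Name']
sample_movies['Year'] = parts['Year'].astype(int)

# Lowercased word list, e.g. as input for a word-to-index vocabulary.
sample_movies['Words'] = sample_movies['Name'].str.lower().str.split()
print(sample_movies)
```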

Rating data

ratings_title = ['UserID','MovieID', 'Rating', 'timestamps']
ratings = pd.read_csv('./ml-1m/ratings.dat', sep='::', header=None, names=ratings_title, engine = 'python')
print(ratings.head())

   UserID  MovieID  Rating  timestamps
0       1     1193       5   978300760
1       1      661       3   978302109
2       1      914       3   978301968
3       1     3408       4   978300275
4       1     2355       5   978824291
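The timestamps column holds seconds since the Unix epoch, and UserID and MovieID link each rating to the other two files. Here is a small sketch of converting the timestamps and joining ratings to users, using hand-built samples from the previews. The column name RatedAt and the choice of a left join are assumptions for illustration:

```python
import pandas as pd

# The five ratings from the preview above, rebuilt by hand.
sample_ratings = pd.DataFrame({
    'UserID': [1, 1, 1, 1, 1],
    'MovieID': [1193, 661, 914, 3408, 2355],
    'Rating': [5, 3, 3, 4, 5],
    'timestamps': [978300760, 978302109, 978301968, 978300275, 978824291],
})

# Two users from the users.dat preview.
sample_users = pd.DataFrame({
    'UserID': [1, 2],
    'Gender': ['F', 'M'],
    'Age': [1, 56],
})

# Interpret the integers as seconds since 1970-01-01 (UTC).
sample_ratings['RatedAt'] = pd.to_datetime(sample_ratings['timestamps'], unit='s')

# Attach the user attributes to every rating; a left join keeps all ratings.
merged = sample_ratings.merge(sample_users, on='UserID', how='left')
print(merged)
```

The movie table can be attached the same way with a second merge on MovieID.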