MovieLens Movie Recommender System: The Data

GitHub project source: click here 🤺

I am studying someone else's GitHub project and picking up some knowledge along the way.

This project uses a text convolutional neural network and the MovieLens dataset to perform movie recommendation.

Install TensorFlow 2.1 and Python 3.5 first, then check that TensorFlow imports correctly:

>>> import tensorflow as tf
2020-02-09 16:26:07.555300: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
>>> print(tf.__version__)
2.1.0
>>>

MovieLens data

This project uses the MovieLens 1M dataset.

It contains about 1 million ratings from roughly 6,000 users on nearly 4,000 movies.
The dataset is split into three files:

  • User data: users.dat
  • Movie data: movies.dat
  • Rating data: ratings.dat

User data

Let's look at the first few rows:

import pandas as pd
users_title = ['UserID', 'Gender', 'Age', 'OccupationID', 'Zip-code']
users = pd.read_csv('./ml-1m/users.dat', sep='::', header=None, names=users_title, engine = 'python')
print(users.head())
   UserID Gender  Age  OccupationID Zip-code
0       1      F    1            10    48067
1       2      M   56            16    70072
2       3      M   25            15    55117
3       4      M   45             7    02460
4       5      M   25            20    55455

The fields are user ID, gender, age, occupation ID, and zip code.

Format in the data file: UserID::Gender::Age::Occupation::Zip-code

  • Gender is denoted by a “M” for male and “F” for female

  • Age is chosen from the following ranges:

        *  1:  "Under 18"
        * 18:  "18-24"
        * 25:  "25-34"
        * 35:  "35-44"
        * 45:  "45-49"
        * 50:  "50-55"
        * 56:  "56+"
  • Occupation is chosen from the following choices:

        *  0:  "other" or not specified
        *  1:  "academic/educator"
        *  2:  "artist"
        *  3:  "clerical/admin"
        *  4:  "college/grad student"
        *  5:  "customer service"
        *  6:  "doctor/health care"
        *  7:  "executive/managerial"
        *  8:  "farmer"
        *  9:  "homemaker"
        * 10:  "K-12 student"
        * 11:  "lawyer"
        * 12:  "programmer"
        * 13:  "retired"
        * 14:  "sales/marketing"
        * 15:  "scientist"
        * 16:  "self-employed"
        * 17:  "technician/engineer"
        * 18:  "tradesman/craftsman"
        * 19:  "unemployed"
        * 20:  "writer"
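
The coded fields above can be decoded or re-indexed before being fed to a model. The following is a minimal sketch, not the original project's preprocessing: the sample rows are copied from the `users.head()` output above, and the column names `AgeLabel`, `AgeIndex`, and `GenderIndex`, as well as the encodings themselves, are assumptions made for illustration.

```python
import pandas as pd

# Age codes and labels copied from the dataset description above.
AGE_LABELS = {
    1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56+",
}

# Sample rows copied from the users.head() output above.
sample_users = pd.DataFrame({
    "UserID": [1, 2, 3, 4, 5],
    "Gender": ["F", "M", "M", "M", "M"],
    "Age": [1, 56, 25, 45, 25],
    "OccupationID": [10, 16, 15, 7, 20],
    "Zip-code": ["48067", "70072", "55117", "02460", "55455"],
})

# Human-readable age range for each code.
sample_users["AgeLabel"] = sample_users["Age"].map(AGE_LABELS)

# Assumed encoding: position of the age code in sorted order (0..6),
# which gives a contiguous index suitable for an embedding lookup.
age_index = {code: i for i, code in enumerate(sorted(AGE_LABELS))}
sample_users["AgeIndex"] = sample_users["Age"].map(age_index)

# Assumed encoding: F -> 0, M -> 1.
sample_users["GenderIndex"] = sample_users["Gender"].map({"F": 0, "M": 1})

print(sample_users[["UserID", "AgeLabel", "AgeIndex", "GenderIndex"]])
```

OccupationID is already a contiguous range from 0 to 20, so it can usually be used as an index directly.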

Movie data

movies_title = ['MovieID', 'Title', 'Genres']
movies = pd.read_csv('./ml-1m/movies.dat', sep='::', header=None, names=movies_title, engine = 'python')
print(movies.head())
   MovieID                               Title                        Genres
0        1                    Toy Story (1995)   Animation|Children's|Comedy
1        2                      Jumanji (1995)  Adventure|Children's|Fantasy
2        3             Grumpier Old Men (1995)                Comedy|Romance
3        4            Waiting to Exhale (1995)                  Comedy|Drama
4        5  Father of the Bride Part II (1995)                        Comedy

The movie data uses this format: MovieID::Title::Genres

  • Titles are identical to titles provided by the IMDB (including
    year of release)

  • Genres are pipe-separated and are selected from the following genres:

    • Action
    • Adventure
    • Animation
    • Children’s
    • Comedy
    • Crime
    • Documentary
    • Drama
    • Fantasy
    • Film-Noir
    • Horror
    • Musical
    • Mystery
    • Romance
    • Sci-Fi
    • Thriller
    • War
    • Western
  • Some MovieIDs do not correspond to a movie due to accidental duplicate
    entries and/or test entries

  • Movies are mostly entered by hand, so errors and inconsistencies may exist

MovieID is a categorical field, Title is text, and Genres is also a categorical field (each movie can carry several genres).
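
Because Title embeds the release year and Genres packs several categories into one string, both usually need to be split before use. Here is a minimal sketch under stated assumptions: the rows are copied from the `movies.head()` output above, and the helper `split_title`, the pattern `TITLE_YEAR`, and the `genre_to_id` mapping are hypothetical names for illustration, not part of the original project.

```python
import re
import pandas as pd

# Sample rows copied from the movies.head() output above.
sample_movies = pd.DataFrame({
    "MovieID": [1, 2, 3],
    "Title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "Genres": [
        "Animation|Children's|Comedy",
        "Adventure|Children's|Fantasy",
        "Comedy|Romance",
    ],
})

# Assumed title layout: "<name> (<4-digit year>)".
TITLE_YEAR = re.compile(r"^(.*) \((\d{4})\)$")

def split_title(title):
    """Return (name, year); year is None if the title has no trailing year."""
    match = TITLE_YEAR.match(title)
    if match is None:
        return title, None
    return match.group(1), int(match.group(2))

# Genres are pipe-separated, so a plain string split is enough.
sample_movies["GenreList"] = sample_movies["Genres"].apply(lambda g: g.split("|"))

# Build a genre vocabulary from the sample and map each genre to an integer id.
genre_vocab = sorted({g for genres in sample_movies["GenreList"] for g in genres})
genre_to_id = {genre: i for i, genre in enumerate(genre_vocab)}
sample_movies["GenreIds"] = sample_movies["GenreList"].apply(
    lambda genres: [genre_to_id[g] for g in genres]
)

print(split_title("Toy Story (1995)"))
print(sample_movies[["MovieID", "GenreList", "GenreIds"]])
```

On the full dataset the vocabulary would be built from all 18 genres listed above rather than from three sample rows.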

Rating data

ratings_title = ['UserID','MovieID', 'Rating', 'timestamps']
ratings = pd.read_csv('./ml-1m/ratings.dat', sep='::', header=None, names=ratings_title, engine = 'python')
print(ratings.head())
   UserID  MovieID  Rating  timestamps
0       1     1193       5   978300760
1       1      661       3   978302109
2       1      914       3   978301968
3       1     3408       4   978300275
4       1     2355       5   978824291

The fields are user ID, movie ID, rating, and timestamp.

Format in the data file: UserID::MovieID::Rating::Timestamp

  • UserIDs range between 1 and 6040
  • MovieIDs range between 1 and 3952
  • Ratings are made on a 5-star scale (whole-star ratings only)
  • Timestamp is represented in seconds since the epoch as returned by time(2)
  • Each user has at least 20 ratings

The Rating field is the target we want to learn; the timestamp field is not used.
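
One way to turn this into training data is to drop the timestamp, join the ratings with the user table, and separate the target column. This is a minimal sketch rather than the original project's pipeline: the rows are copied from the `head()` outputs above, and the variable names `data`, `features`, and `targets` are assumptions.

```python
import pandas as pd

# Sample rows copied from the ratings.head() output above.
ratings = pd.DataFrame({
    "UserID": [1, 1, 1, 1, 1],
    "MovieID": [1193, 661, 914, 3408, 2355],
    "Rating": [5, 3, 3, 4, 5],
    "timestamps": [978300760, 978302109, 978301968, 978300275, 978824291],
})

# First two rows of the users.head() output above.
users = pd.DataFrame({
    "UserID": [1, 2],
    "Gender": ["F", "M"],
    "Age": [1, 56],
    "OccupationID": [10, 16],
    "Zip-code": ["48067", "70072"],
})

# Drop the unused timestamp column, then attach user attributes to each rating.
data = ratings.drop(columns=["timestamps"]).merge(users, on="UserID", how="inner")

# Rating is the learning target; everything else is an input feature.
targets = data["Rating"].to_numpy()
features = data.drop(columns=["Rating"])

print(features)
print(targets)
```

The movie table can be joined the same way with `on="MovieID"` once it is loaded in full.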

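This work is released under a CC license; reposts must credit the author and link to this article.
This article was first published on my blog, Stray_Camel(^U^)ノ~YO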