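MovieLens Movie Recommendation System Data
GitHub project source code: click here 🤺
I'm learning from someone else's GitHub project here and picking up some knowledge along the way.
This project uses a text convolutional neural network and the MovieLens
dataset to complete a movie recommendation task.
Install TensorFlow 2.1 and Python 3.5 first, then check that TensorFlow imports correctly: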
>>> import tensorflow as tf
2020-02-09 16:26:07.555300: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
>>> print(tf.__version__)
2.1.0
>>>
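MovieLens Data
This project uses the MovieLens 1M dataset.
It contains about 1 million ratings from roughly 6,000 users on nearly 4,000 movies.
The dataset is split into three files:
- User data: users.dat
- Movie data: movies.dat
- Rating data: ratings.dat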
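To get a feel for how sparse these ratings are, here is a small arithmetic sketch. It assumes the rating count of 1,000,209 given in the dataset's README (that exact number is not stated in this post), and it uses the maximum UserID (6040) and MovieID (3952) as rough upper bounds on the number of users and movies. The function name `rating_density` is just for illustration.

```python
# Rough density of the user-movie rating matrix for MovieLens 1M.
# Assumption: 1,000,209 ratings, the figure given in the dataset's README.
NUM_RATINGS = 1000209
MAX_USER_ID = 6040    # upper bound on the number of users
MAX_MOVIE_ID = 3952   # upper bound on the number of movies (some IDs are unused)


def rating_density(num_ratings, num_users, num_movies):
    """Fraction of all possible (user, movie) pairs that actually have a rating."""
    return num_ratings / (num_users * num_movies)


density = rating_density(NUM_RATINGS, MAX_USER_ID, MAX_MOVIE_ID)
print('density: {:.2%}'.format(density))  # about 4.19%
```

So only around 4% of the possible user-movie pairs carry a rating, which is why the model has to predict ratings for the remaining pairs.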
User data
Let's look at part of the data:
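import pandas as pd

# '::' is longer than one character, so pandas treats it as a regular
# expression and needs the Python parsing engine.
users_title = ['UserID', 'Gender', 'Age', 'OccupationID', 'Zip-code']
users = pd.read_csv('./ml-1m/users.dat', sep='::', header=None, names=users_title, engine='python')
print(users.head())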
   UserID Gender  Age  OccupationID Zip-code
0       1      F    1            10    48067
1       2      M   56            16    70072
2       3      M   25            15    55117
3       4      M   45             7    02460
4       5      M   25            20    55455
The fields are user ID, gender, age, occupation ID, and zip code.
Format in the data file: UserID::Gender::Age::Occupation::Zip-code
Gender is denoted by an "M" for male and an "F" for female.
Age is chosen from the following ranges:
- 1: "Under 18"
- 18: "18-24"
- 25: "25-34"
- 35: "35-44"
- 45: "45-49"
- 50: "50-55"
- 56: "56+"
Occupation is chosen from the following choices:
- 0: "other" or not specified
- 1: "academic/educator"
- 2: "artist"
- 3: "clerical/admin"
- 4: "college/grad student"
- 5: "customer service"
- 6: "doctor/health care"
- 7: "executive/managerial"
- 8: "farmer"
- 9: "homemaker"
- 10: "K-12 student"
- 11: "lawyer"
- 12: "programmer"
- 13: "retired"
- 14: "sales/marketing"
- 15: "scientist"
- 16: "self-employed"
- 17: "technician/engineer"
- 18: "tradesman/craftsman"
- 19: "unemployed"
- 20: "writer"
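Since Gender and Age are category codes rather than real quantities, one common preprocessing step is to map them to small contiguous integers before feeding them to a model. Below is a minimal sketch of that idea on a single raw record in the `UserID::Gender::Age::Occupation::Zip-code` format. The helper `parse_user_line` and the two mapping dictionaries are hypothetical names, and the particular index assignment is my assumption, not necessarily what the original project does.

```python
# Age codes from the list above, mapped to contiguous indices 0..6 (assumed ordering).
AGE_CODES = [1, 18, 25, 35, 45, 50, 56]
AGE_TO_INDEX = {code: idx for idx, code in enumerate(AGE_CODES)}
GENDER_TO_INDEX = {'F': 0, 'M': 1}  # assumed assignment


def parse_user_line(line):
    """Parse one 'UserID::Gender::Age::Occupation::Zip-code' record (hypothetical helper)."""
    user_id, gender, age, occupation, zip_code = line.strip().split('::')
    return {
        'UserID': int(user_id),
        'Gender': GENDER_TO_INDEX[gender],
        'Age': AGE_TO_INDEX[int(age)],
        'OccupationID': int(occupation),  # already a contiguous 0..20 code
        'Zip-code': zip_code,             # kept as a string so '02460' keeps its leading zero
    }


# Raw form of the fourth sample row shown earlier.
user = parse_user_line('4::M::45::7::02460')
print(user)
```

Keeping the zip code as a string matters here: converting it to an integer would silently turn `02460` into `2460`.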
Movie data
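movies_title = ['MovieID', 'Title', 'Genres']
# If this raises a UnicodeDecodeError, passing encoding='latin-1' may help
# (an assumption about how movies.dat is encoded).
movies = pd.read_csv('./ml-1m/movies.dat', sep='::', header=None, names=movies_title, engine='python')
print(movies.head())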
   MovieID                               Title                        Genres
0        1                    Toy Story (1995)   Animation|Children's|Comedy
1        2                      Jumanji (1995)  Adventure|Children's|Fantasy
2        3             Grumpier Old Men (1995)                Comedy|Romance
3        4            Waiting to Exhale (1995)                  Comedy|Drama
4        5  Father of the Bride Part II (1995)                        Comedy
Format of the movie data: MovieID::Title::Genres
Titles are identical to titles provided by the IMDB (including year of release).
Genres are pipe-separated and are selected from the following genres:
- Action
- Adventure
- Animation
- Children’s
- Comedy
- Crime
- Documentary
- Drama
- Fantasy
- Film-Noir
- Horror
- Musical
- Mystery
- Romance
- Sci-Fi
- Thriller
- War
- Western
Some MovieIDs do not correspond to a movie due to accidental duplicate entries and/or test entries.
Movies are mostly entered by hand, so errors and inconsistencies may exist.
MovieID is a categorical field, Title is text, and Genres is also a categorical field.
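To make the "text" and "categorical" distinction concrete, here is a rough sketch of how the two non-ID movie fields could be prepared. It strips the release year off the title with a regular expression and turns the pipe-separated genres into a multi-hot vector over the 18 genres. The names `split_title` and `encode_genres` are hypothetical, and multi-hot encoding is just one possible choice that I'm assuming for illustration; the original project may encode genres differently.

```python
import re

# The 18 genres listed above, in the same order.
GENRES = ['Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime',
          'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical',
          'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
GENRE_TO_INDEX = {name: idx for idx, name in enumerate(GENRES)}

# Matches a trailing '(YYYY)' release year at the end of a title.
TITLE_YEAR = re.compile(r'^(.*?)\s*\((\d{4})\)$')


def split_title(title):
    """Return (name, year); year is None when there is no trailing '(YYYY)'."""
    match = TITLE_YEAR.match(title.strip())
    if match is None:
        return title.strip(), None
    return match.group(1), int(match.group(2))


def encode_genres(genres):
    """Turn a string like 'Comedy|Romance' into a multi-hot list over GENRES."""
    vector = [0] * len(GENRES)
    for name in genres.split('|'):
        vector[GENRE_TO_INDEX[name]] = 1
    return vector


name, year = split_title('Toy Story (1995)')
genre_vector = encode_genres("Animation|Children's|Comedy")
print(name, year, genre_vector)
```

The title text left over after removing the year is what a text convolutional network would then work on, typically after splitting it into words.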
Rating data
ratings_title = ['UserID', 'MovieID', 'Rating', 'timestamps']
ratings = pd.read_csv('./ml-1m/ratings.dat', sep='::', header=None, names=ratings_title, engine='python')
print(ratings.head())
   UserID  MovieID  Rating  timestamps
0       1     1193       5   978300760
1       1      661       3   978302109
2       1      914       3   978301968
3       1     3408       4   978300275
4       1     2355       5   978824291
The fields are user ID, movie ID, rating, and timestamp.
Format in the data file: UserID::MovieID::Rating::Timestamp
- UserIDs range between 1 and 6040
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
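As a quick illustration of what "seconds since the epoch" means, the sketch below converts the timestamp of the first sample rating into a readable UTC date using only the standard library. The helper name `to_utc` is hypothetical.

```python
from datetime import datetime, timezone


def to_utc(timestamp):
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


# Timestamp of the first sample rating shown above.
rated_at = to_utc(978300760)
print(rated_at.isoformat())  # 2000-12-31T22:12:40+00:00
```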
The Rating field is the target we want to learn; the timestamp field is not used.
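In code, that split between inputs and targets could look roughly like the sketch below. It works on a few raw lines copied from the sample above and simply discards the timestamp. The function name `split_features_targets` is hypothetical, and using only the two IDs as features is a simplification on my part; the full project would also bring in the user and movie attributes.

```python
# A few raw records in 'UserID::MovieID::Rating::Timestamp' form (from the sample above).
RAW_RATINGS = [
    '1::1193::5::978300760',
    '1::661::3::978302109',
    '1::914::3::978301968',
]


def split_features_targets(lines):
    """Separate (UserID, MovieID) inputs from Rating targets, dropping the timestamp."""
    features, targets = [], []
    for line in lines:
        user_id, movie_id, rating, _timestamp = line.strip().split('::')
        features.append((int(user_id), int(movie_id)))
        targets.append(int(rating))
    return features, targets


features, targets = split_features_targets(RAW_RATINGS)
print(features)
print(targets)
```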
This work is licensed under a CC license; reposts must credit the author and link to this article.