Python 数据挖掘-如何赢得英雄联盟 LOL?游戏获胜与哪些特征更相关

简介数据集获取关注加点赞截图 加群+qq群:606115027(找管理发给你)

英雄联盟(League of Legends, LoL)是Riot Games为微软Windows和macOS开发和发行的一款多人在线战斗竞技场游戏。

在《英雄联盟》中,玩家扮演一个拥有独特能力的“冠军”角色,与一组其他玩家或电脑控制的冠军进行战斗。

游戏的目标通常是摧毁对方的“Nexus”,即一个位于由防御结构保护的基地中心的结构,尽管其他不同的游戏模式也存在着不同的目标、规则和地图。每一场《英雄联盟》的比赛都是离散的,所有的冠军一开始都相对较弱,但在游戏过程中会通过累积道具和经验来增强实力。
此数据集包含前10分钟。大约的数据,10k排名游戏(SOLO QUEUE)从高ELO (DIAMOND I to MASTER)。
玩家的级别大致相同。在游戏10分钟后,每个团队有19个特性(总共38个)。这包括死亡,死亡,黄金,经验,关卡…
这取决于你去做一些特性工程来获得更多的见解。列blueWins是目标值(我们试图预测的值)。1表示蓝队赢了。

本篇的目的是预测哪些特征与LOL获胜更相关。

设置

In [1]:

导入包和数据

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:

%matplotlib inline
sns.set_style(‘darkgrid’)

In [3]:

df = pd.read_csv(‘/home/kesci/input/lol8974/high_diamond_ranked_10min.csv’)
df.head()

Out[3]:

gameId blueWins blueWardsPlaced blueWardsDestroyed blueFirstBlood blueKills blueDeaths blueAssists blueEliteMonsters blueDragons redTowersDestroyed redTotalGold redAvgLevel redTotalExperience redTotalMinionsKilled redTotalJungleMinionsKilled redGoldDiff redExperienceDiff redCSPerMin redGoldPerMin
0 4519157822 0 28 2 1 9 6 11 0 0 0 16567 6.8 17047 197 55 -643 8 19.7 1656.7
1 4523371949 0 12 1 0 5 5 5 0 0 1 17620 6.8 17438 240 52 2908 1173 24.0 1762.0
2 4521474530 0 15 0 0 7 11 4 1 1 0 17285 6.8 17254 203 28 1172 1033 20.3 1728.5
3 4524384067 0 43 1 0 4 5 5 1 0 0 16478 7.0 17961 235 47 1321 7 23.5 1647.8
4 4436033771 0 75 4 0 6 6 6 0 0 0 17404 7.0 18313 225 67 1004 -230 22.5 1740.4

5 rows × 40 columns

EDA

In [4]:

检查缺少的值和数据类型

df.info()

<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 9879 entries, 0 to 9878
Data columns (total 40 columns):
gameId 9879 non-null int64
blueWins 9879 non-null int64
blueWardsPlaced 9879 non-null int64
blueWardsDestroyed 9879 non-null int64
blueFirstBlood 9879 non-null int64
blueKills 9879 non-null int64
blueDeaths 9879 non-null int64
blueAssists 9879 non-null int64
blueEliteMonsters 9879 non-null int64
blueDragons 9879 non-null int64
blueHeralds 9879 non-null int64
blueTowersDestroyed 9879 non-null int64
blueTotalGold 9879 non-null int64
blueAvgLevel 9879 non-null float64
blueTotalExperience 9879 non-null int64
blueTotalMinionsKilled 9879 non-null int64
blueTotalJungleMinionsKilled 9879 non-null int64
blueGoldDiff 9879 non-null int64
blueExperienceDiff 9879 non-null int64
blueCSPerMin 9879 non-null float64
blueGoldPerMin 9879 non-null float64
redWardsPlaced 9879 non-null int64
redWardsDestroyed 9879 non-null int64
redFirstBlood 9879 non-null int64
redKills 9879 non-null int64
redDeaths 9879 non-null int64
redAssists 9879 non-null int64
redEliteMonsters 9879 non-null int64
redDragons 9879 non-null int64
redHeralds 9879 non-null int64
redTowersDestroyed 9879 non-null int64
redTotalGold 9879 non-null int64
redAvgLevel 9879 non-null float64
redTotalExperience 9879 non-null int64
redTotalMinionsKilled 9879 non-null int64
redTotalJungleMinionsKilled 9879 non-null int64
redGoldDiff 9879 non-null int64
redExperienceDiff 9879 non-null int64
redCSPerMin 9879 non-null float64
redGoldPerMin 9879 non-null float64
dtypes: float64(6), int64(34)
memory usage: 3.0 MB

In [5]:

df_clean = df.copy()

In [6]:

#删除一些不必要的列。例如,blueFirstblood/redfirst blood blueEliteMonster/redEliteMonster blueDeath/redKills等重复

cols = [‘gameId’, ‘redFirstBlood’, ‘redKills’, ‘redEliteMonsters’, ‘redDragons’,’redTotalMinionsKilled’,
‘redTotalJungleMinionsKilled’, ‘redGoldDiff’, ‘redExperienceDiff’, ‘redCSPerMin’, ‘redGoldPerMin’, ‘redHeralds’,
‘blueGoldDiff’, ‘blueExperienceDiff’, ‘blueCSPerMin’, ‘blueGoldPerMin’, ‘blueTotalMinionsKilled’]
df_clean = df_clean.drop(cols, axis = 1)

In [7]:

df_clean.info()

<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 9879 entries, 0 to 9878
Data columns (total 23 columns):
blueWins 9879 non-null int64
blueWardsPlaced 9879 non-null int64
blueWardsDestroyed 9879 non-null int64
blueFirstBlood 9879 non-null int64
blueKills 9879 non-null int64
blueDeaths 9879 non-null int64
blueAssists 9879 non-null int64
blueEliteMonsters 9879 non-null int64
blueDragons 9879 non-null int64
blueHeralds 9879 non-null int64
blueTowersDestroyed 9879 non-null int64
blueTotalGold 9879 non-null int64
blueAvgLevel 9879 non-null float64
blueTotalExperience 9879 non-null int64
blueTotalJungleMinionsKilled 9879 non-null int64
redWardsPlaced 9879 non-null int64
redWardsDestroyed 9879 non-null int64
redDeaths 9879 non-null int64
redAssists 9879 non-null int64
redTowersDestroyed 9879 non-null int64
redTotalGold 9879 non-null int64
redAvgLevel 9879 non-null float64
redTotalExperience 9879 non-null int64
dtypes: float64(2), int64(21)
memory usage: 1.7 MB

In [8]:

接下来让我们检查blue team特征参数之间的关系

g = sns.PairGrid(data=df_clean, vars=[‘blueKills’, ‘blueAssists’, ‘blueWardsPlaced’, ‘blueTotalGold’], hue=’blueWins’, size=3, palette=’Set1’)
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)
g.add_legend();

/opt/conda/lib/python3.6/site-packages/seaborn/axisgrid.py:1241: UserWarning: The size paramter has been renamed to height; please update your code.
warnings.warn(UserWarning(msg))

我们可以看到变量之间有很多线性对称

In [9]:

我们可以看到很多特征是高度相关的,让我们得到相关矩阵

plt.figure(figsize=(16, 12))
sns.heatmap(df_clean.drop(‘blueWins’, axis=1).corr(), cmap=’YlGnBu’, annot=True, fmt=’.2f’, vmin=0);

In [10]:

基于相关性矩阵,让我们稍微清理一下数据集,以避免共线性

cols = [‘blueAvgLevel’, ‘redWardsPlaced’, ‘redWardsDestroyed’, ‘redDeaths’, ‘redAssists’, ‘redTowersDestroyed’,
‘redTotalExperience’, ‘redTotalGold’, ‘redAvgLevel’]
df_clean = df_clean.drop(cols, axis=1)

In [11]:

接下来让我们删除与bluewins关系不大的列

corr_list = df_clean[df_clean.columns[1:]].apply(lambda x: x.corr(df_clean[‘blueWins’]))
cols = []
for col in corr_list.index:
if (corr_list[col]>0.2 or corr_list[col]<-0.2):
cols.append(col)
cols

Out[11]:

[‘blueFirstBlood’,
‘blueKills’,
‘blueDeaths’,
‘blueAssists’,
‘blueEliteMonsters’,
‘blueDragons’,
‘blueTotalGold’,
‘blueTotalExperience’]

In [12]:

df_clean = df_clean[cols]
df_clean.head()

Out[12]:

blueFirstBlood blueKills blueDeaths blueAssists blueEliteMonsters blueDragons blueTotalGold blueTotalExperience
0 1 9 6 11 0 0 17210 17039
1 0 5 5 5 0 0 14712 16265
2 0 7 11 4 1 1 16113 16221
3 0 4 5 5 1 0 15157 17954
4 0 6 6 6 0 0 16400 18543

In [13]:

df_clean.hist(alpha = 0.7, figsize=(12,10), bins=5);

模型选择

In [14]:

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
X = df_clean
y = df[‘blueWins’]
scaler = MinMaxScaler()
scaler.fit(X)
X = scaler.transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

朴素贝叶斯

In [15]:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

拟合模型

clf_nb = GaussianNB()
clf_nb.fit(X_train, y_train)

pred_nb = clf_nb.predict(X_test)

得到准确性分数

acc_nb = accuracy_score(pred_nb, y_test)
print(acc_nb)

0.7176113360323887

决策树

In [16]:

拟合决策树模型

from sklearn import tree
from sklearn.model_selection import GridSearchCV

tree = tree.DecisionTreeClassifier()

搜索最好的参数

grid = {‘min_samples_split’: [5, 10, 20, 50, 100]},

clf_tree = GridSearchCV(tree, grid, cv=5)
clf_tree.fit(X_train, y_train)

pred_tree = clf_tree.predict(X_test)

获得准确度分数

acc_tree = accuracy_score(pred_tree, y_test)
print(acc_tree)

0.6928137651821862

随机森林

In [17]:

拟合模型

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

搜索最好的参数

grid = {‘n_estimators’:[100,200,300,400,500], ‘max_depth’: [2, 5, 10]}

clf_rf = GridSearchCV(rf, grid, cv=5)
clf_rf.fit(X_train, y_train)

pred_rf = clf_rf.predict(X_test)

获得准确分数

acc_rf = accuracy_score(pred_rf, y_test)
print(acc_rf)

0.7282388663967612

逻辑回归

In [18]:

拟合逻辑回归模型

from sklearn.linear_model import LogisticRegression
lm = LogisticRegression()
lm.fit(X_train, y_train)

得到准确分数

pred_lm = lm.predict(X_test)
acc_lm = accuracy_score(pred_lm, y_test)
print(acc_lm)

0.7302631578947368

/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to ‘lbfgs’ in 0.22. Specify a solver to silence this warning.
FutureWarning)

K-近邻算法(KNN)

In [19]:

拟合模型

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

搜索最好的参数

grid = {“n_neighbors”:np.arange(1,100)}
clf_knn = GridSearchCV(knn, grid, cv=5)
clf_knn.fit(X_train,y_train)

得到准确分数

pred_knn = clf_knn.predict(X_test)
acc_knn = accuracy_score(pred_knn, y_test)
print(acc_knn)

0.7171052631578947

结论

In [20]:

data_dict = {‘Naive Bayes’: [acc_nb], ‘DT’: [acc_tree], ‘Random Forest’: [acc_rf], ‘Logistic Regression’: [acc_lm], ‘K_nearest Neighbors’: [acc_knn]}
df_c = pd.DataFrame.from_dict(data_dict, orient=’index’, columns=[‘Accuracy Score’])
print(df_c)

                 Accuracy Score

Naive Bayes 0.717611
DT 0.692814
Random Forest 0.728239
Logistic Regression 0.730263
K_nearest Neighbors 0.717105

从精度分数可以看出,逻辑回归和随机森林的预测效果最好。接下来让我们仔细看看这两种方法的召回率和精确率

In [21]:

召回率和精确率

from sklearn.metrics import recall_score, precision_score

lm参数

recall_lm = recall_score(pred_lm, y_test, average = None)
precision_lm = precision_score(pred_lm, y_test, average = None)
print(‘precision score for naive bayes: {}\n recall score for naive bayes:{}’.format(precision_lm, recall_lm))

precision score for naive bayes: [0.72736521 0.73313192]
recall score for naive bayes:[0.72959184 0.73092369]

In [22]:

rf参数

recall_rf = recall_score(pred_rf, y_test, average = None)
precision_rf = precision_score(pred_rf, y_test, average = None)
print(‘precision score for naive bayes: {}\n recall score for naive bayes:{}’.format(precision_rf, recall_rf))

precision score for naive bayes: [0.73550356 0.72104733]
recall score for naive bayes:[0.723 0.73360656]

一般情况下,还是会选择逻辑回归

In [23]:

df_clean.columns

Out[23]:

Index([‘blueFirstBlood’, ‘blueKills’, ‘blueDeaths’, ‘blueAssists’,
‘blueEliteMonsters’, ‘blueDragons’, ‘blueTotalGold’,
‘blueTotalExperience’],
dtype=’object’)

In [24]:

lm.coef_

Out[24]:

array([[ 0.09223084, 1.6863957 , -4.9521688 , -0.28960539, 0.30701029,
0.29102466, 5.33422084, 1.55350489]])

In [25]:

np.exp(lm.coef_)

Out[25]:

array([[1.09661794e+00, 5.39998244e+00, 7.06806308e-03, 7.48558900e-01,
1.35935495e+00, 1.33779758e+00, 2.07311157e+02, 4.72801236e+00]])

In [26]:

coef_data = np.concatenate((lm.coef_, np.exp(lm.coef_)),axis=0)
coef_df = pd.DataFrame(data=coef_data, columns=df_clean.columns).T.reset_index().rename(columns={‘index’: ‘Var’, 0: ‘coef’, 1: ‘oddRatio’})
coef_df.sort_values(by=’coef’, ascending=False)

Out[26]:

Var coef oddRatio
6 blueTotalGold 5.334221 207.311157
1 blueKills 1.686396 5.399982
7 blueTotalExperience 1.553505 4.728012
4 blueEliteMonsters 0.307010 1.359355
5 blueDragons 0.291025 1.337798
0 blueFirstBlood 0.092231 1.096618
3 blueAssists -0.289605 0.748559
2 blueDeaths -4.952169 0.007068

PCA

In [27]:

使用PCA让结果可视化

X = df_clean
y = df[‘blueWins’]

PCA受scale的影响,首先要对数据集进行scale

from sklearn import preprocessing

S标准化特征

X = preprocessing.StandardScaler().fit_transform(X)

In [28]:

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
components = pca.fit_transform(X)
print(pca.explained_variance_ratio_)

[0.42435947 0.20805752]

In [29]:

创造可视化df

df_vis = pd.DataFrame(data = components, columns = [‘pc1’, ‘pc2’])
df_vis = pd.concat([df_vis, df[‘blueWins’]], axis = 1)
X = df_vis[[‘pc1’, ‘pc2’]]
y = df_vis[‘blueWins’]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [30]:

重新定义pca数据

lm.fit(X_train, y_train)

/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to ‘lbfgs’ in 0.22. Specify a solver to silence this warning.
FutureWarning)

Out[30]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class=’warn’, n_jobs=None, penalty=’l2’,
random_state=None, solver=’warn’, tol=0.0001, verbose=0,
warm_start=False)

In [31]:

可视化函数

from matplotlib.colors import ListedColormap
def DecisionBoundary(clf):
X = df_vis[[‘pc1’, ‘pc2’]]
y = df_vis[‘blueWins’]

h = .02 

# 创建颜色映射
cmap_light = ListedColormap(['#FFAAAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#0000FF'])

# 绘制决策边界。为此,我们将为每一个分配一个颜色
# 网格 [x_min, x_max]x[y_min, y_max].
x_min, x_max = X.iloc[:, 0].min() - 1, X.iloc[:, 0].max() + 1
y_min, y_max = X.iloc[:, 1].min() - 1, X.iloc[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 8))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot also the training points
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()

In [32]:

DecisionBoundary(lm)

本作品采用《CC 协议》,转载必须注明作者和本文链接
讨论数量: 1

请问下做分类统计检验时,Logistic Regression,怎么得到像SPSS、SAS等软件中的P值和置信区间?

1年前 评论

讨论应以学习和精进为目的。请勿发布不友善或者负能量的内容,与人为善,比聪明更重要!