大数据第二次作业

Assignment 2: Pandas and matplotlib

题目描述：

介绍数据集的背景

League of Legends (LoL) is one of the most popular games in the world today. There are some datasets that aim to provide information for data scientists interested in analyzing the game and, hopefully, playing a bit better. smile This assignment will focus on the (LoL) League of Legends Ranked Games dataset, which gathers details from more than 50,000 ranked LoL games. This dataset is available at https://www.kaggle.com/datasnaek/league-of-legends.

此次数据集是针对S9的LOL比赛，确实令我这个LOL狂热玩家激动。

数据集内容

According to the website, This is a collection of over 50,000 ranked EUW games from the game League of Legends, as well as json files containing a way to convert between champion and summoner spell IDs and their names. For each game, there are fields for:

Game ID

Creation Time (in Epoch format)

Game Duration (in seconds)

Season ID

Winner (1 = team1, 2 = team2)

First Baron, dragon, tower, blood, inhibitor and Rift Herald (1 = team1, 2 = team2, 0 = none)

Champions and summoner spells for each team (Stored as Riot’s champion and summoner spell IDs)

The number of tower, inhibitor, Baron, dragon and Rift Herald kills each team has

The 5 bans of each team (Again, champion IDs are used)

题目要求

Obtain this dataset and familiarize yourself with it. Using this dataset, you are expected to answer some questions and create some plots:

Which champion id is more often associated with a team winning?

Which champion id is more often associated with a team losing?

Which champion id being banned is more often associated with a team losing?

Which two champion ids appear more often together in winning teams (where they are not banned)?

Create a scatterplot showing the number of times each champion is part of a winning team.

Furthermore, create a violin plot to show the distribution of wins for the various champions. This plot will also highlight whether there are outliers, i.e., champions that win or lose much more often than the median.

Finally, you should plot a bar chart and a pie chart with the top 20 most often winning champions.

题目附加要求

Your charts should be readable, without overlapping names, and with tight borders. Ideally, they should be part of the same figure, appearing as subplots.

The deliverable for this assignment is a Jupyter Notebook with all the code that you wrote. The notebook should also use the functions you built so that the responses to the questions posed in the tasks are visible in the notebook. The notebook should also include the code to produce the plots so that all the responses can be seen in the same notebook.

这个数据集中的内容有很多都是用不到的

首先导下库

其中seaborn是用来画小提琴图的

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

设置一下文件路径

# set the path
path = './archive'
gamepath = './archive/games.csv'
champ1path = './archive/champion_info.json'
champ2path = './archive/champion_info_2.json'
spellpath = './archive/summoner_spell_info.json'

其中champ2path和spellpath(存放召唤师技能的细节)都没用到过

该csv文件是按照逗号正常分隔的，直接调用pandas的read_csv方法来获得DataFrame。

gamedata = pd.read_csv(gamepath)

试图通过科学的方法来减少复制粘贴，但是其实没必要…

# create table columns names
champ_team_list=['winner']
team_ban=[]
for i in range(1,3):for j in range(1,6):champ_team_list.append('t{}_champ{}id'.format(i,j))team_ban.append('t{}_ban{}'.format(i,j))
team1_champid_list=champ_team_list[1:6]
team2_champid_list=champ_team_list[6:]
team1_ban=team_ban[:5]
team2_ban=team_ban[5:]

就是为了获得

team1_champid_list

['t1_champ1id', 't1_champ2id', 't1_champ3id', 't1_champ4id', 't1_champ5id']

team1_ban

['t1_ban1', 't1_ban2', 't1_ban3', 't1_ban4', 't1_ban5']

这样的效果

直接把1-3问大一统地解决掉，核心方法是字典，字典的键是champion id，值初始化为0。它们仨的初始化其实直接浅拷贝就行，就是用.copy()方法，注意champion_banned_lose中有-1键，这个键并没有出现在champion id里面；-1代表着空ban：

%%time
champid_dict_won = {i:0 for i in pd.read_json(champ1path).index.values}
champid_dict_lose = {i:0 for i in pd.read_json(champ1path).index.values}
champid_banned_lose = {i:0 for i in pd.read_json(champ1path).index.values}
champid_banned_lose[-1] = 0for i in range(len(gamedata)):if gamedata.loc[i, 'winner'] == 1:for id in gamedata.loc[i, team1_champid_list].T.values:champid_dict_won[id] += 1for id in gamedata.loc[i, team2_champid_list].T.values:champid_dict_lose[id] += 1for id in gamedata.loc[i, team1_ban].T.values:champid_banned_lose[id] += 1elif gamedata.loc[i, 'winner'] == 2:for id in gamedata.loc[i, team2_champid_list].T.values:champid_dict_won[id] += 1for id in gamedata.loc[i, team1_champid_list].T.values:champid_dict_lose[id] += 1for id in gamedata.loc[i, team2_ban].T.values:champid_banned_lose[id] += 1

其中，用转置T是为了消除掉列表中括号

%%time获得的值是Wall time: 3min 43s

# number of champions
len(champid_dict_won)

可以知道在S9的时候有138个英雄。

写一个从字典中，获取键对应值的最大值的键和值的函数：

# return the maximum value and corresponding key from a dictionary
def get_max_from_dict(adict):max_key = max(adict.keys(),key=(lambda x:adict[x]))max_value = adict[max(adict.keys(),key=(lambda x:adict[x]))]print('id:',max_key, 'value:', max_value)

那么可以直接对问题解答了：

Which champion id is more often associated with a team winning?

get_max_from_dict(champid_dict_won)

id: 18 value: 6713

id为18的英雄是崔丝塔娜，小炮

Which champion id is more often associated with a team losing?

get_max_from_dict(champid_dict_lose)

id: 412 value: 6859

id为412的英雄是锤石

Which champion id being banned is more often associated with a team losing?

get_max_from_dict(champid_banned_lose)

id: 157 value: 16519

id为157的英雄是亚索，这个真是能笑死个人，快乐风男被ban了赢得概率就大了，这是不让队友玩的原因还是不让对面玩的原因？

对于问题4

Which two champion ids appear more often together in winning teams (where they are not banned)?

要找出来一对英雄，思来想去用的方法是笛卡尔乘积，但是要剔除掉重复的，所以按照n+(n-1)+···+1的总数来弄类笛卡尔乘积，首先先对数据排好序，然后再生成即可。

champid_list = sorted([i for i in pd.read_json(champ1path).index.values])
champid_tuple = []for i in range(len(champid_list)):for j in range(len(champid_list[i+1:])):champid_tuple.append((champid_list[i], champid_list[i+1:][j]))champid_win_both = {i:0 for i in champid_tuple}

%%time
for i in range(len(gamedata)):champid_tuple_demo=[]if gamedata.loc[i, 'winner'] == 1:data = gamedata.loc[i, team1_champid_list].T.valueselif gamedata.loc[i, 'winner'] == 2:data = gamedata.loc[i, team2_champid_list].T.valuessorted_data = sorted(data)for i in range(len(sorted_data)):for j in range(len(sorted_data[i+1:])):champid_tuple_demo.append((sorted_data[i], sorted_data[i+1:][j]))for i in champid_tuple_demo:if i in champid_tuple:champid_win_both[i] += 1

这个是在双核奔腾笔记本上运行的，Wall time是3min 56s。

# champion ids appear more often together in winning teams
get_max_from_dict(champid_win_both)

id: (497, 498) value: 1242

果然在情理之中啊，霞、洛在一起，是与队伍胜利联系最大的一对英雄！因为在官方设定里面他俩就是一对情侣啊！回城的时候还能一起回城。

我用英文是这样描述的：

The happy thing is that these two ids are correspond to Rakan and Xayah who are lovers in the game.

对于第五题，画散点图，我想着x坐标用id来比较好，毕竟如果把138个英雄名写上去太挤了。

Create a scatterplot showing the number of times each champion is part of a winning team.

won_id_times = sorted(champid_dict_won.items(), key=lambda d:d[0], reverse=False)
scatter_x = [i[0] for i in won_id_times]
scatter_y = [i[1] for i in won_id_times]# set the figure size
fig = plt.figure(figsize=(10,5))
ax1 = fig.add_subplot(1,1,1)
ax1.set_title('Champion id and Times on Winning Team')  plt.xlabel('Champion id')
plt.ylabel('Times')  ax1.scatter(scatter_x, scatter_y, s=20)  plt.show()

获得的散点图如下：

对于第六题，是最令我头疼的，

Furthermore, create a violin plot to show the distribution of wins for the various champions. This plot will also highlight whether there are outliers, i.e., champions that win or lose much more often than the median.

小提琴图，第一次听说。仔细查找相关文档发现这东西和箱线图差不多，就是在seaborn上讲的很细。

比如说这张图：

主要困难的地方是画出来离群点，因为我的这个数据中没有y值，所以用matplotlib里的方法，就算把y调成0也画不出来。

于是向外教发邮件，恭恭敬敬地内容如下：

Dear professor,
     I am trapped in question #6 “Furthermore, create a violin plot to show the distribution of wins for the various champions. This plot will also highlight whether there are outliers, i.e., champions that win or lose much more often than the median.” Although I have plotted the violin plot of wins for the various champions which is the graph as an attachment, I have trouble in highlighting the max value on the plot.
     The fatal question for me is that it is hard to plot a yellow circle as the highlight point onto the plot since this violin plot created by seaborn does not have y values.
     I would be appreciate it if you give me some helpful suggestion.
     Yours,
     sincerely student

其中的附件就是上图的那个小提琴图，但是外教没回我，直到作业提交完成的第二天早上我才收到姗姗来迟的邮件：

Dear XXX,
You can change the orientation of the violin plot by passing the orient optional named parameter with argument “v” (for “vertical”). Would that help you solve the problem?
I apologize for the long delay. This message got lost in my inbox.
Kind regards,
XXX

于是向大佬请教，stackoverflow上，csdn上都提问了，链接如下：StackOverflow询问情况（图片预览需要爬梯子才能加载）；CSDN询问情况。

发现最后的解决方案非常简单啊，用seaborn的画布不就行啦！

champid_list = sorted([i for i in pd.read_json(champ1path).index.values])champid_won_times = np.array([i[1] for i in sorted(champid_dict_won.items(), key=lambda d:d[0], reverse=False)])
champid_lose_times = np.array([i[1] for i in sorted(champid_dict_lose.items(), key=lambda d:d[0], reverse=False)])datadict = {'id':champid_list, 'won_times':champid_won_times, 'lose_times':champid_lose_times}champ_won_lose = pd.DataFrame(data=datadict)

# violin plot with outliers
# outliers were highlighted by plotting red pointsfig, ((ax1, ax2)) = plt.subplots(nrows=2, ncols=1, figsize=(10, 5), sharex=True)data_won_violin = champ_won_lose['won_times']
data_lose_violin = champ_won_lose['lose_times']q1_won, q3_won = np.percentile(champ_won_lose['won_times'], [25, 75])
whisker_low_won = q1_won - (q3_won - q1_won) * 1.5
whisker_high_won = q3_won + (q3_won - q1_won) * 1.5q1_lose, q3_lose = np.percentile(champ_won_lose['lose_times'], [25, 75])
whisker_low_lose = q1_lose - (q3_lose - q1_lose) * 1.5
whisker_high_lose = q3_lose + (q3_lose - q1_lose) * 1.5sns.violinplot(x=data_won_violin, color='CornflowerBlue', ax=ax1)
outliers_won = data_won_violin[(data_won_violin > whisker_high_won) | (data_won_violin < whisker_low_won)]
sns.scatterplot(x=outliers_won, y=0, marker='o', color='crimson', ax=ax1)sns.violinplot(x=data_lose_violin, color='CornflowerBlue', ax=ax2)
outliers_lose = data_lose_violin[(data_lose_violin > whisker_high_lose) | (data_lose_violin < whisker_low_lose)]
sns.scatterplot(x=outliers_lose, y=0, marker='o', color='crimson', ax=ax2)ax1.set_title('Won Violin Plot')
ax2.set_title('Lose Violin Plot')plt.tight_layout()
plt.show()

效果如图：

最后一题，

Finally, you should plot a bar chart and a pie chart with the top 20 most often winning champions.

画柱状图和饼图

由于题目没明确是胜率还是上面一二题的那种关联问题，于是我做了两类图。

此外，我是用这样的英文描述我的困惑的：

Since the description of the title is not very accurate, two methods are used to draw the pictures. One is based on the victory of the champion and the team, and the other is based on the victory rate of the champion.

排序函数很重要。

id_name_map = {i['id']: i['name'] for i in pd.read_json(champ1path).data.values}won_rate = (champ_won_lose['won_times']/(champ_won_lose['lose_times']+champ_won_lose['won_times'])).to_list()
won_rate_dict = {x:0 for x in champid_list}
for i in range(len(won_rate)):won_rate_dict[champid_list[i]] += won_rate[i]rate_won_id_times_sorted_with_value_20 = sorted(won_rate_dict.items(), key=lambda d:d[1], reverse=True)[:20]
bar_x_rate = [id_name_map[i[0]] for i in rate_won_id_times_sorted_with_value_20]
bar_y_rate = [i[1] for i in rate_won_id_times_sorted_with_value_20]won_id_times_sorted_with_value_20 = sorted(champid_dict_won.items(), key=lambda d:d[1], reverse=True)[:20]
bar_x = [id_name_map[i[0]] for i in won_id_times_sorted_with_value_20]
bar_y = [i[1] for i in won_id_times_sorted_with_value_20]

了解一下top 20的英雄id

# check the top20 won champion ids
won_id_times_sorted_with_value_20

[(18, 6713),(412, 6143),(67, 5498),(40, 4826),(141, 4807),(29, 4665),(64, 4217),(222, 4087),(157, 3948),(202, 3925),(236, 3915),(498, 3906),(99, 3560),(53, 3506),(497, 3433),(117, 3367),(24, 3347),(103, 3295),(61, 3286),(21, 3256)]

把英雄id以及它对应的英雄名搞成键值对抛进字典里：

id_name_map = {i['id']: i['name'] for i in pd.read_json(champ1path).data.values}

非胜率饼图：

# pie chart
fig = plt.figure(figsize=(10,10))
ax1 = fig.add_subplot(1,1,1)
ax1.set_title('Pie Chart for Top 20 Most Winning Champions')
plt.pie(bar_y, labels=bar_x, autopct='%0.2f%%')
plt.show()

胜率饼图：

# pie chart (winning rate)
fig = plt.figure(figsize=(10,10))
ax1 = fig.add_subplot(1,1,1)
ax1.set_title('Pie Chart for Top 20 Most Winning Champions (Winning Rate)')
plt.pie(bar_y_rate, labels=bar_x_rate, autopct='%0.2f%%')
plt.show()

柱状图也一样简单：

非胜率柱状图：

# bar chart
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(1,1,1)
ax1.set_title('Bar Chart for Top 20 Most Winning Champions')
plt.bar(bar_x,bar_y)
plt.xticks(fontsize = 12, rotation=45)
plt.xlabel("Champion Name")
plt.ylabel("Times")plt.show()

胜率柱状图：

# bar chart (winning rate)
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(1,1,1)
ax1.set_title('Bar Chart for Top 20 Most Winning Champions (Winning Rate)')
plt.bar(bar_x_rate,bar_y_rate)
plt.xticks(fontsize = 12, rotation=45)
plt.xlabel("Champion Name")
plt.ylabel("Times")plt.show()

終わり。

不懂，就钻；还是难受，那不如不懂就问。

Analysis for INFO 212 Data Science Programming I Assignment 2相关推荐

我们分析了全美Top Business Analyst 和 Data Science专业，最后给你总结了这几点
身边很多朋友提过或者是想要走进大数据这个行业每个人的Background不一样,能力,擅长的领域都不一样小编有一句发自肺腑的话要说给大家听: 不是热门的,薪水高的,所谓好找工作的专业就值得去学: ...
Data Science 到底是什么？
最近被问到了一个问题:Data Science是干什么的? 尽管一直在说Data Science,但是还真的没有深入的.认真的研究过它的起源. Data Science,数据科学,一般的解释是: 数据 ...
Cousera-Introduction to Data Science in Python Assignment1-4答案
Cousera-Introduction to Data Science in Python Assignment1-4答案全是密歇根大学的python课程,作业难度对于初学者来说着实不太友好啊hh ...
python选课系统_【精选】在Monash读Data Science，人人都拥有这样一份选课指南。
点击上方"蓝字",关注最适合你的学习咨询前言 1.课程难度因人而异,课程作业也可能每学期变动,所以大家结合个人实际情况参考借鉴. 2.本指南系列只描述了比较最主流的课,冷门课程资 ...
数据挖掘(data mining)，机器学习(machine learning)，和人工智能(AI)的区别是什么？数据科学(data science)和商业分析(business analytics
数据挖掘(data mining),机器学习(machine learning),和人工智能(AI)的区别是什么? 数据科学(data science)和商业分析(business analytics ...
STATS 4014 Advanced Data Science
STATS 4014 Advanced Data Science Assignment 3 Jono Tuke Semester 1 2019 CHECKLIST : Have you shown a ...
Introducing DataFrames in Apache Spark for Large Scale Data Science（中英双语）
文章标题 Introducing DataFrames in Apache Spark for Large Scale Data Science 一个用于大规模数据科学的API--DataFrame ...
Coursera | Applied Data Science with Python 专项课程 | Applied Machine Learning in Python
本文为学习笔记,记录了由University of Michigan推出的Coursera专项课程--Applied Data Science with Python中Course Three: Ap ...
Coursera | Introduction to Data Science in Python（University of Michigan）| Assignment4
u1s1,这门课的assignment还是有点难度的,特别是assigment4(哀怨),放给大家参考啦~ 有时间(需求)就把所有代码放到github上(好担心被河蟹啊) 先放下该课 ...

Analysis for INFO 212 Data Science Programming I Assignment 2