引用包¶

In [1]:

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directoryimport os
for dirname, _, filenames in os.walk('/kaggle/input'):for filename in filenames:print(os.path.join(dirname, filename))# Any results you write to the current directory are saved as output.

数据导入¶

In [31]:

data_dict = {'color' : ['black', 'white', 'black', 'white', 'black','white', 'black', 'white', 'black', 'white'],'size' : ['S','M','L','M','L','S','S','XL','XL','M'],'date':pd.date_range('1/1/2019',periods=10, freq='W' ),'feature_1': np.random.randn(10),'feature_2': np.random.normal(0.5, 2, 10)}
array=[['A','B','B','B','C','A','B','A','C','C'],['JP','CN','US','US','US','CN','CN','CA','JP','CA']]index = pd.MultiIndex.from_arrays(array, names=['class', 'country'])
data_df = pd.DataFrame(data_dict,index=index)
data_df

Out[31]:

.dataframe thead tr:only-child th {text-align: right;}.dataframe thead th {text-align: left;}.dataframe tbody tr th {vertical-align: top;}

		color	date	feature_1	feature_2	size
class	country
A	JP	black	2019-01-06	-1.234449	-0.133232	S
B	CN	white	2019-01-13	1.308935	-0.493569	M
	US	black	2019-01-20	0.041672	1.014697	L
	US	white	2019-01-27	-0.203778	1.742654	M
C	US	black	2019-02-03	0.419852	-2.964561	L
A	CN	white	2019-02-10	2.350862	-1.895651	S
B	CN	black	2019-02-17	-0.649887	-0.187894	S
A	CA	white	2019-02-24	0.912200	0.782471	XL
C	JP	black	2019-03-03	-1.295436	0.416840	XL
C	CA	white	2019-03-10	0.500633	2.827345	M

分组¶

将dataframe根据size进行分组，得到group_1。在这里我们将GroupBy对象转换list后输出。

In [5]:

group_1 = data_df.groupby('size')
for i in list(group_1):print(i)

('L',                color       date  feature_1  feature_2 size
class country
B     US       black 2019-01-20  -1.204530   2.331003    L
C     US       black 2019-02-03  -0.475149   2.455877    L)
('M',                color       date  feature_1  feature_2 size
class country
B     CN       white 2019-01-13   0.354512  -0.106245    MUS       white 2019-01-27   0.640886   3.105454    M
C     CA       white 2019-03-10   0.471399   1.102412    M)
('S',                color       date  feature_1  feature_2 size
class country
A     JP       black 2019-01-06   0.599631   1.029602    SCN       white 2019-02-10   0.024186   2.412876    S
B     CN       black 2019-02-17   3.110097   0.678240    S)
('XL',                color       date  feature_1  feature_2 size
class country
A     CA       white 2019-02-24   0.890249   1.522595   XL
C     JP       black 2019-03-03  -1.216877   2.321393   XL)

对GroupBy对象进行分组运算，如sum(),非数值数据则不进行分组运算。将计算得到的数据添加表头前缀后输出

In [6]:

group_1.sum().add_prefix('sum_')

Out[6]:

.dataframe thead tr:only-child th {text-align: right;}.dataframe thead th {text-align: left;}.dataframe tbody tr th {vertical-align: top;}

	sum_feature_1	sum_feature_2
size
L	-1.679679	4.786880
M	1.466797	4.101621
S	3.733914	4.120718
XL	-0.326628	3.843988

根据size进行分组后获得所有size值为M的行向量

In [5]:

group_1.get_group('M')

Out[5]:

.dataframe tbody tr th:only-of-type {vertical-align: middle;}.dataframe tbody tr th {vertical-align: top;}.dataframe thead th {text-align: right;}

		color	date	feature_1	feature_2
class	country
B	CN	white	2019-01-13	1.735991	0.383047
B	US	white	2019-01-27	-0.847715	-2.327769
C	CA	white	2019-03-10	-0.818303	1.317979

将dataframe根据size和color两个列标签进行多重分组，得到group_2

In [34]:

group_2 = data_df.groupby(['size', 'color'])
for i in list(group_2):print(i)

(('L', 'black'),                color       date  feature_1  feature_2 size
class country
B     US       black 2019-01-20   0.041672   1.014697    L
C     US       black 2019-02-03   0.419852  -2.964561    L)
(('M', 'white'),                color       date  feature_1  feature_2 size
class country
B     CN       white 2019-01-13   1.308935  -0.493569    MUS       white 2019-01-27  -0.203778   1.742654    M
C     CA       white 2019-03-10   0.500633   2.827345    M)
(('S', 'black'),                color       date  feature_1  feature_2 size
class country
A     JP       black 2019-01-06  -1.234449  -0.133232    S
B     CN       black 2019-02-17  -0.649887  -0.187894    S)
(('S', 'white'),                color       date  feature_1  feature_2 size
class country
A     CN       white 2019-02-10   2.350862  -1.895651    S)
(('XL', 'black'),                color       date  feature_1  feature_2 size
class country
C     JP       black 2019-03-03  -1.295436    0.41684   XL)
(('XL', 'white'),                color       date  feature_1  feature_2 size
class country
A     CA       white 2019-02-24     0.9122   0.782471   XL)

对分组后的数据size函数获得组别个数

In [9]:

print(group_1.size())
print(group_2.size())

size
L     2
M     3
S     3
XL    2
dtype: int64
size  color
L     black    2
M     white    3
S     black    2white    1
XL    black    1white    1
dtype: int64

此外还可以利用函数进行分组。注意到groupby函数中axis=1对列进行分组，axis=0对行进行分组

In [10]:

def get_letter_type(letter):if 'feature' in letter:return 'feature'else:return 'other'for i in list(data_df.groupby(get_letter_type, axis=1)):print(i)

('feature',                feature_1  feature_2
class country
A     JP        0.599631   1.029602
B     CN        0.354512  -0.106245US       -1.204530   2.331003US        0.640886   3.105454
C     US       -0.475149   2.455877
A     CN        0.024186   2.412876
B     CN        3.110097   0.678240
A     CA        0.890249   1.522595
C     JP       -1.216877   2.321393CA        0.471399   1.102412)
('other',                color       date size
class country
A     JP       black 2019-01-06    S
B     CN       white 2019-01-13    MUS       black 2019-01-20    LUS       white 2019-01-27    M
C     US       black 2019-02-03    L
A     CN       white 2019-02-10    S
B     CN       black 2019-02-17    S
A     CA       white 2019-02-24   XL
C     JP       black 2019-03-03   XLCA       white 2019-03-10    M)

分组对象除了列标签之外，还可以是索引，其中用不同的level值来区分多重索引。

In [16]:

for i in list(data_df.groupby(level=[0,1])):print(i)

(('A', 'CA'),                color       date  feature_1  feature_2 size
class country
A     CA       white 2019-02-24   0.890249   1.522595   XL)
(('A', 'CN'),                color       date  feature_1  feature_2 size
class country
A     CN       white 2019-02-10   0.024186   2.412876    S)
(('A', 'JP'),                color       date  feature_1  feature_2 size
class country
A     JP       black 2019-01-06   0.599631   1.029602    S)
(('B', 'CN'),                color       date  feature_1  feature_2 size
class country
B     CN       white 2019-01-13   0.354512  -0.106245    MCN       black 2019-02-17   3.110097   0.678240    S)
(('B', 'US'),                color       date  feature_1  feature_2 size
class country
B     US       black 2019-01-20  -1.204530   2.331003    LUS       white 2019-01-27   0.640886   3.105454    M)
(('C', 'CA'),                color       date  feature_1  feature_2 size
class country
C     CA       white 2019-03-10   0.471399   1.102412    M)
(('C', 'JP'),                color       date  feature_1  feature_2 size
class country
C     JP       black 2019-03-03  -1.216877   2.321393   XL)
(('C', 'US'),                color       date  feature_1  feature_2 size
class country
C     US       black 2019-02-03  -0.475149   2.455877    L)

DataFrame可以根据列标签和索引两者同时分组,分组后可以在不同组间进行迭代

In [10]:

group_3=df.groupby(['country','color'])
for name, group in group_3:print(name)print(group)

('CA', 'white')color size       date  feature_1  feature_2
class country
A     CA       white   XL 2019-02-24   0.412967   1.196859
C     CA       white    M 2019-03-10  -0.818303   1.317979
('CN', 'black')color size       date  feature_1  feature_2
class country
B     CN       black    S 2019-02-17  -0.058021  -2.420962
('CN', 'white')color size       date  feature_1  feature_2
class country
B     CN       white    M 2019-01-13   1.735991   0.383047
A     CN       white    S 2019-02-10   0.282515   3.156525
('JP', 'black')color size       date  feature_1  feature_2
class country
A     JP       black    S 2019-01-06   0.997065  -1.018255
C     JP       black   XL 2019-03-03   0.513201  -3.266357
('US', 'black')color size       date  feature_1  feature_2
class country
B     US       black    L 2019-01-20  -0.547211   0.693104
C     US       black    L 2019-02-03  -0.245918   4.444044
('US', 'white')color size       date  feature_1  feature_2
class country
B     US       white    M 2019-01-27  -0.847715  -2.327769

当GroupBy对象被建立后，我们可以用agg函数对分组后的数据进行计算。下例中计算了group_3中feature_1的最大值和feature_2的均值。

In [17]:

group_2.agg({'feature_1' : np.min,'feature_2' : np.mean})

Out[17]:

.dataframe thead tr:only-child th {text-align: right;}.dataframe thead th {text-align: left;}.dataframe tbody tr th {vertical-align: top;}

		feature_1	feature_2
size	color
L	black	-1.204530	2.393440
M	white	0.354512	1.367207
S	black	0.599631	0.853921
S	white	0.024186	2.412876
XL	black	-1.216877	2.321393
XL	white	0.890249	1.522595

接下来我们使用transform函数对groupby对象进行变换，transform的计算结果和原始数据的形状保持一致。下例中我们自定义了函数data_range来获得根据size分组后各个值的范围。

In [18]:

data_range = lambda x: x.max() - x.min()
data_df.groupby('size').transform(data_range)

Out[18]:

.dataframe thead tr:only-child th {text-align: right;}.dataframe thead th {text-align: left;}.dataframe tbody tr th {vertical-align: top;}

		date	feature_1	feature_2
class	country
A	JP	42 days	3.085912	1.734636
B	CN	56 days	0.286375	3.211699
	US	14 days	0.729382	0.124874
	US	56 days	0.286375	3.211699
C	US	14 days	0.729382	0.124874
A	CN	42 days	3.085912	1.734636
B	CN	42 days	3.085912	1.734636
A	CA	7 days	2.107125	0.798798
C	JP	7 days	2.107125	0.798798
C	CA	56 days	0.286375	3.211699

另外我们还常常通过transform函数将缺失值替换为组间平均值。

In [29]:

data_df.iloc[1, 2:4] = np.NaN
group_4 = data_df.groupby('size')
f = lambda x: x.fillna(x.mean())
df_trans = group_4.transform(f)
df_trans

C:\ProgramData\Anaconda3\lib\site-packages\ipykernel\kernelbase.py:399: PerformanceWarning: indexing past lexsort depth may impact performance.user_expressions, allow_stdin)

Out[29]:

.dataframe thead tr:only-child th {text-align: right;}.dataframe thead th {text-align: left;}.dataframe tbody tr th {vertical-align: top;}

		feature_1	feature_2
class	country
A	JP	-0.023671	-0.409491
B	CN	-0.091596	-1.399647
	US	1.085396	2.245660
	US	-0.127399	-1.747656
C	US	-2.046202	3.475487
A	CN	-1.076002	2.705517
B	CN	0.184117	2.913971
A	CA	0.601222	-2.098025
C	JP	-0.009375	-3.623235
C	CA	-0.055794	-1.051638

In [30]:

data_df

Out[30]:

.dataframe thead tr:only-child th {text-align: right;}.dataframe thead th {text-align: left;}.dataframe tbody tr th {vertical-align: top;}

		color	date	feature_1	feature_2	size
class	country
A	JP	black	2019-01-06	-0.023671	-0.409491	S
B	CN	white	2019-01-13	NaN	NaN	M
	US	black	2019-01-20	1.085396	2.245660	L
	US	white	2019-01-27	-0.127399	-1.747656	M
C	US	black	2019-02-03	-2.046202	3.475487	L
A	CN	white	2019-02-10	-1.076002	2.705517	S
B	CN	black	2019-02-17	0.184117	2.913971	S
A	CA	white	2019-02-24	0.601222	-2.098025	XL
C	JP	black	2019-03-03	-0.009375	-3.623235	XL
C	CA	white	2019-03-10	-0.055794	-1.051638	M

根据列标签color进行分组后对列标签feature_1使用rolling方法,滚动计算最新的三个值的平均。

In [32]:

data_df.groupby('color').rolling(3).feature_1.mean()

Out[32]:

color  class  country
black  A      JP              NaNB      US              NaNC      US        -0.257642B      CN        -0.062787C      JP        -0.508490
white  B      CN              NaNUS              NaNA      CN         1.152006CA         1.019761C      CA         1.254565
Name: feature_1, dtype: float64

expanding函数会对给定的操作(如下例中的sum)进行叠加

In [47]:

data_df.groupby('color').expanding(3).feature_1.sum()

Out[47]:

color  class  country
black  A      JP              NaNB      US              NaNC      US        -0.772925B      CN        -1.422812C      JP        -2.718247
white  B      CN              NaNUS              NaNA      CN         3.456018CA         4.368218C      CA         4.868851
Name: feature_1, dtype: float64

filter函数的参数是作用于整个组且返回值为True或False的函数，我们可以利用filter函数得到分组后的某些特定组别,如下例中元素个数大于三的分组。

In [36]:

data_df.groupby('class').filter(lambda x: len(x) > 3)

Out[36]:

.dataframe thead tr:only-child th {text-align: right;}.dataframe thead th {text-align: left;}.dataframe tbody tr th {vertical-align: top;}

		color	date	feature_1	feature_2	size
class	country
B	CN	white	2019-01-13	1.308935	-0.493569	M
	US	black	2019-01-20	0.041672	1.014697	L
	US	white	2019-01-27	-0.203778	1.742654	M
	CN	black	2019-02-17	-0.649887	-0.187894	S

有些对于分组数据的处理用transform和aggregate都很难完成，这时候我们需要使用apply函数，apply相较两者更加灵活。在apply中可使用用自定义函数。

In [37]:

data_df.groupby('class')['feature_1'].apply(lambda x: x.describe())

Out[37]:

class
A      count    3.000000mean     0.676204std      1.804268min     -1.23444925%     -0.16112550%      0.91220075%      1.631531max      2.350862
B      count    4.000000mean     0.124235std      0.840077min     -0.64988725%     -0.31530650%     -0.08105375%      0.358488max      1.308935
C      count    3.000000mean    -0.124984std      1.014446min     -1.29543625%     -0.43779250%      0.41985275%      0.460243max      0.500633
Name: feature_1, dtype: float64

In [38]:

def f(group):return pd.DataFrame({'original' : group,'demeaned' : group - group.mean()})
data_df.groupby('class')['feature_1'].apply(f)

Out[38]:

.dataframe thead tr:only-child th {text-align: right;}.dataframe thead th {text-align: left;}.dataframe tbody tr th {vertical-align: top;}

		demeaned	original
class	country
A	JP	-1.910653	-1.234449
B	CN	1.184700	1.308935
	CN	-0.774122	-0.649887
	US	-0.082563	0.041672
	US	-0.328014	-0.203778
C	US	0.544836	0.419852
A	CN	1.674658	2.350862
A	CA	0.235996	0.912200
C	JP	-1.170452	-1.295436
C	CA	0.625616	0.500633

参考文献¶

pandas toolkit
https://cloud.tencent.com/developer/article/1193823

用修改css的方法导入的juypter notebook文件相关推荐

javascript、jquery 动态修改css样式方法
javascript.jquery 动态修改css样式方法 javascript 修改样式的方法第一种.使用obj.className来修改样式表的类名 var obj = document.get ...
javascript 动态修改css样式方法汇总(四种方法)
在很多情况下,都需要对网页上元素的样式进行动态的修改.在JavaScript中提供几种方式动态的修改样式,下面将介绍方法的使用.效果.以及缺陷. 1.使用obj.className来修改样式表的类名. ...
mysql 批量导入sql_MySQL高效导入多个.sql文件方法详解
MySQL有多种方法导入多个.sql文件(里面是sql语句),常用的有两个命令:mysql和source. 但是这两个命令的导入效率差别很大,具体请看最后的比较. (还有sqlimport和LOAD ...
linux 批量更换文件名,Linux下批量修改文件名的方法
Linux下批量修改文件名的方法在Linux环境下修改文件名可以有不同的命令方式,比如rename.mv都可以进行修改,如果用户正在了解这方面的知识,可以阅读下文了解Linux修改文件名以及批量修改 ...
linux下文件怎么改名字,Linux文件名字批量修改步骤 Linux修改文件名的方法
在Linux环境下修改文件名可以有不同的命令方式,比如rename.mv都可以进行修改,如果用户正在了解这方面的知识,可以阅读下文了解Linux修改文件名以及批量修改文件名的方法. 对于单个的文件,可 ...
子报表修改后需要重新导入，0.00显示.00的调整方法
水晶报表笔记: 子报表修改后需要重新导入 0.00显示.00的调整方法:数字格式的内容,右键,设置对象格式, 自定义,前导值勾上,默认值,选择0 转载于:https://www.cn ...
html之四种方法导入css...
原文地址:http://hi.baidu.com/lizhanfu/blog/item/4dbc806499bed82fab184c9c.html 在html中,引入css的方法主要有行内式.内嵌式. ...
如何修改zblog的css,修改Zblog中的CSS的方法
在初学CSS的时候应该都有用过Zblog的主题吧,那么你知道如何修改Zblog中的CSS吗?如果想了解的朋友们就跟爱站小编一起去了解下修改Zblog中的CSS的方法. 在学习应用css之前我们要先了解 ...
html浅色_修改CSS样式实现网页灰色(没有颜色只有浅色黑白)的几个方法整理
一般在清明节,全国哀悼日,大地震的日子,以及一些影响力很大的伟人逝世或纪念日的时候,身为站长的我们都会让自己的网站的全部网页变成灰色(黑白色),以表示我们对逝者的悼念.那么今天就说说,通过几行简单的代 ...

用修改css的方法导入的juypter notebook文件

引用包¶

数据导入¶

分组¶

参考文献¶

用修改css的方法导入的juypter notebook文件相关推荐

最新文章

热门文章