Pandas [In-Depth Study] 06: Deduplication, Mapping, Outlier Detection and Filtering, Sorting, Aggregation

1. Removing duplicate elements

In [1]:

import numpy as np

import pandas as pd

from pandas import Series,DataFrame

In [2]:

df = DataFrame({'color':['red','white','red','green'],'size':[10,20,10,30]})

df

Out[2]:

	color	size
0	red	10
1	white	20
2	red	10
3	green	30

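Use duplicated() to detect duplicate rows. It returns a boolean Series with one element per row; an element is True when that row has already appeared earlier in the DataFrame.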

In [3]:

df.duplicated()

Out[3]:

0    False

1    False

2     True

3    False

dtype: bool

Use drop_duplicates() to delete the duplicate rows.

In [4]:

df.drop_duplicates()

Out[4]:

	color	size
0	red	10
1	white	20
3	green	30
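Both functions take a few options that change what counts as a duplicate. The following is a minimal sketch on the same sample data; the variable names (last_kept, by_color, all_dups) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'white', 'red', 'green'],
                   'size': [10, 20, 10, 30]})

# keep='last' keeps the final occurrence of each duplicated row
last_kept = df.drop_duplicates(keep='last')

# subset restricts the comparison to the listed columns
by_color = df.drop_duplicates(subset=['color'])

# keep=False flags every member of a duplicated group
all_dups = df.duplicated(keep=False)

print(last_kept.index.tolist())   # [1, 2, 3]
print(by_color.index.tolist())    # [0, 1, 3]
print(all_dups.tolist())          # [True, False, True, False]
```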

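If pd.concat([df1,df2],axis = 1) produces a new DataFrame whose column names repeat, duplicated() and drop_duplicates() may run into problems (in the run below, both happen to return the expected result).

In [5]:

# The column names repeat here, which may cause problems when removing duplicates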

df2 = pd.concat((df,df),axis = 1)

df2

Out[5]:

	color	size	color	size
0	red	10	red	10
1	white	20	white	20
2	red	10	red	10
3	green	30	green	30

In [6]:

df2.duplicated()

Out[6]:

0    False

1    False

2     True

3    False

dtype: bool

In [7]:

df2.drop_duplicates()

Out[7]:

	color	size	color	size
0	red	10	red	10
1	white	20	white	20
3	green	30	green	30

In [8]:

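# drop_duplicates() drops the rows flagged by duplicated()

# Attempt to reproduce df.drop_duplicates() by hand:

du = df.duplicated()

# du = [0,0,1,0]

display(du)

# drop() expects index labels, not a boolean mask, so this raises KeyError

df.drop(du)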

0    False

1    False

2     True

3    False

dtype: bool

---------------------------------------------------------------------------KeyError

KeyError: '[False False  True False] not found in axis'

In [9]:

# Return the logical NOT of du

np.logical_not(du)

Out[9]:

0     True

1     True

2    False

3     True

dtype: bool

In [10]:

df[np.logical_not(du)]

Out[10]:

	color	size
0	red	10
1	white	20
3	green	30
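The failed df.drop(du) call and the working boolean mask can be reconciled. As a small sketch (variable names are illustrative): the ~ operator inverts a boolean Series just like np.logical_not, and drop() works once the mask is converted to index labels.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'white', 'red', 'green'],
                   'size': [10, 20, 10, 30]})
du = df.duplicated()

# ~ inverts a boolean Series, equivalent to np.logical_not(du)
kept_by_mask = df[~du]

# drop() needs index labels: du[du] keeps only the True entries,
# and .index gives the labels of those rows
kept_by_drop = df.drop(du[du].index)

print(kept_by_mask.equals(df.drop_duplicates()))   # True
print(kept_by_drop.index.tolist())                 # [0, 1, 3]
```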

In [11]:

df

Out[11]:

	color	size
0	red	10
1	white	20
2	red	10
3	green	30

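2. Mapping

What mapping means: build a table of correspondences that binds each element in values to a specific label or string.

This requires a dictionary:

mapping = { 'label1':'value1', 'label2':'value2', ... }

Three operations are covered:

replace(): replace elements
Most important: map(): create a new column
rename(): replace index labels

1) replace(): replacing elements

Use replace() to substitute values.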

In [13]:

#red = 10

#green = 20

color = {'red':10,'green':20}

The dictionary is defined first.

Then call .replace():

In [14]:

df.replace(color,inplace=True)

replace() is also often used to replace NaN elements.

In [15]:

df.loc[1] = np.nan

In [16]:

v = {np.nan:0.1}

df.replace(v)

Out[16]:

	color	size
0	10.0	10.0
1	0.1	0.1
2	10.0	10.0
3	20.0	30.0
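A few variations on replace() may be useful. This is a sketch rather than the author's code; the frames coded, zeroed and with_nan are made up for illustration. A nested dict limits the replacement to one column, a list lets several old values share one new value, and fillna() is the more direct tool when NaN is the only target.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'color': ['red', 'white', 'red', 'green'],
                   'size': [10, 20, 10, 30]})

# Nested dict: {column: {old: new}} touches only that column
coded = df.replace({'color': {'red': 10, 'green': 20}})

# A list of old values can share a single replacement
zeroed = df.replace([10, 30], 0)

# For NaN specifically, fillna() does the same job as replace({np.nan: ...})
with_nan = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 4.0]})
filled = with_nan.fillna(0.1)

print(coded['color'].tolist())   # [10, 'white', 10, 20]
print(zeroed['size'].tolist())   # [0, 20, 0, 0]
print(filled.values.tolist())    # [[1.0, 0.1], [0.1, 4.0]]
```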

============================================

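Exercise 19:

Suppose the score table for Zhang San and Li Si contains full marks, which the teacher considers cheating. How would you set every full-mark score (both 150 and 300) to 0?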

============================================

2) map(): creating a new column

Use map() to generate a new column from an existing one.

It is suited to processing a single column.

In [24]:

df = DataFrame(np.random.randint(0,150,size = (4,4)),columns=['Python','Java','PHP','HTML'],

               index = ['张三','旭日','阳刚','木兰'])

df

Out[24]:

	Python	Java	PHP	HTML
张三	75	61	41	136
旭日	67	81	2	92
阳刚	100	120	77	64
木兰	138	82	93	93

In [25]:

#Go

# map() also takes a mapping; the new column is derived from an existing column

v = {75:90,67:100,100:166,138:55}

df['Go'] = df['Python'].map(v)

In [26]:

df

Out[26]:

	Python	Java	PHP	HTML	Go
张三	75	61	41	136	90
旭日	67	81	2	92	100
阳刚	100	120	77	64	166
木兰	138	82	93	93	55

Here too, the mapping is supplied as a dictionary.
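One detail worth knowing: when map() is given a dictionary, any value without a matching key becomes NaN. A minimal sketch follows; the index labels and the partial dictionary are hypothetical.

```python
import pandas as pd

s = pd.Series([75, 67, 100, 138], index=['a', 'b', 'c', 'd'])  # hypothetical labels

partial = {75: 90, 67: 100}   # no entries for 100 and 138
mapped = s.map(partial)

# Fall back to the original value wherever no mapping exists
mapped_with_default = s.map(partial).fillna(s)

print(mapped.isna().tolist())         # [False, False, True, True]
print(mapped_with_default.tolist())   # [90.0, 100.0, 100.0, 138.0]
```

The result is float because NaN forces the integer column to a floating-point dtype.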

map() also accepts a lambda function.

In [27]:

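# map() can modify an existing column, and it can also produce a new column

# map() accepts a lambda expression or a function you define yourself

# Functions such as sum cannot be used: map() loops over the column and passes one scalar at a time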

#C

df['C'] = df['Go'].map(lambda x : x - 40)

In [28]:

df

Out[28]:

	Python	Java	PHP	HTML	Go	C
张三	75	61	41	136	90	50
旭日	67	81	2	92	100	60
阳刚	100	120	77	64	166	126
木兰	138	82	93	93	55	15

In [29]:

def mp(x):

    # A more involved condition can go here

    if x < 51:

        return '不及格'  # "fail"

    else:

        return '优秀'  # "excellent"

In [30]:

df['score'] = df['C'].map(mp)

df

Out[30]:

	Python	Java	PHP	HTML	Go	C	score
张三	75	61	41	136	90	50	不及格
旭日	67	81	2	92	100	60	优秀
阳刚	100	120	77	64	166	126	优秀
木兰	138	82	93	93	55	15	不及格

In [31]:

#'int' object is not iterable

max(10)

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-31-e9dfad7007eb> in <module>

      1 #'int' object is not iterable

----> 2 max(10)

TypeError: 'int' object is not iterable

In [32]:

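# Typo: .pma should be .map, so this cell raises AttributeError instead.

# The intended call, df['C'].map(max), would fail like max(10) above,

# because map() hands max one scalar at a time.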

df['score2'] = df['C'].pma(max)

---------------------------------------------------------------------------AttributeError

AttributeError: 'Series' object has no attribute 'pma'

transform() is similar to map().

In [33]:

# transform() applies a rule to the values in bulk

# score and score2 come out identical

df['score2'] = df['C'].transform(mp)

In [34]:

df

Out[34]:

	Python	Java	PHP	HTML	Go	C	score	score2
张三	75	61	41	136	90	50	不及格	不及格
旭日	67	81	2	92	100	60	优秀	优秀
阳刚	100	120	77	64	166	126	优秀	优秀
木兰	138	82	93	93	55	15	不及格	不及格

Use map() to create a new column.

In [35]:

# map() can also overwrite the current column

df['C'] = df['C'].map(lambda x : x*2)

In [36]:

df

Out[36]:

	Python	Java	PHP	HTML	Go	C	score	score2
张三	75	61	41	136	90	100	不及格	不及格
旭日	67	81	2	92	100	120	优秀	优秀
阳刚	100	120	77	64	166	252	优秀	优秀
木兰	138	82	93	93	55	30	不及格	不及格

============================================

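Exercise 20:

    Add two new columns holding the grade status of Zhang San, Li Si and the others: "failed" if the score is below 90, "excellent" if it is above 120, and "pass" otherwise.

    [Hint] Pass a function as the argument to map().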

============================================

In [39]:

df3 = DataFrame(np.random.randint(0,150,size = (6,1)),index = ['张三','李四','王五','赵柳','Chales','凡凡'],columns = ['Python'])

df3

Out[39]:

	Python
张三	18
李四	119
王五	143
赵柳	39
Chales	44
凡凡	60

In [43]:

# Define a function that decides the grade status

# Its parameter is each individual value in one column of the DataFrame

def state(i):

    if i < 90:

        return 'failed'

    elif i > 120:

        return 'excellent'

    else:

        return 'pass'

In [46]:

df3['State'] = df3['Python'].map(state)

df3

Out[46]:

	Python	State
张三	18	failed
李四	119	pass
王五	143	excellent
赵柳	39	failed
Chales	44	failed
凡凡	60	failed
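For larger columns, the same three-way rule can be written without a Python-level function call per element. The sketch below uses np.select as one possible alternative, not the author's method; the scores are made up and include both boundary values (90 and 120) to show that they fall into "pass".

```python
import numpy as np
import pandas as pd

scores = pd.Series([18, 119, 143, 39, 90, 120])   # made-up scores

# Conditions are evaluated in order; the first one that matches wins
conditions = [scores < 90, scores > 120]
labels = ['failed', 'excellent']
state = pd.Series(np.select(conditions, labels, default='pass'),
                  index=scores.index)

print(state.tolist())
# ['failed', 'pass', 'excellent', 'failed', 'pass', 'pass']
```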

3) rename(): replacing index labels

Again, start by creating a dictionary.

In [55]:

df

Out[55]:

	Python	Java	PHP	HTML	Go	C	score	score2
张三	75	61	41	136	90	100	不及格	不及格
旭日	67	81	2	92	100	120	优秀	优秀
阳刚	100	120	77	64	166	252	优秀	优秀
木兰	138	82	93	93	55	30	不及格	不及格

In [56]:

def cols(x):

    if x == 'PHP':

        return 'php'

    if x == 'Python':

        return '大蟒蛇'

    else:

        return x

In [57]:

inds = {'张三':'Zhang Sir','木兰':'MissLan'}

# index, columns : scalar, list-like, dict-like or function, optional

#     Scalar or list-like will alter the ``Series.name`` attribute,

#     and raise on DataFrame or Panel.

#     dict-like or functions are transformations to apply to

#     that axis' values

df.rename(index = inds,columns=cols)

Out[57]:

	大蟒蛇	Java	php	HTML	Go	C	score	score2
Zhang Sir	75	61	41	136	90	100	不及格	不及格
旭日	67	81	2	92	100	120	优秀	优秀
阳刚	100	120	77	64	166	252	优秀	优秀
MissLan	138	82	93	93	55	30	不及格	不及格

Use rename() to replace row index labels (and, through the columns argument, column labels).
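As a compact sketch of the two argument styles (the small frame here is illustrative, not the one from the run above): a dict renames only the labels it mentions, and any callable, such as the built-in str.lower, is applied to every label. rename() returns a new object unless inplace=True is passed.

```python
import pandas as pd

df = pd.DataFrame({'Python': [75, 138], 'PHP': [41, 93]},
                  index=['张三', '木兰'])

# Dicts rename only the labels they mention; the rest are left alone
renamed = df.rename(index={'张三': 'Zhang Sir'}, columns={'PHP': 'php'})

# A callable is applied to every label on that axis
lowered = df.rename(columns=str.lower)

print(renamed.index.tolist())     # ['Zhang Sir', '木兰']
print(renamed.columns.tolist())   # ['Python', 'php']
print(lowered.columns.tolist())   # ['python', 'php']
print(df.columns.tolist())        # ['Python', 'PHP'] (original unchanged)
```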

3. Outlier detection and filtering

Use describe() to view descriptive statistics for each column.

In [85]:

df = DataFrame(np.random.randint(0,150,size = (6,3)),index = ['张三','李四','王五','赵柳','Chales','凡凡'],columns = ['Python','Java','HTML'])

display(df)

df.describe()

# count  number of non-null values

# mean   mean

# std    standard deviation

# min    minimum

# max    maximum

	Python	Java	HTML
张三	95	56	7
李四	64	59	38
王五	75	111	91
赵柳	55	99	2
Chales	127	80	34
凡凡	13	124	149

Out[85]:

	Python	Java	HTML
count	6.000000	6.000000	6.000000
mean	71.500000	88.166667	53.500000
std	38.459069	27.838223	56.500442
min	13.000000	56.000000	2.000000
25%	57.250000	64.250000	13.750000
50%	69.500000	89.500000	36.000000
75%	90.000000	108.000000	77.750000
max	127.000000	124.000000	149.000000

Use std() to obtain the standard deviation of each column of a DataFrame.

In [86]:

df.std()

Out[86]:

Python    38.459069

Java      27.838223

HTML      56.500442

dtype: float64

In [87]:

df.std(axis = 1)

Out[87]:

张三        44.094595

李四        13.796135

王五        18.036999

赵柳        48.569538

Chales    46.500896

凡凡        72.390147

dtype: float64

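Filter the DataFrame's elements based on each column's standard deviation.

any() tests whether at least one value is True: it returns True if one or more are True, and False otherwise.

Apply the filter condition to each column to remove values that deviate too far.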

In [88]:

cond = np.abs(df) > df.std()*4

cond

Out[88]:

	Python	Java	HTML
张三	False	False	False
李四	False	False	False
王五	False	False	False
赵柳	False	False	False
Chales	False	False	False
凡凡	False	True	False

In [89]:

any1 = cond.any(axis = 1)

df[any1]

Out[89]:

	Python	Java	HTML
凡凡	13	124	149

In [93]:

# (In practice) a value is treated as reliable if it is smaller than 4 times the standard deviation

cond1 = np.abs(df) < df.std()*4

all1 = cond1.all(axis = 1)

all1

Out[93]:

张三         True

李四         True

王五         True

赵柳         True

Chales     True

凡凡        False

dtype: bool
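The comparison above tests the raw value against a multiple of the standard deviation, which is most meaningful when the data are centred on zero. For data such as exam scores, a common variant measures the distance from the column mean instead. The sketch below uses the values printed in Out[85]; the threshold of 1.5 is an assumption chosen only because the sample is tiny.

```python
import pandas as pd

df = pd.DataFrame({'Python': [95, 64, 75, 55, 127, 13],
                   'Java': [56, 59, 111, 99, 80, 124],
                   'HTML': [7, 38, 91, 2, 34, 149]},
                  index=['张三', '李四', '王五', '赵柳', 'Chales', '凡凡'])

# Distance of each value from its column mean, in units of that column's std
z = (df - df.mean()) / df.std()

# Assumed cut-off for this small sample; 3 is a more usual choice on large data
threshold = 1.5
is_outlier = z.abs() > threshold

# Keep only the rows in which no column is flagged
clean = df[~is_outlier.any(axis=1)]

print(clean.index.tolist())
# ['张三', '李四', '王五', '赵柳', 'Chales']
```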

Delete specific labels with df.drop(labels,inplace = True)

In [95]:

df.drop(['HTML'],axis = 1,inplace = True)

df

Out[95]:

	Python	Java
张三	95	56
李四	64	59
王五	75	111
赵柳	55	99
Chales	127	80
凡凡	13	124

============================================

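Exercise 21:

Create a DataFrame of shape 10000*3 drawn from the standard normal distribution (np.random.randn), then remove every row in which any element's absolute value exceeds 3 standard deviations.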

============================================

In [96]:

n = np.random.randn(10000,3)

df = DataFrame(n)

df

Out[96]:

	0	1	2
0	-0.121184	0.219447	1.101524
1	-0.470456	1.859989	0.371916
2	0.821264	0.847836	-1.019667
3	-0.013778	1.213796	0.150453
4	-1.259598	0.213430	-2.966739
...	...	...	...
9995	-0.344166	-0.309993	2.177523
9996	1.042569	-1.171988	-0.134560
9997	-0.415962	0.899625	-0.660630
9998	-0.442747	0.666860	-0.208234
9999	-0.231901	0.850881	0.314572

10000 rows × 3 columns

In [97]:

cond = np.abs(df) >df.std()*3

cond

Out[97]:

	0	1	2
0	False	False	False
1	False	False	False
2	False	False	False
3	False	False	False
4	False	False	False
...	...	...	...
9995	False	False	False
9996	False	False	False
9997	False	False	False
9998	False	False	False
9999	False	False	False

10000 rows × 3 columns

In [98]:

drop_index = df[cond.any(axis = 1)].index

In [99]:

df2 = df.drop(drop_index)

In [100]:

df2.shape

Out[100]:

(9927, 3)

In [101]:

cond2 = np.abs(df2) > df.std()*3

In [102]:

cond2.any(axis = 1).sum()

Out[102]:

0

In [103]:

df2

Out[103]:

	0	1	2
0	-0.121184	0.219447	1.101524
1	-0.470456	1.859989	0.371916
2	0.821264	0.847836	-1.019667
3	-0.013778	1.213796	0.150453
4	-1.259598	0.213430	-2.966739
...	...	...	...
9995	-0.344166	-0.309993	2.177523
9996	1.042569	-1.171988	-0.134560
9997	-0.415962	0.899625	-0.660630
9998	-0.442747	0.666860	-0.208234
9999	-0.231901	0.850881	0.314572

9927 rows × 3 columns

In [104]:

# Mean of the row-wise standard deviations

row_std_mean = df2.std(axis = 1).mean()

In [105]:

cond3 = df2.std(axis = 1) > row_std_mean*2.5

In [106]:

# Filter out the rows whose standard deviation exceeds 2.5 times the mean row standard deviation

large_std_index = df2[cond3].index

In [107]:

df3 = df2.drop(large_std_index)

In [108]:

df3.shape

Out[108]:

(9877, 3)

4. Sorting

Use .take() to reorder rows by position.

np.random.permutation() can supply a random order.

In [110]:

df = DataFrame(np.random.randint(0,150,size = (4,4)),columns=['Python','Java','PHP','HTML'],

               index = ['张三','旭日','阳刚','木兰'])

df

Out[110]:

	Python	Java	PHP	HTML
张三	45	8	40	10
旭日	129	62	121	9
阳刚	94	77	80	26
木兰	51	40	120	18

In [111]:

df.take([3,2,0])

Out[111]:

	Python	Java	PHP	HTML
木兰	51	40	120	18
阳刚	94	77	80	26
张三	45	8	40	10

In [114]:

indices = np.random.permutation(4)

indices

Out[114]:

array([2, 0, 1, 3])

In [115]:

# The rows now come back in the shuffled order

df.take(indices)

Out[115]:

	Python	Java	PHP	HTML
阳刚	94	77	80	26
张三	45	8	40	10
旭日	129	62	121	9
木兰	51	40	120	18

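Random sampling

When the DataFrame is large enough, np.random.randint() combined with take() gives random sampling. Because randint() can return the same position more than once, this is sampling with replacement.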

In [116]:

df2 = DataFrame(np.random.randn(10000,3))

df2

Out[116]:

	0	1	2
0	-0.853997	-0.592047	0.281676
1	-0.398245	1.777191	1.290763
2	0.145525	-2.162814	0.505369
3	0.048881	-0.898379	0.828580
4	0.190132	-0.981742	-0.067704
...	...	...	...
9995	-0.310573	-0.833636	0.749073
9996	-0.180564	0.233861	-0.045255
9997	0.407484	1.194655	2.402484
9998	0.214967	-1.205329	0.731477
9999	-0.892452	2.126844	0.370416

10000 rows × 3 columns

In [117]:

indices = np.random.randint(0,10000,size = 10)

df2.take(indices)

Out[117]:

	0	1	2
2060	-0.697259	0.929964	0.579201
6866	-1.685287	-0.969844	1.979999
8946	-0.285639	0.431326	0.122082
4214	1.714924	0.543248	-0.555760
9033	-0.381230	0.439363	-0.904614
3093	-0.240366	-0.128198	-1.182337
8769	0.877627	0.690714	0.435345
6967	0.137017	-0.545800	-0.737800
8433	-0.245648	1.133199	-0.713866
5855	-1.069446	-0.552896	2.414125
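pandas also has a built-in sample() method that covers both cases above. A minimal sketch, with a small made-up frame and an arbitrary seed (random_state=0) so the run is repeatable:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape(10, 3))

# frac=1 returns every row exactly once, in random order (a shuffle)
shuffled = df.sample(frac=1, random_state=0)

# n=5 draws 5 distinct rows; rows repeat only if replace=True is passed
subset = df.sample(n=5, random_state=0)

print(sorted(shuffled.index) == list(df.index))   # True
print(subset.index.is_unique)                     # True
```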

============================================

Exercise 22:

   Suppose ddd2 holds the midterm scores of Zhang San, Li Si and Wang Laowu; put these three students in random order.

============================================

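5. Data aggregation [key topic]

Data aggregation is the final step of data processing; it usually reduces each array to a single value.

Processing data by category:

Split: first divide the data into groups
Apply: apply a function to each group's data to transform it
Combine: merge the results from the different groups

The core of this workflow is the groupby() function.

To compute the mean of price for each color, call groupby() with the color column as the key and then select the price column.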

In [118]:

# groupby() groups rows by one attribute or by several

df = DataFrame({'color':['red','white','red','cyan','cyan','green','white','cyan'],

                'price':np.random.randint(0,8,size = 8),

                'weight':np.random.randint(50,55,size = 8)})

df

Out[118]:

	color	price	weight
0	red	0	51
1	white	5	53
2	red	0	53
3	cyan	3	51
4	cyan	6	51
5	green	1	54
6	white	2	54
7	cyan	2	51

The .groups attribute shows which rows fall into each group.
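As a small sketch of what .groups contains (using the color and weight values from Out[118]; the variable name grouped is illustrative): it maps each group key to the index labels of the rows in that group, and size() gives the row count per group.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'white', 'red', 'cyan', 'cyan', 'green', 'white', 'cyan'],
                   'weight': [51, 53, 53, 51, 51, 54, 54, 51]})

grouped = df.groupby('color')

# .groups maps each key to the index labels of its rows
for key, labels in grouped.groups.items():
    print(key, labels.tolist())

# .size() counts the rows in each group
sizes = grouped.size().to_dict()
```

With this data, the cyan group holds rows 3, 4 and 7, which is why its weight sum is 153.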

In [119]:

# Group the data by color, then compute the sum and the mean within each group

df_sum_weight = df.groupby(['color'])[['weight']].sum()

df_price_mean = df.groupby(['color'])[['price']].mean()

In [120]:

df_sum_weight

Out[120]:

	weight
color
cyan	153
green	54
red	104
white	107

In [122]:

df_price_mean

Out[122]:

	price
color
cyan	3.666667
green	1.000000
red	0.000000
white	3.500000

In [123]:

# Combining results in pandas: concat / append; merge

In [124]:

pd.concat([df,df_sum_weight],axis=1)

Out[124]:

	color	price	weight	weight
0	red	0.0	51.0	NaN
1	white	5.0	53.0	NaN
2	red	0.0	53.0	NaN
3	cyan	3.0	51.0	NaN
4	cyan	6.0	51.0	NaN
5	green	1.0	54.0	NaN
6	white	2.0	54.0	NaN
7	cyan	2.0	51.0	NaN
cyan	NaN	NaN	NaN	153.0
green	NaN	NaN	NaN	54.0
red	NaN	NaN	NaN	104.0
white	NaN	NaN	NaN	107.0

In [125]:

type(df_sum_weight)

Out[125]:

pandas.core.frame.DataFrame

In [126]:

df_sum = df.merge(df_sum_weight,left_on='color',right_index=True,suffixes=['','_sum'])

In [127]:

# Merge in the mean price

df_r = df_sum.merge(df_price_mean,left_on='color',right_index=True,suffixes=['','_平均'])

In [128]:

df_r

Out[128]:

	color	price	weight	weight_sum	price_平均
0	red	0	51	104	0.000000
2	red	0	53	104	0.000000
1	white	5	53	107	3.500000
6	white	2	54	107	3.500000
3	cyan	3	51	153	3.666667
4	cyan	6	51	153	3.666667
7	cyan	2	51	153	3.666667
5	green	1	54	54	1.000000

In [129]:

# take() fetches rows by the positions passed in; it does not sort the result

In [130]:

df_r.index

Out[130]:

Int64Index([0, 2, 1, 6, 3, 4, 7, 5], dtype='int64')

In [131]:

df_r.take([2,3])

Out[131]:

	color	price	weight	weight_sum	price_平均
1	white	5	53	107	3.5
6	white	2	54	107	3.5

In [132]:

df_r.sort_index()

Out[132]:

	color	price	weight	weight_sum	price_平均
0	red	0	51	104	0.000000
1	white	5	53	107	3.500000
2	red	0	53	104	0.000000
3	cyan	3	51	153	3.666667
4	cyan	6	51	153	3.666667
5	green	1	54	54	1.000000
6	white	2	54	107	3.500000
7	cyan	2	51	153	3.666667

============================================

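Exercise 23:

Suppose Auntie Zhang sells vegetables at the market, with the following attributes:

Item (item): 萝卜 (radish), 白菜 (cabbage), 辣椒 (pepper), 冬瓜 (winter melon)

Color (color): white, green, red

Weight (weight)

Price (price)

Create a DataFrame ddd with these attributes as the column labels
Aggregate ddd to find the total price of the white items
Aggregate ddd to find the total weight of all radishes (white radish, carrot and green radish) and their average price
Use merge to combine the total weight and the average price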

============================================

In [134]:

df = DataFrame({'item':['萝卜','白菜','辣椒','冬瓜','萝卜','白菜','辣椒','冬瓜'],

               'color':['red','white','green','red','white','green','green','red'],

               'weight':np.random.randint(50,150,size = 8),

               'price':np.random.randint(1,4,size = 8)})

df

Out[134]:

	item	color	weight	price
0	萝卜	red	95	3
1	白菜	white	141	1
2	辣椒	green	73	2
3	冬瓜	red	81	3
4	萝卜	white	120	3
5	白菜	green	68	1
6	辣椒	green	121	1
7	冬瓜	red	83	3

In [135]:

df.groupby('color')['price'].sum()

Out[135]:

color

green    4

red      9

white    4

Name: price, dtype: int64

In [136]:

df.groupby('color')['price'].sum()['white']

Out[136]:

4

In [137]:

df.groupby('item')[['weight','price']].sum()

Out[137]:

	weight	price
item
冬瓜	164	6
白菜	209	2
萝卜	215	6
辣椒	194	3
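When different columns need different aggregations (for example total weight but average price, as Exercise 23 asks), agg() accepts a dict from column name to function name. A minimal sketch using the values from Out[134]; the variable name summary is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'item': ['萝卜', '白菜', '辣椒', '冬瓜', '萝卜', '白菜', '辣椒', '冬瓜'],
                   'weight': [95, 141, 73, 81, 120, 68, 121, 83],
                   'price': [3, 1, 2, 3, 3, 1, 1, 3]})

# One aggregation per column: sum the weights, average the prices
summary = df.groupby('item').agg({'weight': 'sum', 'price': 'mean'})

print(summary.loc['萝卜', 'weight'])   # 215
print(summary.loc['萝卜', 'price'])    # 3.0
```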

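6. Advanced data aggregation

pd.merge() can be used to attach the results of an aggregation to every row of df.
After grouping with groupby() and applying a function such as sum, you can then call add_prefix() to rename the resulting columns.

transform and apply can achieve the same thing.

Just pass a function to transform or apply.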
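The merge-plus-add_prefix() route described above can be sketched as follows. The data are the color, weight and price values from Out[134]; the prefix 'total_' and the variable names are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'white', 'green', 'red', 'white', 'green', 'green', 'red'],
                   'weight': [95, 141, 73, 81, 120, 68, 121, 83],
                   'price': [3, 1, 2, 3, 3, 1, 1, 3]})

# Aggregate per color, then prefix the columns so they do not clash with df's
totals = df.groupby('color')[['weight', 'price']].sum().add_prefix('total_')

# Attach each group's totals to its rows: df's color column is matched
# against the index of totals
merged = pd.merge(df, totals, left_on='color', right_index=True)

print(totals.columns.tolist())          # ['total_weight', 'total_price']
print(merged.loc[0, 'total_weight'])    # 259
```

Because the aggregated columns already carry a prefix, no suffixes argument is needed in the merge.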

In [138]:

sum([10])

Out[138]:

10

In [139]:

df['columns'] = df['color'].map()

---------------------------------------------------------------------------TypeError

TypeError: map() missing 1 required positional argument: 'arg'

In [140]:

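# Pass a function. Unlike map() earlier (which hands the function non-iterable scalars), transform() passes each group's values and shows the result on every row of that group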

df.groupby('color').transform(sum)

Out[140]:

	item	weight	price
0	萝卜冬瓜冬瓜	259	9
1	白菜萝卜	261	4
2	辣椒白菜辣椒	262	4
3	萝卜冬瓜冬瓜	259	9
4	白菜萝卜	261	4
5	辣椒白菜辣椒	262	4
6	辣椒白菜辣椒	262	4
7	萝卜冬瓜冬瓜	259	9

In [141]:

df

Out[141]:

	item	color	weight	price
0	萝卜	red	95	3
1	白菜	white	141	1
2	辣椒	green	73	2
3	冬瓜	red	81	3
4	萝卜	white	120	3
5	白菜	green	68	1
6	辣椒	green	121	1
7	冬瓜	red	83	3

In [142]:

df.groupby('color')[['price','weight']].apply(sum)

Out[142]:

	price	weight
color
green	4	262
red	9	259
white	4	261

In [143]:

def add_all(item):

    a = 0

    for i in item:

        a+=i

    return a

In [146]:

# The custom add_all function is equivalent to sum

df.groupby('item')['price'].apply(add_all)

Out[146]:

item

冬瓜    6

白菜    2

萝卜    6

辣椒    3

Name: price, dtype: int64

transform() and apply() also accept a user-defined function or a lambda.

df = DataFrame({'color':['white','black','white','white','black','black'], 'status':['up','up','down','down','down','up'], 'value1':[12.33,14.55,22.34,27.84,23.40,18.33], 'value2':[11.23,31.80,29.99,31.18,18.25,22.44]})

apply operates on whole columns: the argument passed to the lambda is the entire array of a column, not a single element.
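The DataFrame defined above is not used in the original run, so the following is only a sketch of how lambdas might be applied to it. With transform, the lambda receives one column of one group and the result is broadcast back to the original rows; with apply on a grouped selection, the lambda receives each group's sub-DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'color': ['white', 'black', 'white', 'white', 'black', 'black'],
                   'status': ['up', 'up', 'down', 'down', 'down', 'up'],
                   'value1': [12.33, 14.55, 22.34, 27.84, 23.40, 18.33],
                   'value2': [11.23, 31.80, 29.99, 31.18, 18.25, 22.44]})

values = df.groupby('color')[['value1', 'value2']]

# transform: one column of one group at a time; output keeps one row per input row
group_max = values.transform(lambda col: col.max())

# apply: one group's sub-DataFrame at a time; here each group reduces to one row
group_range = values.apply(lambda g: g.max() - g.min())

print(group_max.shape)     # (6, 2)
print(group_range.shape)   # (2, 2)
```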

============================================

Exercise 24:

Use transform and apply to implement the functionality of Exercise 23.

============================================

For reference and study only; reproduction is strictly prohibited!