
CASE: demographic


The World Bank has estimates of the world population for the years 1950 up to 2100. The years are loaded in your workspace as a list called year, and the corresponding populations as a list called pop.

from matplotlib import pyplot as plt
plt.plot(year, pop)

Let’s start working on the data that professor Hans Rosling used to build his beautiful bubble chart. It was collected in 2007. Two lists are available for you:

  • life_exp, which contains the life expectancy for each country
  • gdp_cap, which contains the GDP per capita (i.e. per person) for each country, expressed in US dollars

plt.plot(gdp_cap, life_exp)

When you’re trying to assess whether two variables are correlated, a scatter plot is the better choice.

plt.scatter(gdp_cap, life_exp)
plt.xscale('log')  # with GDP per capita on a log scale, the correlation becomes much clearer
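
To put a number on what the log scale does, here is a minimal sketch that compares the correlation coefficient before and after taking the logarithm of GDP. The values in gdp_demo and life_demo are made up for illustration; they are not the real gdp_cap and life_exp lists.

```python
import numpy as np

# Hypothetical (GDP per capita, life expectancy) pairs, NOT the real data
gdp_demo = np.array([500, 1000, 2500, 5000, 10000, 20000, 40000])
life_demo = np.array([45.0, 52.0, 58.0, 66.0, 72.0, 77.0, 81.0])

# Pearson correlation on the raw scale and on the log10 scale
r_linear = np.corrcoef(gdp_demo, life_demo)[0, 1]
r_log = np.corrcoef(np.log10(gdp_demo), life_demo)[0, 1]

print(round(r_linear, 3), round(r_log, 3))
```

For data shaped like this, the coefficient on the log scale comes out closer to 1, which is why the relationship looks more like a straight line once xscale is set to 'log'.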

You saw that a higher GDP usually corresponds to a higher life expectancy. In other words, there is a positive correlation. Do you think there’s a relationship between population and life expectancy of a country?

import matplotlib.pyplot as plt
plt.scatter(pop, life_exp)

To see how life expectancy in different countries is distributed, let’s create a histogram of life_exp

import matplotlib.pyplot as plt
plt.hist(life_exp)  # histogram with the default number of bins

In the previous exercise, you didn’t specify the number of bins. By default, matplotlib sets the number of bins to 10 in that case. The number of bins is pretty important. Too few bins will oversimplify reality and won’t show you the details. Too many bins will overcomplicate reality and won’t show the bigger picture.

import matplotlib.pyplot as plt
plt.hist(life_exp, 5)
plt.clf()  # clear the current figure

import matplotlib.pyplot as plt
plt.hist(life_exp, 20)
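
To see the bin trade-off in numbers rather than in a picture, the sketch below counts values per bin with np.histogram, which is the function plt.hist relies on for its binning. The sample list is made up for illustration; it is not the real life_exp data.

```python
import numpy as np

# Hypothetical sample, NOT the real life_exp data
sample = [43.8, 50.7, 52.9, 59.4, 64.1, 70.3, 71.8,
          72.4, 74.2, 76.4, 78.3, 80.2, 81.2, 82.6]

for bins in (3, 5, 10):
    # counts has one entry per bin, edges has bins + 1 boundaries
    counts, edges = np.histogram(sample, bins=bins)
    print(bins, "bins ->", counts.tolist())
```

With few bins every count is large and the shape is coarse; with many bins the counts become small and some bins may be empty.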

Let’s do a similar comparison. life_exp contains life expectancy data for different countries in 2007. You also have access to a second list now, life_exp1950, containing similar data for 1950. Can you make a histogram for both datasets?

import matplotlib.pyplot as plt
plt.hist(life_exp, 15)
plt.clf()
plt.hist(life_exp1950, 15)

You’re going to work on the scatter plot with world development data: GDP per capita on the x-axis (logarithmic scale), life expectancy on the y-axis. As a first step, let’s add axis labels and a title to the plot.

import matplotlib.pyplot as plt
plt.scatter(gdp_cap, life_exp)
xlab = 'GDP per Capita [in USD]'
ylab = 'Life Expectancy [in years]'
title = 'World Development in 2007'
plt.xlabel(xlab)  # apply the axis labels and the title
plt.ylabel(ylab)
plt.title(title)

Let’s customize the x-axis of your world development chart with the xticks() function. The tick values 1000, 10000 and 100000 should be replaced by 1k, 10k and 100k.

import matplotlib.pyplot as plt
plt.scatter(gdp_cap, life_exp)
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
tick_val = [1000, 10000, 100000]
tick_lab = ['1k', '10k', '100k']
plt.xticks(tick_val, tick_lab)

Right now, the scatter plot is just a cloud of blue dots, indistinguishable from each other. Let’s change this. Wouldn’t it be nice if the size of the dots corresponds to the population?

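import numpy as np
import matplotlib.pyplot as plt
np_pop = np.array(pop)  # store pop as a NumPy array: np_pop
np_pop = np_pop * 2  # double every value so the bubbles are larger
plt.scatter(gdp_cap, life_exp, s = np_pop)  # set the size argument to np_pop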
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000, 10000, 100000],['1k', '10k', '100k'])
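
The conversion to a NumPy array matters here: multiplying a plain list by 2 repeats the list, whereas multiplying an array doubles each element. A small sketch with made-up values (pop_demo is hypothetical, not the real pop list):

```python
import numpy as np

pop_demo = [31.9, 3.6, 33.3]  # hypothetical populations, in millions

doubled_list = pop_demo * 2             # list repetition: six elements
doubled_array = np.array(pop_demo) * 2  # element-wise doubling: three elements

print(len(doubled_list), doubled_list)
print(len(doubled_array), doubled_array)
```

Passing the repeated list as s would give matplotlib a size sequence that no longer matches the number of points, so the array version is the one that fits scatter().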

The next step is making the plot more colorful! To do this, a list col has been created for you. It’s a list with a color for each corresponding country, depending on the continent the country is part of.

continent_colors = {'Asia':'red','Europe':'green','Africa':'blue','Americas':'yellow','Oceania':'black'}  # continent-to-color mapping
plt.scatter(x = gdp_cap, y = life_exp, s = np.array(pop) * 2, c = col, alpha = 0.8)
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000,10000,100000], ['1k','10k','100k'])
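
One way a list like col could be derived is a dictionary lookup per country. The sketch below assumes a hypothetical continents list with one continent name per country; only the color mapping itself comes from the notes above.

```python
continent_colors = {
    'Asia': 'red', 'Europe': 'green', 'Africa': 'blue',
    'Americas': 'yellow', 'Oceania': 'black',
}

# Hypothetical input: one continent per country, in the same order as gdp_cap
continents = ['Asia', 'Europe', 'Africa', 'Americas', 'Asia']

# Look up the color for each country's continent
col_demo = [continent_colors[c] for c in continents]
print(col_demo)
```

The resulting list has one color per country, which is the shape that the c argument of scatter() expects.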

Additional customizations and gridlines.

plt.scatter(x = gdp_cap, y = life_exp, s = np.array(pop) * 2, c = col, alpha = 0.8)
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000,10000,100000], ['1k','10k','100k'])
plt.text(1550, 71, 'India')
plt.text(5700, 80, 'China')
plt.grid(True)  # add gridlines



countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']
ind_ger = countries.index('germany')  # index of 'germany'
print(capitals[ind_ger])
output:
berlin


countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']
europe = { 'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }
print(europe)
output:
{'spain': 'madrid', 'germany': 'berlin', 'norway': 'oslo', 'france': 'paris'}


europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }
print(europe.keys())
print(europe['norway'])
output:
dict_keys(['spain', 'germany', 'norway', 'france'])
oslo


europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }
europe['italy'] = 'rome'
print('italy' in europe)
europe['poland'] = 'warsaw'
print(europe)
output:
True
{'spain': 'madrid', 'germany': 'berlin', 'italy': 'rome', 'norway': 'oslo', 'france': 'paris', 'poland': 'warsaw'}


europe = {'spain':'madrid', 'france':'paris', 'germany':'bonn','norway':'oslo', 'italy':'rome', 'poland':'warsaw','australia':'vienna' }
europe['germany'] = 'berlin'
del europe['australia']
print(europe)
output:
{'spain': 'madrid', 'germany': 'berlin', 'italy': 'rome', 'norway': 'oslo', 'france': 'paris', 'poland': 'warsaw'}


europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
           'france': { 'capital':'paris', 'population':66.03 },
           'germany': { 'capital':'berlin', 'population':80.62 },
           'norway': { 'capital':'oslo', 'population':5.084 } }
print(europe['france'])
data = {'capital':'rome', 'population':59.83}  # information to add for Italy
europe['italy'] = data
print(europe)
output:
{'capital': 'paris', 'population': 66.03}
{'spain': {'capital': 'madrid', 'population': 46.77}, 'germany': {'capital': 'berlin', 'population': 80.62}, 'italy': {'capital': 'rome', 'population': 59.83}, 'norway': {'capital': 'oslo', 'population': 5.084}, 'france': {'capital': 'paris', 'population': 66.03}}
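
A dictionary of dictionaries is read by chaining square brackets. The following is a minimal sketch on a two-country excerpt of the data above; using .get() for a missing key is just one possible way to avoid a KeyError.

```python
europe = {
    'spain': {'capital': 'madrid', 'population': 46.77},
    'france': {'capital': 'paris', 'population': 66.03},
}

# First bracket selects the country, second bracket selects the field
french_capital = europe['france']['capital']
print(french_capital)

# 'italy' is not in this excerpt: .get() returns None instead of raising
italy_info = europe.get('italy')
print(italy_info)
```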


The DataFrame is one of Pandas’ most important data structures. It’s basically a way to store tabular data where you can label the rows and the columns. One way to build a DataFrame is from a dictionary.

import pandas as pd
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
my_dict = {'country':names, 'drives_right':dr, 'cars_per_cap':cpc}  # create the dictionary
cars = pd.DataFrame(my_dict)  # build a DataFrame from it
print(cars)
output:
   cars_per_cap        country  drives_right
0           809  United States          True
1           731      Australia         False
2           588          Japan         False
...


import pandas as pd
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
cars_dict = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
cars = pd.DataFrame(cars_dict)
row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']
cars.index = row_labels  # set the row labels
print(cars)
output:
     cars_per_cap        country  drives_right
US            809  United States          True
AUS           731      Australia         False
JPN           588          Japan         False
...

Putting data in a dictionary and then building a DataFrame works, but it’s not very efficient. What if you’re dealing with millions of observations? In those cases, the data is typically available as files with a regular structure. One of those file types is the CSV file, which is short for “comma-separated values”.

import pandas as pd
cars = pd.read_csv('cars.csv')
print(cars)
output:
   Unnamed: 0  cars_per_cap        country  drives_right
0          US           809  United States          True
1         AUS           731      Australia         False
2         JPN           588          Japan         False
...

Your read_csv() call to import the CSV data didn’t generate an error, but the output is not entirely what we wanted. The row labels were imported as another column without a name.

import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)
print(cars)
output:
     cars_per_cap        country  drives_right
US            809  United States          True
AUS           731      Australia         False
JPN           588          Japan         False
...
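
The effect of index_col can be reproduced without a file on disk. The sketch below reads a three-row stand-in for cars.csv from an in-memory string; the exact header layout of the real file (an empty first field) is an assumption inferred from the 'Unnamed: 0' column seen above.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for cars.csv
csv_text = """,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JPN,588,Japan,False
"""

raw = pd.read_csv(io.StringIO(csv_text))                   # no index_col
labeled = pd.read_csv(io.StringIO(csv_text), index_col=0)  # first column as index

print(raw.columns.tolist())    # the label column appears as 'Unnamed: 0'
print(labeled.index.tolist())  # the labels now form the row index
```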

Square Brackets


import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
print(cars['country'])  # prints a Pandas Series
output:
US     United States
AUS        Australia
JPN            Japan
...
Name: country, dtype: object

print(cars[['country']])  # prints a Pandas DataFrame
output:
           country
US   United States
AUS      Australia
JPN          Japan
...

print(cars[['country', 'drives_right']])
output:
           country  drives_right
US   United States          True
AUS      Australia         False
JPN          Japan         False
...


import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
print(cars[0:3])  # first three rows
print(cars[3:6])  # fourth, fifth and sixth rows
output:
     cars_per_cap        country  drives_right
US            809  United States          True
AUS           731      Australia         False
JPN           588          Japan         False
     cars_per_cap  country  drives_right
IN             18    India         False
RU            200   Russia          True
MOR            70  Morocco          True

loc and iloc

With loc and iloc you can do practically any data selection operation on DataFrames you can think of. loc is label-based, which means that you have to specify rows and columns based on their row and column labels. iloc is integer index based, so you have to specify rows and columns by their integer index like you did in the previous exercise.

import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
print(cars.loc[['JPN']])  # print the row for Japan as a DataFrame
output:
     cars_per_cap country  drives_right
JPN           588   Japan         False

print(cars.iloc[2])  # the same row by position, as a Series
output:
cars_per_cap      588
country         Japan
drives_right    False
Name: JPN, dtype: object

print(cars.loc[['AUS', 'EG']])
output:
     cars_per_cap    country  drives_right
AUS           731  Australia         False
EG             45      Egypt          True

loc and iloc also allow you to select both rows and columns from a DataFrame.

import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
print(cars.loc['MOR', 'drives_right'])  # print the drives_right value for Morocco
print(cars.loc[['RU', 'MOR'], ['country', 'drives_right']])
output:
True
     country  drives_right
RU    Russia          True
MOR  Morocco          True

It’s also possible to select only columns with loc and iloc. In both cases, you simply put a slice going from beginning to end in front of the comma.

import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
print(cars.loc[:, 'drives_right'])
output:
US      True
AUS    False
JPN    False
...
Name: drives_right, dtype: bool

print(cars.loc[:, ['drives_right']])
output:
     drives_right
US           True
AUS         False
JPN         False
...

print(cars.loc[:, ['cars_per_cap', 'drives_right']])
output:
     cars_per_cap  drives_right
US            809          True
AUS           731         False
JPN           588         False
...
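
The examples above only use loc. For completeness, here is a sketch of the same column selection with iloc, on a three-row stand-in for the cars DataFrame (cars_demo is a hypothetical name; the values are taken from the output above).

```python
import pandas as pd

# Small stand-in for the cars DataFrame
cars_demo = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588],
     'country': ['United States', 'Australia', 'Japan'],
     'drives_right': [True, False, False]},
    index=['US', 'AUS', 'JPN'],
)

by_label = cars_demo.loc[:, ['cars_per_cap', 'drives_right']]
by_position = cars_demo.iloc[:, [0, 2]]  # all rows, columns at positions 0 and 2

print(by_label.equals(by_position))
```

Note that the positions passed to iloc depend on the column order of the DataFrame, so the label-based version is usually easier to read back later.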


Boolean operators with Numpy

import numpy as np
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])
print(np.logical_or(my_house > 18.5, my_house < 10))
print(np.logical_and(my_house < 11, your_house < 11))
output:
[False  True False  True]
[False False False  True]
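
As an aside, NumPy boolean arrays also support the | and & operators, which give the same result as np.logical_or() and np.logical_and(). A minimal sketch on the same two arrays:

```python
import numpy as np

my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])

# Parentheses are required: | and & bind tighter than the comparisons
either = (my_house > 18.5) | (my_house < 10)
both = (my_house < 11) & (your_house < 11)

print(either)
print(both)
```

The Python keywords and / or do not work here, because they try to reduce a whole array to a single True or False.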

Control Flow


room = "kit"
area = 14.0
if room == "kit" :
    print("looking around in the kitchen.")
if area > 15:
    print("big place!")
output:
looking around in the kitchen.


room = "kit"
area = 14.0
if room == "kit" :
    print("looking around in the kitchen.")
else :
    print("looking around elsewhere.")
if area > 15 :
    print("big place!")
else:
    print("pretty small.")
output:
looking around in the kitchen.
pretty small.


room = "bed"
area = 14.0
if room == "kit" :
    print("looking around in the kitchen.")
elif room == "bed":
    print("looking around in the bedroom.")
else :
    print("looking around elsewhere.")
if area > 15 :
    print("big place!")
elif area > 10:
    print("medium size, nice!")
else :
    print("pretty small.")
output:
looking around in the bedroom.
medium size, nice!


Select the rows where drives_right is True:

import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
dr = cars['drives_right']  # extract drives_right as a boolean Series
sel = cars.loc[dr]  # keep only the rows where dr is True
print(sel)
output:
     cars_per_cap        country  drives_right
US            809  United States          True
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True


import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
sel = cars[cars['drives_right']]

This time you want to find out which countries have a high cars per capita figure.

import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
cpc = cars['cars_per_cap']
many_cars = cpc > 500
car_maniac = cars[many_cars]
print(car_maniac)
output:
     cars_per_cap        country  drives_right
US            809  United States          True
AUS           731      Australia         False
JPN           588          Japan         False


import pandas as pd
import numpy as np
cars = pd.read_csv('cars.csv', index_col = 0)
cpc = cars['cars_per_cap']
between = np.logical_and(cpc > 100, cpc < 500)
medium = cars[between]
print(medium)
output:
    cars_per_cap country  drives_right
RU           200  Russia          True



offset = 3
while offset != 0:
    print("correcting...")
    offset = offset - 1
    print(offset)
output:
correcting...
2
correcting...
1
correcting...
0


offset = -3
while offset != 0 :
    print("correcting...")
    if offset > 0 :
        offset = offset - 1
    else :
        offset = offset + 1
    print(offset)
output:
correcting...
-2
correcting...
-1
correcting...
0


areas = [11.25, 18.0, 20.0, 10.75, 9.50]
for element in areas:
    print(element)
output:
11.25
18.0
20.0
10.75
9.5

Using a for loop to iterate over a list only gives you access to every list element in each run, one after the other. If you also want to access the index information, so where the list element you’re iterating over is located, you can use enumerate().

areas = [11.25, 18.0, 20.0, 10.75, 9.50]
for index, a in enumerate(areas) :
    print("room " + str(index) + ": " + str(a))
output:
room 0: 11.25
room 1: 18.0
room 2: 20.0
room 3: 10.75
room 4: 9.5

For non-programmer folks, room 0: 11.25 is strange. Wouldn’t it be better if the count started at 1?

areas = [11.25, 18.0, 20.0, 10.75, 9.50]
for index, area in enumerate(areas) :
    print("room " + str(index+1) + ": " + str(area))
output:
room 1: 11.25
room 2: 18.0
room 3: 20.0
room 4: 10.75
room 5: 9.5
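
An alternative to adding 1 inside the loop is the start argument of enumerate(), which shifts the counter itself. A minimal sketch (the lines list is only there to collect the text before printing):

```python
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

lines = []
# start=1 makes the counter begin at 1 instead of 0
for number, area in enumerate(areas, start=1):
    lines.append("room " + str(number) + ": " + str(area))

print("\n".join(lines))
```

Both versions print the same text; the start argument simply keeps the arithmetic out of the loop body.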


house = [["hallway", 11.25], ["kitchen", 18.0], ["living room", 20.0], ["bedroom", 10.75], ["bathroom", 9.50]]
for x, y in house:
    print("the " + str(x) + " is " + str(y) + " sqm")
output:
the hallway is 11.25 sqm
the kitchen is 18.0 sqm
the living room is 20.0 sqm
the bedroom is 10.75 sqm
the bathroom is 9.5 sqm


europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin','norway':'oslo', 'italy':'rome', 'poland':'warsaw', 'austria':'vienna' }
for key, value in europe.items():
    print("the capital of " + key + " is " + str(value))
output:
the capital of austria is vienna
the capital of norway is oslo
the capital of spain is madrid
...


import numpy as np
for x in np_height:  # iterate over a 1D array
    print(str(x) + " inches")
output:
74 inches
74 inches
72 inches
...

import numpy as np
for x in np.nditer(np_baseball):  # iterate over every element of an array with two or more dimensions
    print(x)


import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
for lab, row in cars.iterrows():
    print(lab)
    print(row)
output:
US
cars_per_cap              809
country         United States
drives_right             True
Name: US, dtype: object
...

The row data that’s generated by iterrows() on every run is a Pandas Series. This format is not very convenient to print out. Luckily, you can easily select variables from the Pandas Series using square brackets.

import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
for lab, row in cars.iterrows() :
    print(lab + ": " + str(row['cars_per_cap']))
output:
US: 809
AUS: 731
JPN: 588
...


import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
for lab, row in cars.iterrows():  # loop that adds the COUNTRY column
    cars.loc[lab, "COUNTRY"] = row['country'].upper()
print(cars)
output:
     cars_per_cap        country  drives_right        COUNTRY
US            809  United States          True  UNITED STATES
AUS           731      Australia         False      AUSTRALIA
JPN           588          Japan         False          JAPAN
...

If you want to add a column to a DataFrame by calling a function on another column, the iterrows() method in combination with a for loop is not the preferred way to go. Instead, you’ll want to use apply().

import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
cars["COUNTRY"] = cars["country"].apply(str.upper)  # use .apply(str.upper); no for loop needed
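
For string columns there is a third option besides iterrows() and apply(): the .str accessor, which applies string methods to the whole column at once. This sketch uses a hypothetical three-row cars_demo frame rather than cars.csv and checks that both approaches agree.

```python
import pandas as pd

# Small stand-in for the cars DataFrame (country column only)
cars_demo = pd.DataFrame(
    {'country': ['United States', 'Australia', 'Japan']},
    index=['US', 'AUS', 'JPN'],
)

via_apply = cars_demo['country'].apply(str.upper)  # function applied per element
via_str = cars_demo['country'].str.upper()         # string accessor on the column

print(via_apply.equals(via_str))
cars_demo['COUNTRY'] = via_str
print(cars_demo)
```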

Case: Hacker Statistics

Randomness has many uses in science, art, statistics, cryptography, gaming, gambling, and other fields. You’re going to use randomness to simulate a game.

All the functionality you need is contained in the random package, a sub-package of numpy. In this exercise, you’ll be using two functions from this package:

  • seed(): sets the random seed, so that your results are reproducible between simulations. As an argument, it takes an integer of your choosing. If you call the function, no output will be generated.
  • rand(): if you don’t specify any arguments, it generates a random float between zero and one.

import numpy as np
np.random.seed(123) # Set the seed
print(np.random.rand())
output:
0.6964691855978616
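
What "reproducible" means in practice can be checked directly: reseeding with the same integer replays the same sequence of numbers. A minimal sketch (the names first and second are only for illustration):

```python
import numpy as np

np.random.seed(123)
first = np.random.rand(3)   # three random floats

np.random.seed(123)         # reseed with the same integer
second = np.random.rand(3)  # the same three floats again

print(np.array_equal(first, second))
```

Without the second call to seed(), the generator would simply continue and second would hold three different numbers.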


import numpy as np
np.random.seed(123)  # fix the seed so the run is reproducible
print(np.random.randint(1, 7))  # use randint() to simulate a die
print(np.random.randint(1, 7))
output:
6
3


import numpy as np
np.random.seed(123)  # fix the seed so the run is reproducible
step = 50
dice = np.random.randint(1, 7)
if dice <= 2 :
    step = step - 1
elif dice <= 5 :
    step = step + 1
else :
    step = step + np.random.randint(1,7)
print(dice)
print(step)
output:
6
53

You have already written Python code that determines the next step based on the previous one. Now it’s time to put this code inside a for loop so that we can simulate a random walk.

import numpy as np
np.random.seed(123)  # fix the seed so the run is reproducible
random_walk = [0]
for x in range(100) :
    step = random_walk[-1]  # last element of random_walk
    dice = np.random.randint(1,7)
    if dice <= 2:
        step = step - 1
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)
    random_walk.append(step)  # append the next step to random_walk
print(random_walk)
output:
[0, 3, 4, 5, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1, 0, -1, ..., 57, 58, 59]

Things are shaping up nicely! You already have code that calculates your location in the Empire State Building after 100 dice throws. However, there’s something we haven’t thought about - you can’t go below 0!

import numpy as np
np.random.seed(123)  # fix the seed so the run is reproducible
random_walk = [0]
for x in range(100) :
    step = random_walk[-1]
    dice = np.random.randint(1,7)
    if dice <= 2:
        step = max(0, step - 1)  # max() makes sure step never drops below 0
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)
    random_walk.append(step)
print(random_walk)
output:
[0, 3, 4, 5, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1, 0, 0, ..., 58, 59, 60]


import numpy as np
import matplotlib.pyplot as plt
random_walk = [0]
for x in range(100) :
    step = random_walk[-1]
    dice = np.random.randint(1,7)
    if dice <= 2:
        step = max(0, step - 1)
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)
    random_walk.append(step)
plt.plot(random_walk)  # visualize the walk
plt.show()

A single random walk is one thing, but that doesn’t tell you if you have a good chance at winning the bet. To get an idea about how big your chances are of reaching 60 steps, you can repeatedly simulate the random walk and collect the results. That’s exactly what you’ll do in this exercise.

import numpy as np
import matplotlib.pyplot as plt
all_walks = []
for i in range(10) :  # simulate the random walk 10 times
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        random_walk.append(step)
    all_walks.append(random_walk)
np_aw = np.array(all_walks)
np_aw_t = np.transpose(np_aw)  # transpose np_aw so that each column is one walk
plt.plot(np_aw_t)  # one line per walk
plt.show()

You’re a bit clumsy and you have a 0.1% chance of falling down. That calls for another random number generation. Basically, you can generate a random float between 0 and 1. If this value is less than or equal to 0.001, you should reset step to 0.

import numpy as np
import matplotlib.pyplot as plt
all_walks = []
for i in range(250) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        if np.random.rand() <= 0.001 :  # implement clumsiness
            step = 0
        random_walk.append(step)
    all_walks.append(random_walk)
np_aw_t = np.transpose(np.array(all_walks))
plt.plot(np_aw_t)  # one line per walk
plt.show()

All these fancy visualizations have put us on a sidetrack. We still have to solve the million-dollar problem: What are the odds that you’ll reach 60 steps high on the Empire State Building?

Basically, you want to know about the end points of all the random walks you’ve simulated. These end points have a certain distribution that you can visualize with a histogram.

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(123)  # fix the seed so the run is reproducible
all_walks = []
for i in range(500) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        if np.random.rand() <= 0.001 :
            step = 0
        random_walk.append(step)
    all_walks.append(random_walk)
np_aw_t = np.transpose(np.array(all_walks))
ends = np_aw_t[-1,:]  # last row of np_aw_t: the end point of every walk
plt.hist(ends)  # distribution of the end points
plt.show()

print(np.mean(ends >= 60))  # share of walks that end at step 60 or higher
output:
0.784
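
The repeated simulation code can also be wrapped in a function, which keeps the estimate itself to a couple of lines. This is a sketch under the same rules as above; simulate_walk, n_throws, p_fall and ends_demo are hypothetical names introduced here, not part of the original exercise.

```python
import numpy as np

def simulate_walk(n_throws=100, p_fall=0.001):
    """Return the final step of one random walk (hypothetical helper)."""
    step = 0
    for _ in range(n_throws):
        dice = np.random.randint(1, 7)
        if dice <= 2:
            step = max(0, step - 1)  # never go below step 0
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1, 7)
        if np.random.rand() <= p_fall:  # clumsiness: fall back to step 0
            step = 0
    return step

np.random.seed(123)  # fix the seed so the run is reproducible
ends_demo = np.array([simulate_walk() for _ in range(500)])

# ends_demo >= 60 is a boolean array; its mean is the share of True values
chance = np.mean(ends_demo >= 60)
print(chance)
```

Taking the mean of a boolean array works because True counts as 1 and False as 0, so the result is the fraction of simulated walks that reached step 60.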

