ppt的代码加自己整理的笔记，并没有完全搞明白，后面应该会有更新，欢迎交流学习！

Introduction to computational social science

Time Series Analysis

SARIMAX

The Algorithm (Step by Step)

❑Import data

❑Pre-processing

❑Divide the dataset into train and test parts

❑Finding best parameter (auto ARIMA) based on the training part

❑Make the model (SARIMAX) based on the achieved best parameters and analyze its summary (evaluate model)

❑Make the prediction by the achieved model over the test part and calculate RMSE (to find the accuracy)

❑If the accuracy was acceptable, proceed with the model to predict for unseen data (beyond the dataset)

Import Data

import pandas as pd
df = pd.read_csv('Champagne.csv')
df

	Month	Sales
0	1964-01	2815
1	1964-02	2672
2	1964-03	2755
3	1964-04	2721
4	1964-05	2946
...	...	...
100	1972-05	4618
101	1972-06	5312
102	1972-07	4298
103	1972-08	1413
104	1972-09	5877

105 rows × 2 columns

Data Preprocessing

Names should be meaningful
NaN need to be dropped or imputed
To apply time series analysis, the index should be timestamp format

df.columns=["Month","Sales"]
df.head()

	Month	Sales
0	1964-01	2815
1	1964-02	2672
2	1964-03	2755
3	1964-04	2721
4	1964-05	2946

df.dropna()

	Month	Sales
0	1964-01	2815
1	1964-02	2672
2	1964-03	2755
3	1964-04	2721
4	1964-05	2946
...	...	...
100	1972-05	4618
101	1972-06	5312
102	1972-07	4298
103	1972-08	1413
104	1972-09	5877

105 rows × 2 columns

df['Month']=pd.to_datetime(df['Month'])
df.head()

	Month	Sales
0	1964-01-01	2815
1	1964-02-01	2672
2	1964-03-01	2755
3	1964-04-01	2721
4	1964-05-01	2946

df.set_index('Month',inplace=True)
df.head()

	Sales
Month
1964-01-01	2815
1964-02-01	2672
1964-03-01	2755
1964-04-01	2721
1964-05-01	2946

Divide the dataset into train and test parts

train–> find best parameters
test–> predict on it and calculate the acuracy. If the acuracy is acceptable, use the model to predict unseen data(这时用所有已知的 data(train+test) to train model，然后预测)

df.shape

(105, 1)

80% - 85% should be assigned to train

这里最后13个给了test，其它train

train = df.iloc[:-13]
test = df.iloc[-13:]
print("Train shape:", train.shape)
print("Test shape:", test.shape)
print(train)
test

Train shape: (92, 1)
Test shape: (13, 1)Sales
Month
1964-01-01   2815
1964-02-01   2672
1964-03-01   2755
1964-04-01   2721
1964-05-01   2946
...           ...
1971-04-01   4676
1971-05-01   5010
1971-06-01   4874
1971-07-01   4633
1971-08-01   1659[92 rows x 1 columns]

	Sales
Month
1971-09-01	5951
1971-10-01	6981
1971-11-01	9851
1971-12-01	12670
1972-01-01	4348
1972-02-01	3564
1972-03-01	4577
1972-04-01	4788
1972-05-01	4618
1972-06-01	5312
1972-07-01	4298
1972-08-01	1413
1972-09-01	5877

Finding best parameter

auto_arima 自动 detect best parameter

seasonal=True–>SARIMA

suppress_warnings=True–>因为warnings很烦

stepwise=True–>step by step

from pmdarima.arima import auto_arima
stepwise_fit=auto_arima(train['Sales'],m=12,seasonal=True,d=None,D=1,trace=True,error_action='ignore',suppress_warnings=True,stepwise=True)

Performing stepwise search to minimize aicARIMA(2,0,2)(1,1,1)[12] intercept   : AIC=1305.033, Time=0.49 secARIMA(0,0,0)(0,1,0)[12] intercept   : AIC=1300.270, Time=0.02 secARIMA(1,0,0)(1,1,0)[12] intercept   : AIC=1298.319, Time=0.14 secARIMA(0,0,1)(0,1,1)[12] intercept   : AIC=1299.128, Time=0.23 secARIMA(0,0,0)(0,1,0)[12]             : AIC=1309.721, Time=0.02 secARIMA(1,0,0)(0,1,0)[12] intercept   : AIC=1299.950, Time=0.02 secARIMA(1,0,0)(2,1,0)[12] intercept   : AIC=1298.706, Time=0.12 secARIMA(1,0,0)(1,1,1)[12] intercept   : AIC=1299.277, Time=0.36 secARIMA(1,0,0)(0,1,1)[12] intercept   : AIC=1299.289, Time=0.05 secARIMA(1,0,0)(2,1,1)[12] intercept   : AIC=1300.076, Time=0.46 secARIMA(0,0,0)(1,1,0)[12] intercept   : AIC=1300.359, Time=0.11 secARIMA(2,0,0)(1,1,0)[12] intercept   : AIC=1300.344, Time=0.16 secARIMA(1,0,1)(1,1,0)[12] intercept   : AIC=1300.399, Time=0.17 secARIMA(0,0,1)(1,1,0)[12] intercept   : AIC=1298.227, Time=0.19 secARIMA(0,0,1)(0,1,0)[12] intercept   : AIC=1300.004, Time=0.02 secARIMA(0,0,1)(2,1,0)[12] intercept   : AIC=1298.163, Time=0.24 secARIMA(0,0,1)(2,1,1)[12] intercept   : AIC=1299.854, Time=0.37 secARIMA(0,0,1)(1,1,1)[12] intercept   : AIC=1299.239, Time=0.34 secARIMA(0,0,0)(2,1,0)[12] intercept   : AIC=1300.543, Time=0.09 secARIMA(1,0,1)(2,1,0)[12] intercept   : AIC=1300.219, Time=0.40 secARIMA(0,0,2)(2,1,0)[12] intercept   : AIC=1300.184, Time=0.32 secARIMA(1,0,2)(2,1,0)[12] intercept   : AIC=1302.144, Time=0.50 secARIMA(0,0,1)(2,1,0)[12]             : AIC=1301.353, Time=0.19 secBest model:  ARIMA(0,0,1)(2,1,0)[12] intercept
Total fit time: 5.012 seconds

没有BIC，AIC is enough, AIC 越小越好
review：SARIMA(p,d,q)(P,D,Q)s
- p: AR 的 # of lags, PACF
- d: # of differencing
- q: MA 的 # of lags, ACF
- P: seasonality for AR
- D: seasonality for differencing
- Q: seasonality for MA
- s: # Of time steps for a single seasonal period

Evaluate model: analyze its summary

衡量刚刚得到的 best model–> 输入刚刚的参数, 用train parts fit model

from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(train['Sales'], order=(0,0,1), seasonal_order=(2,1,0,12),enforce_stationary=False,enforce_invertibility=False)
results = model.fit()
results.summary()

SARIMAX Results
Dep. Variable:	Sales	No. Observations:	92
Model:	SARIMAX(0, 0, 1)x(2, 1, [], 12)	Log Likelihood	-646.676
Date:	Fri, 28 Jan 2022	AIC	1301.353
Time:	18:37:07	BIC	1310.881
Sample:	01-01-1964	HQIC	1305.173
	- 08-01-1971
Covariance Type:	opg

	coef	std err	z	P>\|z\|	[0.025	0.975]
ma.L1	0.2411	0.096	2.500	0.012	0.052	0.430
ar.S.L12	-0.0456	0.072	-0.638	0.523	-0.186	0.095
ar.S.L24	0.2839	0.105	2.711	0.007	0.079	0.489
sigma2	5.868e+05	7.71e+04	7.607	0.000	4.36e+05	7.38e+05

Ljung-Box (L1) (Q):	0.03	Jarque-Bera (JB):	4.88
Prob(Q):	0.87	Prob(JB):	0.09
Heteroskedasticity (H):	1.86	Skew:	-0.02
Prob(H) (two-sided):	0.11	Kurtosis:	4.21

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

No. Observations: 92 --> train data
ma.L1–> (0,0,1) lags for MA is 1
ar.S.L12 ar.S.L24–> (2,1,0) lags of seasonality for AR is 2
- and because seasonaily = 12. Lags jump back twice
P>|z| if p-value<0.05, the coefficient to achieve is significant ->proceed with that
最下面那一坨是其它的衡量指标，可以用于比较SARIMA和其它类型的model

Find the accuracy: Prediction on Test Data

start=len(train)
end = len(train)+len(test)-1
df['forecast'] = results.predict(start=start,end=end,dynamic=True)
df['forecast']

Month
1964-01-01            NaN
1964-02-01            NaN
1964-03-01            NaN
1964-04-01            NaN
1964-05-01            NaN...
1972-05-01    4335.597877
1972-06-01    4637.333846
1972-07-01    4811.011815
1972-08-01    1639.045080
1972-09-01    5216.793461
Name: forecast, Length: 105, dtype: float64

import matplotlib.pyplot as plt
df[['Sales','forecast']].plot(figsize=(12,8))
plt.show()

test['Sales'].mean()

5711.384615384615

from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(df['forecast'].dropna(),test['Sales']))
rmse

643.8346721715318

RSME << Mean Test Data -> Model is good

Prediction (on Unseen Data)

用所有已知的数据训练模型来预测 unseen

model2 = SARIMAX(df['Sales'], order = (0,0,1), seasonal_order=(2,1,0,12),enforce_stationarity=False,enforce_invertibility=False)
results2 = model2.fit()
results2.summary()

SARIMAX Results
Dep. Variable:	Sales	No. Observations:	105
Model:	SARIMAX(0, 0, 1)x(2, 1, [], 12)	Log Likelihood	-559.547
Date:	Fri, 28 Jan 2022	AIC	1127.095
Time:	18:37:07	BIC	1136.031
Sample:	01-01-1964	HQIC	1130.640
	- 09-01-1972
Covariance Type:	opg

	coef	std err	z	P>\|z\|	[0.025	0.975]
ma.L1	0.3183	0.105	3.036	0.002	0.113	0.524
ar.S.L12	-0.2382	0.114	-2.086	0.037	-0.462	-0.014
ar.S.L24	0.1056	0.092	1.148	0.251	-0.075	0.286
sigma2	6.382e+05	9.14e+04	6.979	0.000	4.59e+05	8.17e+05

Ljung-Box (L1) (Q):	0.25	Jarque-Bera (JB):	3.51
Prob(Q):	0.62	Prob(JB):	0.17
Heteroskedasticity (H):	0.43	Skew:	-0.05
Prob(H) (two-sided):	0.05	Kurtosis:	4.10

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

ar.S.L24 的 p-value >0.05, but it’s inside the model, we can proceed with the model ？？？？？

df.tail()

	Sales	forecast
Month
1972-05-01	4618	4335.597877
1972-06-01	5312	4637.333846
1972-07-01	4298	4811.011815
1972-08-01	1413	1639.045080
1972-09-01	5877	5216.793461

from pandas.tseries.offsets import DateOffset
future_dates = [df.index[-1]+ DateOffset(months=x) for x in range(0,24)]
future_datest_df = pd.DataFrame(index=future_dates[1:],columns=df.columns)
future_datest_df

	Sales	forecast
1972-10-01	NaN	NaN
1972-11-01	NaN	NaN
1972-12-01	NaN	NaN
1973-01-01	NaN	NaN
1973-02-01	NaN	NaN
1973-03-01	NaN	NaN
1973-04-01	NaN	NaN
1973-05-01	NaN	NaN
1973-06-01	NaN	NaN
1973-07-01	NaN	NaN
1973-08-01	NaN	NaN
1973-09-01	NaN	NaN
1973-10-01	NaN	NaN
1973-11-01	NaN	NaN
1973-12-01	NaN	NaN
1974-01-01	NaN	NaN
1974-02-01	NaN	NaN
1974-03-01	NaN	NaN
1974-04-01	NaN	NaN
1974-05-01	NaN	NaN
1974-06-01	NaN	NaN
1974-07-01	NaN	NaN
1974-08-01	NaN	NaN

future_df = pd.concat([df,future_datest_df]) #合起来，好plot
future_df

	Sales	forecast
1964-01-01	2815	NaN
1964-02-01	2672	NaN
1964-03-01	2755	NaN
1964-04-01	2721	NaN
1964-05-01	2946	NaN
...	...	...
1974-04-01	NaN	NaN
1974-05-01	NaN	NaN
1974-06-01	NaN	NaN
1974-07-01	NaN	NaN
1974-08-01	NaN	NaN

128 rows × 2 columns

future_df['forecast']=results2.predict(start=104,end=128,dynamic=True)
future_df[['Sales','forecast']].plot(figsize=(12,8))
plt.show()

【SARIMAX】Champagne Sales相关推荐

【数据挖掘】数据挖掘总结 ( 拉普拉斯修正 | 贝叶斯分类器示例2 ) ★
文章目录一. 贝叶斯分类器分类的流程二. 拉普拉斯修正三. 贝叶斯分类器示例2 参考博客 : [数据挖掘]贝叶斯分类 ( 贝叶斯分类器 | 贝叶斯推断 | 逆向概率 | 贝叶斯公式 | 贝叶斯公 ...
【实验】利用系统自带脚本utlsampl.sql创建scott用户及样本数据
很多的演示程序都是以scott用户及其用户下的表做例子的,于是,快速的创建这个用户和初始化表中的数据是必须的.在Oracle 10g环境中这个过程很简便,只需要以sys用户执行一下$ORACLE_HO ...
【机器学习】机器学习模型解释神器：Shapash
什么是 Shapash 模型可解释性和可理解性一直是许多研究论文和开源项目的关注的重点.并且很多项目中都配备了数据专家和训练有素的专业人员.Shapash 是一个 Python 库,用于描述 AI 模 ...
【Python】利用 Python 分析了一波月饼，我得出的结论是？
作者:Cherich_sun 来源: https://blog.csdn.net/w_yuqing/article/details/119888532 本文为读者投稿马上八月十五了,又迎来了一年一度 ...
【SpringBoot】SpringBoot 操作 Excel 完整示例（含源码GitHub）
参考博客:博客园 - Spring Boot 操作 Excel 示例GitHub:spring-boot-study/spring-boot-study-excel/ 1.新建 Spring Boot ...
【sql】leetcode习题 (共 42 题)
[175]Combine Two Tables (2018年11月23日,开始集中review基础) Table: Person +-------------+---------+ | Column ...
【转载】DNN6开源CMS
DotNetNuke是?DotNetNuke (DNN) 资源下载DNN网站展示DNN视频教程 DotNetNuke, DNN技术及应用 DotNetNuke/DNN安装,汉化,教程,资源 http: ...
【英语学习】【WOTD】leviathan 释义/词源/示例
leviathan n. [luh-VYE-uh-thun] 大海怪,庞然大物 Definition 1 a often capitalized Leviathan : a sea monster d ...
【英语学习】【WOTD】charisma 释义/词源/示例
charisma n. [kuh-RIZ-muh] Definition 1: a personal magic of leadership arousing special popular loya ...

【SARIMAX】Champagne Sales