ppt的代码加自己整理的笔记,并没有完全搞明白,后面应该会有更新,欢迎交流学习!

Introduction to computational social science

Time Series Analysis

SARIMAX

The Algorithm (Step by Step)

❑Import data

❑Pre-processing

❑Divide the dataset into train and test parts

❑Finding best parameter (auto ARIMA) based on the training part

❑Make the model (SARIMAX) based on the achieved best parameters and analyze its summary (evaluate model)

❑Make the prediction by the achieved model over the test part and calculate RMSE (to find the accuracy)

❑If the accuracy was acceptable, proceed with the model to predict for unseen data (beyond the dataset)

Import Data

import pandas as pd
df = pd.read_csv('Champagne.csv')
df
Month Sales
0 1964-01 2815
1 1964-02 2672
2 1964-03 2755
3 1964-04 2721
4 1964-05 2946
... ... ...
100 1972-05 4618
101 1972-06 5312
102 1972-07 4298
103 1972-08 1413
104 1972-09 5877

105 rows × 2 columns

Data Preprocessing

  1. Names should be meaningful
  2. NaN need to be dropped or imputed
  3. To apply time series analysis, the index should be timestamp format
df.columns=["Month","Sales"]
df.head()
Month Sales
0 1964-01 2815
1 1964-02 2672
2 1964-03 2755
3 1964-04 2721
4 1964-05 2946
df.dropna()
Month Sales
0 1964-01 2815
1 1964-02 2672
2 1964-03 2755
3 1964-04 2721
4 1964-05 2946
... ... ...
100 1972-05 4618
101 1972-06 5312
102 1972-07 4298
103 1972-08 1413
104 1972-09 5877

105 rows × 2 columns

df['Month']=pd.to_datetime(df['Month'])
df.head()
Month Sales
0 1964-01-01 2815
1 1964-02-01 2672
2 1964-03-01 2755
3 1964-04-01 2721
4 1964-05-01 2946
df.set_index('Month',inplace=True)
df.head()
Sales
Month
1964-01-01 2815
1964-02-01 2672
1964-03-01 2755
1964-04-01 2721
1964-05-01 2946

Divide the dataset into train and test parts

  1. train–> find best parameters
  2. test–> predict on it and calculate the acuracy. If the acuracy is acceptable, use the model to predict unseen data(这时用所有已知的 data(train+test) to train model,然后预测)
df.shape
(105, 1)

80% - 85% should be assigned to train

这里最后13个给了test,其它train

train = df.iloc[:-13]
test = df.iloc[-13:]
print("Train shape:", train.shape)
print("Test shape:", test.shape)
print(train)
test
Train shape: (92, 1)
Test shape: (13, 1)Sales
Month
1964-01-01   2815
1964-02-01   2672
1964-03-01   2755
1964-04-01   2721
1964-05-01   2946
...           ...
1971-04-01   4676
1971-05-01   5010
1971-06-01   4874
1971-07-01   4633
1971-08-01   1659[92 rows x 1 columns]
Sales
Month
1971-09-01 5951
1971-10-01 6981
1971-11-01 9851
1971-12-01 12670
1972-01-01 4348
1972-02-01 3564
1972-03-01 4577
1972-04-01 4788
1972-05-01 4618
1972-06-01 5312
1972-07-01 4298
1972-08-01 1413
1972-09-01 5877

Finding best parameter

auto_arima 自动 detect best parameter

seasonal=True–>SARIMA

suppress_warnings=True–>因为warnings很烦

stepwise=True–>step by step

from pmdarima.arima import auto_arima
stepwise_fit=auto_arima(train['Sales'],m=12,seasonal=True,d=None,D=1,trace=True,error_action='ignore',suppress_warnings=True,stepwise=True)
Performing stepwise search to minimize aicARIMA(2,0,2)(1,1,1)[12] intercept   : AIC=1305.033, Time=0.49 secARIMA(0,0,0)(0,1,0)[12] intercept   : AIC=1300.270, Time=0.02 secARIMA(1,0,0)(1,1,0)[12] intercept   : AIC=1298.319, Time=0.14 secARIMA(0,0,1)(0,1,1)[12] intercept   : AIC=1299.128, Time=0.23 secARIMA(0,0,0)(0,1,0)[12]             : AIC=1309.721, Time=0.02 secARIMA(1,0,0)(0,1,0)[12] intercept   : AIC=1299.950, Time=0.02 secARIMA(1,0,0)(2,1,0)[12] intercept   : AIC=1298.706, Time=0.12 secARIMA(1,0,0)(1,1,1)[12] intercept   : AIC=1299.277, Time=0.36 secARIMA(1,0,0)(0,1,1)[12] intercept   : AIC=1299.289, Time=0.05 secARIMA(1,0,0)(2,1,1)[12] intercept   : AIC=1300.076, Time=0.46 secARIMA(0,0,0)(1,1,0)[12] intercept   : AIC=1300.359, Time=0.11 secARIMA(2,0,0)(1,1,0)[12] intercept   : AIC=1300.344, Time=0.16 secARIMA(1,0,1)(1,1,0)[12] intercept   : AIC=1300.399, Time=0.17 secARIMA(0,0,1)(1,1,0)[12] intercept   : AIC=1298.227, Time=0.19 secARIMA(0,0,1)(0,1,0)[12] intercept   : AIC=1300.004, Time=0.02 secARIMA(0,0,1)(2,1,0)[12] intercept   : AIC=1298.163, Time=0.24 secARIMA(0,0,1)(2,1,1)[12] intercept   : AIC=1299.854, Time=0.37 secARIMA(0,0,1)(1,1,1)[12] intercept   : AIC=1299.239, Time=0.34 secARIMA(0,0,0)(2,1,0)[12] intercept   : AIC=1300.543, Time=0.09 secARIMA(1,0,1)(2,1,0)[12] intercept   : AIC=1300.219, Time=0.40 secARIMA(0,0,2)(2,1,0)[12] intercept   : AIC=1300.184, Time=0.32 secARIMA(1,0,2)(2,1,0)[12] intercept   : AIC=1302.144, Time=0.50 secARIMA(0,0,1)(2,1,0)[12]             : AIC=1301.353, Time=0.19 secBest model:  ARIMA(0,0,1)(2,1,0)[12] intercept
Total fit time: 5.012 seconds
  • 没有BIC,AIC is enough, AIC 越小越好
  • review:SARIMA(p,d,q)(P,D,Q)s
    • p: AR 的 # of lags, PACF
    • d: # of differencing
    • q: MA 的 # of lags, ACF
    • P: seasonality for AR
    • D: seasonality for differencing
    • Q: seasonality for MA
    • s: # Of time steps for a single seasonal period

Evaluate model: analyze its summary

衡量刚刚得到的 best model–> 输入刚刚的参数, 用train parts fit model

from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(train['Sales'], order=(0,0,1), seasonal_order=(2,1,0,12),enforce_stationary=False,enforce_invertibility=False)
results = model.fit()
results.summary()
SARIMAX Results
Dep. Variable: Sales No. Observations: 92
Model: SARIMAX(0, 0, 1)x(2, 1, [], 12) Log Likelihood -646.676
Date: Fri, 28 Jan 2022 AIC 1301.353
Time: 18:37:07 BIC 1310.881
Sample: 01-01-1964 HQIC 1305.173
- 08-01-1971
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ma.L1 0.2411 0.096 2.500 0.012 0.052 0.430
ar.S.L12 -0.0456 0.072 -0.638 0.523 -0.186 0.095
ar.S.L24 0.2839 0.105 2.711 0.007 0.079 0.489
sigma2 5.868e+05 7.71e+04 7.607 0.000 4.36e+05 7.38e+05
Ljung-Box (L1) (Q): 0.03 Jarque-Bera (JB): 4.88
Prob(Q): 0.87 Prob(JB): 0.09
Heteroskedasticity (H): 1.86 Skew: -0.02
Prob(H) (two-sided): 0.11 Kurtosis: 4.21

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

  1. No. Observations: 92 --> train data
  2. ma.L1–> (0,0,1) lags for MA is 1

    ar.S.L12 ar.S.L24–> (2,1,0) lags of seasonality for AR is 2

    • and because seasonaily = 12. Lags jump back twice
  3. P>|z| if p-value<0.05, the coefficient to achieve is significant ->proceed with that
  4. 最下面那一坨是其它的衡量指标,可以用于比较SARIMA和其它类型的model

Find the accuracy: Prediction on Test Data

start=len(train)
end = len(train)+len(test)-1
df['forecast'] = results.predict(start=start,end=end,dynamic=True)
df['forecast']
Month
1964-01-01            NaN
1964-02-01            NaN
1964-03-01            NaN
1964-04-01            NaN
1964-05-01            NaN...
1972-05-01    4335.597877
1972-06-01    4637.333846
1972-07-01    4811.011815
1972-08-01    1639.045080
1972-09-01    5216.793461
Name: forecast, Length: 105, dtype: float64
import matplotlib.pyplot as plt
df[['Sales','forecast']].plot(figsize=(12,8))
plt.show()


test['Sales'].mean()
5711.384615384615
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(df['forecast'].dropna(),test['Sales']))
rmse
643.8346721715318

RSME << Mean Test Data -> Model is good

Prediction (on Unseen Data)

用所有已知的数据训练模型来预测 unseen

model2 = SARIMAX(df['Sales'], order = (0,0,1), seasonal_order=(2,1,0,12),enforce_stationarity=False,enforce_invertibility=False)
results2 = model2.fit()
results2.summary()
SARIMAX Results
Dep. Variable: Sales No. Observations: 105
Model: SARIMAX(0, 0, 1)x(2, 1, [], 12) Log Likelihood -559.547
Date: Fri, 28 Jan 2022 AIC 1127.095
Time: 18:37:07 BIC 1136.031
Sample: 01-01-1964 HQIC 1130.640
- 09-01-1972
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ma.L1 0.3183 0.105 3.036 0.002 0.113 0.524
ar.S.L12 -0.2382 0.114 -2.086 0.037 -0.462 -0.014
ar.S.L24 0.1056 0.092 1.148 0.251 -0.075 0.286
sigma2 6.382e+05 9.14e+04 6.979 0.000 4.59e+05 8.17e+05
Ljung-Box (L1) (Q): 0.25 Jarque-Bera (JB): 3.51
Prob(Q): 0.62 Prob(JB): 0.17
Heteroskedasticity (H): 0.43 Skew: -0.05
Prob(H) (two-sided): 0.05 Kurtosis: 4.10

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

ar.S.L24 的 p-value >0.05, but it’s inside the model, we can proceed with the model ?????

df.tail()
Sales forecast
Month
1972-05-01 4618 4335.597877
1972-06-01 5312 4637.333846
1972-07-01 4298 4811.011815
1972-08-01 1413 1639.045080
1972-09-01 5877 5216.793461
from pandas.tseries.offsets import DateOffset
future_dates = [df.index[-1]+ DateOffset(months=x) for x in range(0,24)]
future_datest_df = pd.DataFrame(index=future_dates[1:],columns=df.columns)
future_datest_df
Sales forecast
1972-10-01 NaN NaN
1972-11-01 NaN NaN
1972-12-01 NaN NaN
1973-01-01 NaN NaN
1973-02-01 NaN NaN
1973-03-01 NaN NaN
1973-04-01 NaN NaN
1973-05-01 NaN NaN
1973-06-01 NaN NaN
1973-07-01 NaN NaN
1973-08-01 NaN NaN
1973-09-01 NaN NaN
1973-10-01 NaN NaN
1973-11-01 NaN NaN
1973-12-01 NaN NaN
1974-01-01 NaN NaN
1974-02-01 NaN NaN
1974-03-01 NaN NaN
1974-04-01 NaN NaN
1974-05-01 NaN NaN
1974-06-01 NaN NaN
1974-07-01 NaN NaN
1974-08-01 NaN NaN
future_df = pd.concat([df,future_datest_df]) #合起来,好plot
future_df
Sales forecast
1964-01-01 2815 NaN
1964-02-01 2672 NaN
1964-03-01 2755 NaN
1964-04-01 2721 NaN
1964-05-01 2946 NaN
... ... ...
1974-04-01 NaN NaN
1974-05-01 NaN NaN
1974-06-01 NaN NaN
1974-07-01 NaN NaN
1974-08-01 NaN NaN

128 rows × 2 columns

future_df['forecast']=results2.predict(start=104,end=128,dynamic=True)
future_df[['Sales','forecast']].plot(figsize=(12,8))
plt.show()

【SARIMAX】Champagne Sales相关推荐

  1. 【数据挖掘】数据挖掘总结 ( 拉普拉斯修正 | 贝叶斯分类器示例2 ) ★

    文章目录 一. 贝叶斯分类器分类的流程 二. 拉普拉斯修正 三. 贝叶斯分类器示例2 参考博客 : [数据挖掘]贝叶斯分类 ( 贝叶斯分类器 | 贝叶斯推断 | 逆向概率 | 贝叶斯公式 | 贝叶斯公 ...

  2. 【实验】利用系统自带脚本utlsampl.sql创建scott用户及样本数据

    很多的演示程序都是以scott用户及其用户下的表做例子的,于是,快速的创建这个用户和初始化表中的数据是必须的.在Oracle 10g环境中这个过程很简便,只需要以sys用户执行一下$ORACLE_HO ...

  3. 【机器学习】机器学习模型解释神器:Shapash

    什么是 Shapash 模型可解释性和可理解性一直是许多研究论文和开源项目的关注的重点.并且很多项目中都配备了数据专家和训练有素的专业人员.Shapash 是一个 Python 库,用于描述 AI 模 ...

  4. 【Python】利用 Python 分析了一波月饼,我得出的结论是?

    作者:Cherich_sun 来源: https://blog.csdn.net/w_yuqing/article/details/119888532 本文为读者投稿 马上八月十五了,又迎来了一年一度 ...

  5. 【SpringBoot】SpringBoot 操作 Excel 完整示例(含源码GitHub)

    参考博客:博客园 - Spring Boot 操作 Excel 示例GitHub:spring-boot-study/spring-boot-study-excel/ 1.新建 Spring Boot ...

  6. 【sql】leetcode习题 (共 42 题)

    [175]Combine Two Tables (2018年11月23日,开始集中review基础) Table: Person +-------------+---------+ | Column ...

  7. 【转载】DNN6开源CMS

    DotNetNuke是?DotNetNuke (DNN) 资源下载DNN网站展示DNN视频教程 DotNetNuke, DNN技术及应用 DotNetNuke/DNN安装,汉化,教程,资源 http: ...

  8. 【英语学习】【WOTD】leviathan 释义/词源/示例

    leviathan n. [luh-VYE-uh-thun] 大海怪,庞然大物 Definition 1 a often capitalized Leviathan : a sea monster d ...

  9. 【英语学习】【WOTD】charisma 释义/词源/示例

    charisma n. [kuh-RIZ-muh] Definition 1: a personal magic of leadership arousing special popular loya ...

最新文章

  1. 根据前序、中序序列重建一棵二叉树的代码实现
  2. 报告 | 超级智能城市2.0 – 人工智能引领新风尚(附下载)
  3. Git学习系列(二)创建本地仓库及文件操作
  4. Linux静态IP设置
  5. springboot 使用i18n进行国际化
  6. vb.net 2019-机器学习ml.net情绪分析(3)
  7. 计算机控制z反变换公式,第三章 计算机控制系统的数学描述(修正Z变换).ppt
  8. 昭阳E47A每天第一次开机启动速度慢的原因
  9. Atlassian发布Bamboo 6.0和Bitbucket Server 5.0
  10. [ZOJ3213] Beautiful Meadow
  11. PE使用万能驱动7解决USB3、NVME驱动问题及台式机、笔记本电脑使用PE因驱动问题不能加载硬盘问题(YOGA 14C 因PE加载不了硬盘亲测可用)
  12. 【精品收藏】世界上最有智慧的人是怎样理性思考的?查理·芒格的100个思维模型...
  13. JavaScript模板引擎
  14. SEO数据变化,检测网站死链接、蜘蛛访问、whois
  15. 稳压电源: 电路图及类型
  16. Java——(九)IO流
  17. 图片按指定比例缩放并压缩至指定大小,解决保存图片文件体积过大bug。
  18. CAD创建块后图形依然保持原状?AUTOCAD——特殊字符如何输入
  19. 怎样用Java求水仙花数和水仙花数的数量
  20. ALC5621声卡调试记录

热门文章

  1. PHP使用 Redis批量删除过期数据
  2. 解决:Mac安装 VSCode 程序坞 没有应用图标,问题原因
  3. 视频教程-ThinkPHP5实现QQ快捷登录初级入门-PHP
  4. 无界鼠标~synergy~Ubuntu
  5. 我见过的最大的Flash游戏,最有意思的Puzzle类游戏
  6. 阿里云下配置keepalive
  7. 前端canvas能压缩图片?
  8. 如何选择物联网智能井盖网关?
  9. oracle数据库连接拒绝,Oracle 数据库连接失败问题
  10. 万字渗透测试入门知识点,自学查漏补缺