Using Regression Models to Predict Starbucks Customers' Purchasing Habits

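Abstract

Data Description

Data Analysis Code Implementation

Loading and Preprocessing the Datasets

Answering the Research Questions

Data Modeling

Conclusion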

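Abstract

This analysis applies the CRISP-DM process to the Starbucks dataset. It uses data visualization and data modeling to explore the habits of Starbucks customers and answer several business questions. The questions explored are:

How does gender affect how much a person spends at Starbucks? Do men spend more than women, or the other way round?
How many people view and complete an offer? How many complete an offer without opening it first?
Which factors contribute most to how much a person spends at Starbucks?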

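Data Description

This dataset contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual promotion such as a discount or a BOGO (buy one, get one free) deal. Some users might not receive any offer during certain weeks. Not all users receive the same offer, and that is the challenge to solve with this dataset.

The task is to combine transaction, demographic, and offer data to determine which demographic groups respond best to which offer type. The dataset is a simplified version of the data generated by the real Starbucks app, because it covers only one product whereas Starbucks actually sells dozens.

Every offer has a validity period before it expires. For example, a BOGO offer might be valid for only 5 days. The validity period is recorded in the dataset. Not all offers have the same validity period; for example, if an offer is valid for 7 days, you can assume that the customer feels the influence of the offer for 7 days after receiving it.

The dataset contains transaction data for purchases made on the app, including the timestamp of each purchase and the amount spent. It also has a record for each offer that a user receives and a record for each time a user actually views an offer. It further includes users who make purchases through the app without having received or seen an offer.

Example:

A user might receive a discount offer on Monday: buy 10 dollars, get 2 dollars off. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this dataset. Customers do not opt into the offers they receive; in other words, a user can receive an offer, never actually view it, and still complete it. For example, a user might receive the "buy 10 dollars, get 2 dollars off" offer but never open it during the 10-day validity period, and still spend 15 dollars during those ten days. The dataset will contain an offer completion record; however, the customer was not influenced by the offer, because the customer never viewed it.

This makes data cleaning especially important and tricky.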
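The completion rule described above can be sketched in a few lines of pandas. This is a minimal illustration and not the article's cleaning code: the function name `classify_offer`, the column layout, and the simplification that a view counts whenever it happens inside the validity window are all assumptions made for the example.

```python
import pandas as pd

def classify_offer(events, difficulty, duration_days):
    """Classify one received offer for one customer (hypothetical helper).

    events: DataFrame with columns 'event', 'time' (hours since start) and
    'amount' (NaN for non-transaction rows), holding a single offer only.
    """
    received = events.loc[events["event"] == "offer received", "time"].min()
    deadline = received + duration_days * 24  # validity window, in hours

    # Transactions that fall inside the validity window
    in_window = events[(events["event"] == "transaction")
                       & (events["time"] >= received)
                       & (events["time"] <= deadline)]
    completed = in_window["amount"].sum() >= difficulty

    # Assumption: the offer can only influence the customer if it was viewed in time
    views = events[(events["event"] == "offer viewed") & (events["time"] <= deadline)]
    viewed = not views.empty

    if not completed:
        return "not_completed"
    return "viewed_and_completed" if viewed else "not_viewed_and_completed"

# The example from the text: 15 dollars spent within 10 days, offer never opened
events = pd.DataFrame({
    "event": ["offer received", "transaction", "transaction"],
    "time": [0, 48, 120],
    "amount": [float("nan"), 6.0, 9.0],
})
status = classify_offer(events, difficulty=10, duration_days=10)

# Same purchases, but the customer opened the offer after 12 hours
viewed_row = pd.DataFrame({"event": ["offer viewed"], "time": [12], "amount": [float("nan")]})
status_viewed = classify_offer(pd.concat([events, viewed_row]), difficulty=10, duration_days=10)
```

The first customer completes the offer without being influenced by it, which is exactly the case the cleaning step has to separate from genuine responses.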

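You also need to take into account that some demographic groups will make purchases even if they do not receive an offer. From a business perspective, if a customer is going to make a 10-dollar purchase without an offer, you would not want to send that customer a "buy 10 dollars, get 2 dollars off" offer. You will want to try to assess what a given demographic group will buy when it receives no offer at all.

Because this is an open-ended analysis project, you are free to analyze the data in any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether someone will respond to an offer. Or you need not build a machine learning model at all: you could develop a set of heuristics that determine which offer to send to each customer (for example, 75 percent of 35-year-old female customers responded to offer A versus 40 percent of the same demographic for offer B, so send offer A).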
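The heuristic approach can be illustrated with a small grouping operation. The table below is made-up toy data, and the column names `segment`, `offer`, and `responded` are assumptions for the sketch, not columns of the real dataset.

```python
import pandas as pd

# Hypothetical toy data: one row per (customer, offer) pair
responses = pd.DataFrame({
    "segment":   ["F_35", "F_35", "F_35", "F_35", "M_50", "M_50"],
    "offer":     ["A",    "A",    "B",    "B",    "A",    "B"],
    "responded": [1,      1,      0,      1,      0,      1],
})

# Response rate of every segment to every offer
rates = responses.groupby(["segment", "offer"])["responded"].mean().unstack()

# Heuristic: send each segment the offer with the highest response rate
best_offer = rates.idxmax(axis=1)
print(rates)
print(best_offer)
```

With real data the segments would come from the demographic columns in profile.json, and small segments would need a minimum sample size before their rates can be trusted.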

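The data is contained in three files:

portfolio.json - offer ids and metadata about each offer (duration, type, etc.)

profile.json - demographic data for each customer

transcript.json - records for transactions, offers received, offers viewed, and offers completed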

portfolio.json

id (string) - offer id
offer_type (string) - type of offer, i.e. BOGO, discount, informational
difficulty (int) - minimum required spend to complete an offer
reward (int) - reward given for completing an offer
duration (int) - time for offer to be open, in days
channels (list of strings)

profile.json

age (int) - age of the customer
became_member_on (int) - date when customer created an app account
gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
id (str) - customer id
income (float) - customer's income

transcript.json

event (str) - record description (i.e. transaction, offer received, offer viewed, etc.)
person (str) - customer id
time (int) - time in hours since start of test. The data begins at time t=0
value - (dict of strings) - either an offer id or transaction amount depending on the record

Data Analysis Code Implementation

Loading and Preprocessing the Datasets

import pandas as pd

# Each file stores one JSON record per line
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

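Print the first few rows of each dataset to inspect a sample of the data:

Plot the relationships in the portfolio and profile datasets to visualize them:

Discount offers have the longest durations, and BOGO offers have the highest rewards.

Some interesting trends are visible in the data. For example, in terms of age, most female purchasers tend to be younger than most male purchasers.

To explore the transcript dataset, we look at the distribution of the types of offers people respond to, plotting the offer-related events and the channels through which offers are sent.

Most offers are sent by email and mobile.

To explore the data further, we need to clean and preprocess it. The preprocessing rules are as follows: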

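In the transcript dataset, flatten the value dictionary to extract the offer id or transaction amount
In the portfolio dataset, split the list stored in the channels column into separate columns
In the profile dataset, format the became_member_on date values and drop rows with missing values
For every dataset, convert text-based label values to one-hot form.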
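The four rules above can be sketched as follows. This is an illustration on tiny made-up tables rather than the article's actual preprocessing code; in particular, the dictionary keys `offer id`, `offer_id`, and `amount` and the `YYYYMMDD` integer date format are assumptions about the raw files.

```python
import pandas as pd

# Tiny stand-ins for the three tables (all values are made up)
transcript = pd.DataFrame({
    "person": ["p1", "p1"],
    "event": ["offer received", "transaction"],
    "time": [0, 6],
    "value": [{"offer id": "o1"}, {"amount": 4.5}],
})
portfolio = pd.DataFrame({
    "id": ["o1"], "offer_type": ["bogo"], "channels": [["email", "mobile"]],
})
profile = pd.DataFrame({
    "id": ["p1", "p2"], "gender": ["F", None],
    "became_member_on": [20170212, 20180101], "income": [72000.0, None],
})

# 1. Flatten the value dictionary into offer_id and amount columns
transcript["offer_id"] = transcript["value"].apply(
    lambda v: v.get("offer id", v.get("offer_id")))
transcript["amount"] = transcript["value"].apply(lambda v: v.get("amount"))
transcript = transcript.drop(columns="value")

# 2. Turn the channels list into one indicator column per channel
channels = portfolio["channels"].explode()
channel_dummies = pd.get_dummies(channels, prefix="channel").groupby(level=0).max()
portfolio = portfolio.drop(columns="channels").join(channel_dummies)

# 3. Parse the membership date and drop rows with missing values
profile["became_member_on"] = pd.to_datetime(
    profile["became_member_on"].astype(str), format="%Y%m%d")
profile = profile.dropna()

# 4. One-hot encode the text labels (spaces replaced so column names stay tidy)
transcript["event"] = transcript["event"].str.replace(" ", "_")
transcript = pd.get_dummies(transcript, columns=["event"])
profile = pd.get_dummies(profile, columns=["gender"])
portfolio = pd.get_dummies(portfolio, columns=["offer_type"])
```

After these steps the tables carry columns such as event_offer_received and gender_F, which are the names the plotting code in the following sections relies on.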

The profile table after processing looks as follows (the other tables were processed in a similar way):

Answering the Research Questions

Using the processed transcript and profile data, plot the amount spent by gender and the number of transactions by gender:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

plt.rcParams["figure.figsize"] = (12, 4)
fig, (ax1, ax2) = plt.subplots(ncols=2)

# Box plot of transaction amounts by gender (gender_F: 0 = male, 1 = female)
g = sns.boxplot(x="gender_F", y="amount", hue="gender_F", palette=["m", "g"],
                data=transaction_data_only, ax=ax1)
g.set_yscale('log')  # log scale on the Y axis
legend = g.legend()
legend.set_title('Gender')
for t, l in zip(legend.texts, ['Male', 'Female']):
    t.set_text(l)
# Title and axis labels of the box plot (the legend replaces the tick labels)
ax1.set(title='Amount spent on transactions by Gender', xlabel='Gender', ylabel='Amount', xticks=[])

# Bar plot of the number of transactions by gender
h = sns.countplot(x='gender_F', data=transaction_data_only, ax=ax2)
ax2.set(title='Number of Transactions by Gender', xlabel='Gender', ylabel='Transactions')
ax2.set_xticks([0, 1])
ax2.set_xticklabels(['Male', 'Female'])

# Summary statistics of the required attributes, split into male and female transactions
description_columns = ['amount', 'age', 'income']
description_labels = [cat + '_' + gen for cat in ['Male', 'Female'] for gen in description_columns]
reorder_columns = [gen + '_' + cat for cat in description_columns for gen in ['Male', 'Female']]
description = pd.concat([transaction_data_m[description_columns].describe(),
                         transaction_data_f[description_columns].describe()], axis=1)
description.columns = description_labels
description = description[reorder_columns]  # put the male and female columns side by side

plt.show()

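This answers the first question, how gender affects spending. Women spend more than men at Starbucks, with an average difference of about 6 dollars per purchase.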
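The per-purchase gap can also be read directly from a grouped summary instead of from the plots. The sketch below uses made-up amounts, so its numbers are illustrative only; with the real data, `transactions` would be the processed transaction table with its gender_F indicator column.

```python
import pandas as pd

# Made-up transactions; gender_F is 1 for female and 0 for male
transactions = pd.DataFrame({
    "gender_F": [0, 0, 0, 1, 1],
    "amount":   [10.0, 12.0, 14.0, 16.0, 20.0],
})

# Number of transactions and typical amount per gender
summary = transactions.groupby("gender_F")["amount"].agg(["count", "mean", "median"])

# Difference in the average amount per purchase (female minus male)
mean_gap = summary.loc[1, "mean"] - summary.loc[0, "mean"]
print(summary)
print("Average gap per purchase:", mean_gap)
```

The count column captures how often each group buys and the mean column how much each purchase costs, which are the two quantities the plots compare.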

Using the processed transcript and profile data, plot the status of the offers sent and the number of transactions associated with an offer:

%%time
# Jupyter cell magic above: reports how long the cell takes to run
completion_details = []

'''
Go through each group in the transaction grouping. Because iterating can be slow,
we will use vectorized operations inside the main loop.
'''
for i, g in transcript_by_group:
    record = {}

    # Record status based on whether the offer has been viewed
    if g[['event_offer_received', 'event_offer_viewed', 'event_offer_completed']].sum(axis=0).sum() == 3:
        record['status'] = 'viewed_and_completed'
    elif g[['event_offer_received', 'event_offer_completed']].sum(axis=0).sum() == 2:
        record['status'] = 'not_viewed_and_completed'
    else:
        record['status'] = 'no_offer'

    # Get required details
    person_id, offer_id = g['person'].iloc[0], g['offer_id'].iloc[0]
    first_purchase, last_purchase = g['time'].min(), g['time'].max()

    # Get this person's transactions that fall within the time window of the offer
    try:
        person_transactions = transactions_only.get_group(person_id)
        offer_transactions = person_transactions[(person_transactions.time > first_purchase)
                                                 & (person_transactions.time < last_purchase)]
    except KeyError:
        offer_transactions = []  # this person made no transactions at all

    # Store these details into the dictionary
    record['num_transactions'] = len(offer_transactions)
    record['person_id'], record['offer_id'] = person_id, offer_id
    record['first_purchase'], record['last_purchase'] = first_purchase, last_purchase
    completion_details.append(record)

print('Completed forming transactions')
completion_details_df = pd.DataFrame(completion_details)

plt.rcParams["figure.figsize"] = (12, 4)
fig, (ax1, ax2) = plt.subplots(ncols=2)

# Plotting the status of offers; a fixed order keeps the bars aligned with the labels
status_order = ['no_offer', 'viewed_and_completed', 'not_viewed_and_completed']
g = sns.countplot(x='status', data=completion_details_df, order=status_order, ax=ax1)
g.set_xticks(range(len(status_order)))
g.set_xticklabels(labels=['No Offer', 'Viewed and Completed Offer', 'Did not view but completed offer'], rotation=45)

# Plotting number of transactions made when associated with an offer
h = sns.countplot(x='num_transactions', data=completion_details_df[completion_details_df.status != 'no_offer'], ax=ax2)
plt.show()

This answers the second question. 25% of people view an offer and complete it, while 10% complete an offer without viewing it first.
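Percentages like these can be derived from the status column with a normalized count. The frame below is a made-up stand-in for completion_details_df, with counts chosen only to mirror the reported shares; it is not the real data.

```python
import pandas as pd

# Made-up stand-in for completion_details_df (20 records)
completion_details_df = pd.DataFrame({
    "status": ["no_offer"] * 13
              + ["viewed_and_completed"] * 5
              + ["not_viewed_and_completed"] * 2
})

# Share of each status, in percent
shares = completion_details_df["status"].value_counts(normalize=True) * 100
print(shares.round(1))
```

Note that each record stands for one group of the transcript grouping, so the shares describe those groups rather than unique customers unless the data is deduplicated per person first.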

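Data Modeling

With the cleaned data, we can now use a supervised model to try to predict the purchase amount. Because the amount is a continuous variable, we need a regression model. We split the data into X and Y, where Y is the purchase amount.

The metric used to score the models is the R2 score, which works well for continuous outputs.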
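For reference, the R2 score (coefficient of determination) compares the model's squared error with that of a baseline that always predicts the mean of the observed amounts. Here y_i are the true amounts, \hat{y}_i the predictions, and \bar{y} the mean of the true amounts:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

A score of 1 means perfect predictions, 0 means the model does no better than predicting the mean, and a negative score means it does worse than that baseline. This is how to read the scores reported for the models in this section.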

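Three models were used in the modeling step. I started with a support vector machine for regression (SVR) with a polynomial kernel. The trained model has an R2 score of 0.079.

Although the SVM gives a positive R2 score, its kernel is non-linear, so we cannot inspect feature importances. Feature importances are useful because they help us understand which attributes contribute most to predicting the label. For this we use other models, such as a random forest.

After training a random forest model with default parameters, the R2 score is -0.0823.

This R2 score is lower than the SVM score, so we should try tuning the model to find the parameters that give the best results. For this we can use grid search. The parameters to tune are:

Number of estimators: 10, 50, or 100
Maximum depth of a tree: 5, 10, 30, or 80
Maximum number of features considered in a tree: 1, 3, 8, or 15
Minimum number of samples required for a split: 3, 5, 10, 30, 50, or 100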
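A grid search over these values could look like the sketch below, which uses scikit-learn's GridSearchCV. It is not the article's tuning code: the synthetic data, the 3-fold cross-validation, the fixed random seeds, and the reduced `demo_grid` (used only so the example runs quickly) are all assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the transaction features and amounts
X, y = make_regression(n_samples=200, n_features=15, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# The full grid listed above: 3 * 4 * 4 * 6 = 288 parameter combinations
full_grid = {
    "n_estimators": [10, 50, 100],
    "max_depth": [5, 10, 30, 80],
    "max_features": [1, 3, 8, 15],
    "min_samples_split": [3, 5, 10, 30, 50, 100],
}

# Reduced grid (first two values of each parameter) to keep the demo fast
demo_grid = {name: values[:2] for name, values in full_grid.items()}

# Score every combination with cross-validated R2, then refit the best one
search = GridSearchCV(RandomForestRegressor(random_state=0), demo_grid, scoring="r2", cv=3)
search.fit(X_train, y_train)

tuned_rf_model = search.best_estimator_
test_r2 = tuned_rf_model.score(X_test, y_test)  # R2 on the held-out data
print(search.best_params_, test_r2)
```

Passing `full_grid` instead of `demo_grid` runs the complete search; it takes noticeably longer because every combination is fitted once per fold.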

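SVM core code:

from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Features are every column except the target (amount) and the timestamp
data_x = transaction_data_only.drop(['amount', 'time'], axis=1)
data_y = transaction_data_only['amount']
scaler = StandardScaler()
data_x = scaler.fit_transform(data_x)

# Hold out 33% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.33)

# Support vector regression with a 7-degree polynomial kernel
sv_model = SVR(kernel='poly', degree=7)
sv_model.fit(X_train, y_train)
print('SVR model trained on 7-degree poly kernel')

y_preds_svr = sv_model.predict(X_test)
print('SVR Model R2 score: ', r2_score(y_test, y_preds_svr))
# SVR Model R2 score:  0.0791369487837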

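The tuned random forest ends up with a better R2 score than both the SVM and the default random forest, although the difference between the SVM and random forest results is small.

With a random forest, we can inspect the feature importances of the attributes used to train the model.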

feature_names = transaction_data_only.drop(['amount', 'time'], axis=1).columns  # same columns the model was trained on
pd.DataFrame(list(zip(feature_names, tuned_rf_model.feature_importances_)), columns=['Attribute', 'Feature Importance']).sort_values(by='Feature Importance', ascending=False)

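This answers the third question, namely which attributes affect the purchase amount most. Income and age are the attributes most likely to affect how much a person spends at Starbucks.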

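Conclusion

In this analysis, exploring the Starbucks transaction dataset answered three questions.

Gender does affect how much a person spends at Starbucks. Men tend to make more purchases, whereas women tend to make more expensive ones. On average, women spend about 6 dollars more per purchase at Starbucks.
A sizeable share of members do view and complete offers at Starbucks. According to the dataset, 25% of individuals received, viewed, and completed an offer by making a purchase tied to it. In addition, 10% completed an offer unintentionally, without actually viewing it first.
The factors that contribute most to how much a person spends at Starbucks are income and age. Although there are other contributing factors, such as gender (as noted above), income has the highest importance when predicting the purchase amount, at over 55%.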