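Extreme Rare Event Classification using Autoencoders in Keras

In this post, we will learn how to implement an autoencoder for building a rare-event classifier. We will use the real-world rare-event dataset described in [1].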

Background

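What is an extreme rare event?
In a rare-event problem, we have an unbalanced dataset. That is, we have far fewer positively labeled samples than negatively labeled ones. In a typical rare-event problem, the positively labeled data are around 5-10% of the total. In an extreme rare-event problem, less than 1% of the data is positively labeled. For example, in the dataset used here it is around 0.6%.

Such extreme rare-event problems are quite common in the real world, for example, sheet breaks and machine failures in manufacturing, and clicks or purchases in the online industry.

Classifying these rare events is quite challenging. Recently, deep learning has been widely used for classification. However, the small number of positively labeled samples hinders its application. No matter how large the data, the use of deep learning is limited by the number of positively labeled samples.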

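Why should we still bother to use deep learning?
This is a legitimate question. Why should we not think of using some other machine learning approach?

The answer is subjective. We can always go with a classical machine learning approach. To make it work, we can undersample the negatively labeled data to get a close-to-balanced dataset. Since we have about 0.6% positively labeled data, undersampling will leave a dataset that is roughly 1% of the size of the original data. A machine learning approach, such as SVM or Random Forest, will still work on a dataset of this size. However, its accuracy will be limited, and we will not use the information present in the remaining ~99% of the data.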
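To make the undersampling arithmetic concrete, here is a minimal sketch. The labels are synthetic and only sized like the dataset in this post (about 18k rows, 124 positives); the variable names are illustrative assumptions, not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labels (assumption): 18,000 rows with 124 positives.
n_rows, n_pos = 18_000, 124
y = np.zeros(n_rows, dtype=int)
y[rng.choice(n_rows, size=n_pos, replace=False)] = 1

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Keep every positive row and an equally sized random sample of negatives.
neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=False)
balanced_idx = np.sort(np.concatenate([pos_idx, neg_sample]))

kept_fraction = len(balanced_idx) / n_rows
print(f"rows kept: {len(balanced_idx)} of {n_rows} ({kept_fraction:.1%})")
```

A balanced sample keeps only a percent or so of the rows, which is why the remaining negatives go unused in this approach.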

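If the data is sufficient, deep learning methods are potentially more capable. They also allow flexibility for model improvement by using different architectures. We will, therefore, attempt to use a deep learning method.

In this post, we will learn how to use a simple dense-layer autoencoder to build a rare-event classifier. The purpose of this post is to demonstrate the implementation of an Autoencoder for extreme rare-event classification. We leave the exploration of different Autoencoder architectures and configurations to the reader. Please share in the comments if you find anything interesting.

Autoencoder for Classification
The autoencoder approach to classification is similar to anomaly detection. In anomaly detection, we learn the pattern of a normal process. Anything that does not follow this pattern is classified as an anomaly. For a binary classification of rare events, we can use a similar approach with autoencoders (derived from [2]).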

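What is an autoencoder?

An autoencoder is made of two modules: an encoder and a decoder.
The encoder learns the underlying features of a process. These features typically have a reduced dimension.
The decoder can recreate the original data from these underlying features.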
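The encoder/decoder split can be illustrated without a neural network. The sketch below uses a linear stand-in (the top singular vectors of the data, i.e. PCA) rather than the Keras model built later in this post. The data, the dimensions, and the helper names `encode` and `decode` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 200 samples of 6 features driven by 2 latent factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing
X = X - X.mean(axis=0)

# The top-k right singular vectors give the best linear encoder/decoder pair.
k = 2
_, _, vt = np.linalg.svd(X, full_matrices=False)
components = vt[:k]  # shape (k, n_features)

def encode(x):
    """Compress samples to k underlying features."""
    return x @ components.T

def decode(z):
    """Rebuild the original features from the compressed ones."""
    return z @ components

codes = encode(X)
X_hat = decode(codes)
max_abs_error = np.abs(X - X_hat).max()
print(codes.shape, X_hat.shape, max_abs_error)
```

Because this toy data really has only two underlying factors, two encoded features are enough to rebuild all six columns almost exactly.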

How to use an Autoencoder for rare-event classification?

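We divide the data into two parts: positively labeled and negatively labeled.
The negatively labeled data is treated as the normal state of the process. A normal state is when the process is eventless.
We will ignore the positively labeled data and train an Autoencoder on only the negatively labeled data.
This Autoencoder has now learned the features of the normal process.
A well-trained Autoencoder will accurately reconstruct any new data coming from the normal state of the process (as it will have the same pattern or distribution).
Therefore, the reconstruction error will be small.
However, if we try to reconstruct data from a rare event, the Autoencoder will struggle.
This will make the reconstruction error high during the rare event.
We can catch such high reconstruction errors and label them as rare-event predictions.
This procedure is similar to anomaly detection methods.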
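The steps above can be sketched end to end on toy data. This is only an illustration under stated assumptions: a linear reconstructor (PCA via SVD) stands in for the Autoencoder, the data is synthetic, and the 99th-percentile threshold is an arbitrary choice, not the one used later in this post.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "normal" process (assumption): 5 features driven by 2 latent factors.
MIXING = rng.normal(size=(2, 5))

def sample_normal(n):
    latent = rng.normal(size=(n, 2))
    return latent @ MIXING + 0.05 * rng.normal(size=(n, 5))

X_train_normal = sample_normal(500)
X_new_normal = sample_normal(100)
X_new_rare = 2.0 * rng.normal(size=(20, 5))  # does not follow the normal pattern

# Fit the stand-in model on normal data only.
mean = X_train_normal.mean(axis=0)
_, _, vt = np.linalg.svd(X_train_normal - mean, full_matrices=False)
components = vt[:2]

def reconstruction_error(x):
    """Mean squared error between each row and its reconstruction."""
    centered = x - mean
    rebuilt = centered @ components.T @ components
    return np.mean((centered - rebuilt) ** 2, axis=1)

err_normal = reconstruction_error(X_new_normal)
err_rare = reconstruction_error(X_new_rare)

# Flag rows whose error exceeds what is typical for normal training data.
threshold = np.percentile(reconstruction_error(X_train_normal), 99)
flagged_normal = np.mean(err_normal > threshold)
flagged_rare = np.mean(err_rare > threshold)
print(f"flagged: normal {flagged_normal:.0%}, rare {flagged_rare:.0%}")
```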

Implementation

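Data and problem

This is binary labeled data from a pulp-and-paper mill for sheet breaks. Sheet breaks are a severe problem in paper manufacturing. A single sheet break causes a loss of several thousand dollars, and the mills see at least one break every day. This causes millions of dollars of yearly losses and work hazards.

Detecting a break event is challenging due to the nature of the process. As mentioned in [1], even a 5% reduction in breaks will bring a significant benefit to the mills.

The data contains about 18k rows collected over 15 days. The column y contains the binary labels, with 1 denoting a sheet break. The remaining columns are predictors. There are about 124 positively labeled samples (~0.6%).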

Download data here.

Code

Import the desired libraries.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from pylab import rcParams
import tensorflow as tf
from keras.models import Model, load_model
from keras.layers import Input, Dense
from keras.callbacks import ModelCheckpoint, TensorBoard
from keras import regularizers
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_recall_curve
from sklearn.metrics import recall_score, classification_report, auc, roc_curve
from sklearn.metrics import precision_recall_fscore_support, f1_score
from numpy.random import seed
seed(1)
# TensorFlow 1.x API; in TensorFlow 2.x the equivalent call is tf.random.set_seed(2).
from tensorflow import set_random_seed
set_random_seed(2)
SEED = 123 #used to help randomly select the data points
DATA_SPLIT_PCT = 0.2
rcParams['figure.figsize'] = 8, 6
LABELS = ["Normal","Break"]

Note that we are setting the random seeds for reproducibility of the result.

Data preprocessing

Now, we read and prepare the data.

df = pd.read_csv("data/processminer-rare-event-mts - data.csv")

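The objective of this rare-event problem is to predict a sheet break before it occurs. We will try to predict the break 4 minutes in advance. To build this model, we will shift the labels 2 rows up (which corresponds to 4 minutes). We could do this with df.y = df.y.shift(-2). However, in this problem we want the shift to work as follows: if row n is positively labeled,

Make rows (n-2) and (n-1) equal to 1. This will help the classifier learn to predict up to 4 minutes ahead.
Delete row n, because we do not want the classifier to learn to predict a break when it has already happened.
We will develop the following UDF for this curve shifting.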

sign = lambda x: (1, -1)[x < 0]

def curve_shift(df, shift_by):
    '''
    This function will shift the binary labels in a dataframe.
    The curve shift will be with respect to the 1s.
    For example, if shift is -2, the following process
    will happen: if row n is labeled as 1, then
    - Make rows (n+shift_by) through (n-1) equal to 1.
    - Remove row n.
    i.e. the labels will be shifted up to 2 rows up.

    Inputs:
    df       A pandas dataframe with a binary labeled column.
             This labeled column should be named as 'y'.
    shift_by An integer denoting the number of rows to shift.

    Output
    df       A dataframe with the binary labels shifted by shift.
    '''
    vector = df['y'].copy()
    for s in range(abs(shift_by)):
        tmp = vector.shift(sign(shift_by))
        tmp = tmp.fillna(0)
        vector += tmp
    labelcol = 'y'
    # Add vector to the df
    df.insert(loc=0, column=labelcol+'tmp', value=vector)
    # Remove the rows with labelcol == 1.
    df = df.drop(df[df[labelcol] == 1].index)
    # Drop labelcol and rename the tmp col as labelcol
    df = df.drop(labelcol, axis=1)
    df = df.rename(columns={labelcol+'tmp': labelcol})
    # Make the labelcol binary
    df.loc[df[labelcol] > 0, labelcol] = 1
    return df

# Shift the labels 2 rows up for the 4-minute-ahead prediction described above.
df = curve_shift(df, shift_by=-2)
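To see what this label shift does, here is a small self-contained sketch on a six-row toy frame. The helper `shift_labels_up` is a hypothetical, compact re-implementation written for this illustration (it is not the UDF above) and it assumes integer 0/1 labels in a column named y.

```python
import pandas as pd

def shift_labels_up(df, n_rows):
    """Mark the n_rows rows before each positive row as 1, then drop the positive rows."""
    y = df["y"]
    early = pd.Series(0, index=df.index)
    for k in range(1, n_rows + 1):
        # A row becomes positive if a break occurs k rows later.
        early = early | y.shift(-k, fill_value=0)
    out = df.loc[y == 0].copy()
    out["y"] = early[y == 0].astype(int)
    return out

# Toy frame (assumption): one break at row 3.
toy = pd.DataFrame({"y": [0, 0, 0, 1, 0, 0], "x1": [10, 11, 12, 13, 14, 15]})
shifted = shift_labels_up(toy, n_rows=2)
print(shifted)
```

Rows 1 and 2 are now labeled 1, and row 3, where the break itself occurred, is gone.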

Now, we divide the data into train, valid, and test sets. Then we will take the subset of data with only 0s to train the autoencoder.

df_train, df_test = train_test_split(df, test_size=DATA_SPLIT_PCT, random_state=SEED)
df_train, df_valid = train_test_split(df_train, test_size=DATA_SPLIT_PCT, random_state=SEED)
# Split each set by label, using that set's own y column as the mask.
df_train_0 = df_train.loc[df_train['y'] == 0]
df_train_1 = df_train.loc[df_train['y'] == 1]
df_train_0_x = df_train_0.drop(['y'], axis=1)
df_train_1_x = df_train_1.drop(['y'], axis=1)
df_valid_0 = df_valid.loc[df_valid['y'] == 0]
df_valid_1 = df_valid.loc[df_valid['y'] == 1]
df_valid_0_x = df_valid_0.drop(['y'], axis=1)
df_valid_1_x = df_valid_1.drop(['y'], axis=1)
df_test_0 = df_test.loc[df_test['y'] == 0]
df_test_1 = df_test.loc[df_test['y'] == 1]
df_test_0_x = df_test_0.drop(['y'], axis=1)
df_test_1_x = df_test_1.drop(['y'], axis=1)

Standardization

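It is usually better to use standardized data (each predictor rescaled to mean 0 and variance 1) for autoencoders. Note that the scaler is fitted only on the negatively labeled training data and then applied to every other split.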

scaler = StandardScaler().fit(df_train_0_x)
df_train_0_x_rescaled = scaler.transform(df_train_0_x)
df_valid_0_x_rescaled = scaler.transform(df_valid_0_x)
df_valid_x_rescaled = scaler.transform(df_valid.drop(['y'], axis = 1))
df_test_0_x_rescaled = scaler.transform(df_test_0_x)
df_test_x_rescaled = scaler.transform(df_test.drop(['y'], axis = 1))
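As a rough illustration of what the fit/transform split does, the NumPy sketch below estimates the mean and standard deviation on training rows only and reuses them on another split. The data and the helper name `standardize` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictors on very different scales (assumption).
X_train_normal = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.1], size=(1000, 2))
X_valid = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.1], size=(200, 2))

# "fit": statistics are estimated from the normal training rows only.
mean = X_train_normal.mean(axis=0)
std = X_train_normal.std(axis=0)  # population std (ddof=0), as StandardScaler uses

def standardize(x):
    """'transform': reuse the training statistics on any split."""
    return (x - mean) / std

X_train_scaled = standardize(X_train_normal)
X_valid_scaled = standardize(X_valid)
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

Only the training split is guaranteed to come out with exactly zero mean and unit variance. The other splits are merely close, which is the intended behavior because no validation or test statistics are used.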

Autoencoder Classifier

Initialization

First, we will initialize the Autoencoder architecture. We are building a simple autoencoder. More complex architectures and other configurations should be explored.

nb_epoch = 100
batch_size = 128
input_dim = df_train_0_x_rescaled.shape[1] #num of predictor variables,
encoding_dim = 32
hidden_dim = int(encoding_dim / 2)
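learning_rate = 1e-3  # despite its name, used below as the L1 activity-regularization strength; Adam keeps its default learning rate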
input_layer = Input(shape=(input_dim, ))
encoder = Dense(encoding_dim, activation="tanh", activity_regularizer=regularizers.l1(learning_rate))(input_layer)
encoder = Dense(hidden_dim, activation="relu")(encoder)
decoder = Dense(hidden_dim, activation='tanh')(encoder)
decoder = Dense(input_dim, activation='relu')(decoder)
autoencoder = Model(inputs=input_layer, outputs=decoder)
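To get a feel for the size of this architecture, the sketch below counts the trainable parameters of the four dense layers by hand. The value of input_dim is a hypothetical placeholder, since in the post it is derived from the data; the helper `dense_params` is likewise an assumption for illustration.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

input_dim = 59  # hypothetical; the post derives it from the number of predictors
encoding_dim = 32
hidden_dim = encoding_dim // 2

# Layer widths in order: input -> 32 -> 16 -> 16 -> input.
layer_sizes = [input_dim, encoding_dim, hidden_dim, hidden_dim, input_dim]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)
```

With a Keras model in hand, the same figures can be compared against the output of autoencoder.summary().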

Training

We will train the model and save it in a file. Saving a trained model is a good practice for saving time for future analysis.

autoencoder.compile(metrics=['accuracy'], loss='mean_squared_error', optimizer='adam')
cp = ModelCheckpoint(filepath="autoencoder_classifier.h5", save_best_only=True, verbose=0)
tb = TensorBoard(log_dir='./logs', histogram_freq=0, write_graph=True, write_images=True)
history = autoencoder.fit(df_train_0_x_rescaled, df_train_0_x_rescaled,
                          epochs=nb_epoch, batch_size=batch_size, shuffle=True,
                          validation_data=(df_valid_0_x_rescaled, df_valid_0_x_rescaled),
                          verbose=1, callbacks=[cp, tb]).history
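With save_best_only=True, the checkpoint keeps the model from the epoch with the best monitored value, which is the validation loss by default. The sketch below shows how that epoch could be located in a history dictionary. The loss values are made up for illustration and are not results from this post.

```python
# Hypothetical values shaped like the dictionary returned by `.history`.
history = {
    "loss": [0.92, 0.71, 0.60, 0.55, 0.52],
    "val_loss": [0.95, 0.74, 0.66, 0.67, 0.69],
}

# Index of the epoch with the lowest validation loss (0-based).
val_loss = history["val_loss"]
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
best_val_loss = val_loss[best_epoch]
print(f"lowest validation loss {best_val_loss:.2f} at epoch {best_epoch + 1}")
```

In this made-up run the training loss keeps falling after epoch 3 while the validation loss rises, so the saved model would be the one from epoch 3.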

Classification

In the following, we show how we can use an Autoencoder reconstruction error for the rare-event classification.

As mentioned before, if the reconstruction error is high, we will classify it as a sheet-break. We will need to determine the threshold for this.

We will use the validation set to identify the threshold.

valid_x_predictions = autoencoder.predict(df_valid_x_rescaled)
mse = np.mean(np.power(df_valid_x_rescaled - valid_x_predictions, 2), axis=1)
error_df = pd.DataFrame({'Reconstruction_error': mse,'True_class': df_valid['y']})
precision_rt, recall_rt, threshold_rt = precision_recall_curve(error_df.True_class, error_df.Reconstruction_error)
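# precision_rt and recall_rt have one more element than threshold_rt; drop the last one to align them.
plt.plot(threshold_rt, precision_rt[:-1], label="Precision", linewidth=5)
plt.plot(threshold_rt, recall_rt[:-1], label="Recall", linewidth=5)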
plt.title('Precision and recall for different threshold values')
plt.xlabel('Threshold')
plt.ylabel('Precision/Recall')
plt.legend()
plt.show()
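The post picks the threshold by reading the plot. As one possible alternative, the sketch below scans candidate thresholds and keeps the one with the highest F1 score. Both the F1 criterion and the toy errors and labels are assumptions for illustration, not the author's procedure.

```python
import numpy as np

def precision_recall_at(errors, labels, threshold):
    """Precision and recall when rows with error > threshold are predicted as 1."""
    predicted = errors > threshold
    true_pos = np.sum(predicted & (labels == 1))
    precision = true_pos / predicted.sum() if predicted.sum() else 1.0
    recall = true_pos / np.sum(labels == 1)
    return precision, recall

# Hypothetical validation reconstruction errors and true labels.
errors = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.9, 1.1, 1.4])
labels = np.array([0, 0, 0, 1, 0, 0, 1, 1])

# Skip the largest error: using it as a threshold would predict no positives.
candidates = np.unique(errors)[:-1]
best_threshold, best_f1 = None, -1.0
for t in candidates:
    p, r = precision_recall_at(errors, labels, t)
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    if f1 > best_f1:
        best_threshold, best_f1 = float(t), float(f1)

print(best_threshold, round(best_f1, 3))
```

In practice the criterion should reflect the cost of a missed break relative to a false alarm, so a fixed target recall or false positive rate may be more appropriate than F1.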

Now, we will perform classification on the test data.

We should not estimate the classification threshold from the test data. It will result in overfitting.

test_x_predictions = autoencoder.predict(df_test_x_rescaled)
mse = np.mean(np.power(df_test_x_rescaled - test_x_predictions, 2), axis=1)
error_df_test = pd.DataFrame({'Reconstruction_error': mse,'True_class': df_test['y']})
error_df_test = error_df_test.reset_index()
threshold_fixed = 0.85
groups = error_df_test.groupby('True_class')
fig, ax = plt.subplots()
for name, group in groups:
    ax.plot(group.index, group.Reconstruction_error, marker='o', ms=3.5, linestyle='',
            label="Break" if name == 1 else "Normal")
ax.hlines(threshold_fixed, ax.get_xlim()[0], ax.get_xlim()[1], colors="r", zorder=100, label='Threshold')
ax.legend()
plt.title("Reconstruction error for different classes")
plt.ylabel("Reconstruction error")
plt.xlabel("Data point index")
plt.show();

Figure 4. Using threshold = 0.85 for classification. The orange and blue dots above the threshold line represent the True Positives and False Positives, respectively.

As Figure 4 shows, we have a good number of false positives. To get a better look, we can plot a confusion matrix.

# Use the test-set errors so that the matrix reflects the test predictions.
pred_y = [1 if e > threshold_fixed else 0 for e in error_df_test.Reconstruction_error.values]
conf_matrix = confusion_matrix(error_df_test.True_class, pred_y)
plt.figure(figsize=(12, 12))
sns.heatmap(conf_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d");
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()

Figure 5. Confusion Matrix on the test predictions.
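We could predict 9 out of 32 break instances. Note that these instances include predictions made 2 or 4 minutes ahead. This is around 28%, which is a good recall rate for the paper industry. The False Positive Rate is around 6.3%. This is not ideal but not terrible for a mill.

Still, this model can be further improved to increase the recall rate with a smaller False Positive Rate. We will look at the AUC below and then talk about the next approach for improvement.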
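For reference, the two rates quoted above come from simple ratios of confusion-matrix counts. In the sketch below, the recall uses the 9-of-32 figure from the text, while the false positive and true negative counts are hypothetical values chosen only to illustrate the formula.

```python
def recall(true_pos, false_neg):
    """Share of actual positives that were predicted positive."""
    return true_pos / (true_pos + false_neg)

def false_positive_rate(false_pos, true_neg):
    """Share of actual negatives that were predicted positive."""
    return false_pos / (false_pos + true_neg)

# From the text: 9 of the 32 break instances were caught.
break_recall = recall(true_pos=9, false_neg=32 - 9)

# Hypothetical counts, chosen only to illustrate the formula.
example_fpr = false_positive_rate(false_pos=63, true_neg=937)
print(f"recall = {break_recall:.1%}, example FPR = {example_fpr:.1%}")
```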

ROC curve and AUC

false_pos_rate, true_pos_rate, thresholds = roc_curve(error_df_test.True_class, error_df_test.Reconstruction_error)
roc_auc = auc(false_pos_rate, true_pos_rate)
plt.plot(false_pos_rate, true_pos_rate, linewidth=5, label='AUC = %0.3f'% roc_auc)
plt.plot([0,1],[0,1], linewidth=5)
plt.xlim([-0.01, 1])
plt.ylim([0, 1.01])
plt.legend(loc='lower right')
plt.title('Receiver operating characteristic curve (ROC)')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

The AUC is 0.624.
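One way to read an AUC value: it is the probability that a randomly chosen break has a higher reconstruction error than a randomly chosen normal row. The sketch below computes the AUC directly from that definition. The scores and labels are toy values, and `pairwise_auc` is a hypothetical helper written for this illustration.

```python
import numpy as np

def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Hypothetical reconstruction errors and true labels.
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])
labels = np.array([0, 0, 1, 1, 0, 0])
example_auc = pairwise_auc(scores, labels)
print(example_auc)
```

Under this reading, an AUC of 0.5 means the errors carry no ranking information, and 1.0 means every break has a higher error than every normal row.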

Github repository

The entire code, with comments, is available at https://github.com/cran2367/autoencoder_classifier.

What can be done better here?

This is multivariate time series data, but the model above does not take the temporal information or patterns in the data into account. In the next post, we will explore whether that is possible with an RNN. We will try an LSTM autoencoder.

Conclusion

We worked on extreme rare-event, binary labeled data from a paper mill to build an Autoencoder classifier. We achieved a recall of about 28% at a false positive rate of about 6.3%, with an AUC of 0.624. The purpose here was to demonstrate the use of a basic Autoencoder for rare-event classification. We will work further on developing other methods, including an LSTM Autoencoder that can extract the temporal features for better accuracy.

The next post on LSTM Autoencoder is here, LSTM Autoencoder for rare event classification.

References

[1] Ranjan, C., Mustonen, M., Paynabar, K., & Pourak, K. (2018). Dataset: Rare Event Classification in Multivariate Time Series. arXiv preprint arXiv:1809.10717.
[2] https://www.datascience.com/blog/fraud-detection-with-tensorflow
GitHub repo: https://github.com/cran2367/autoencoder_classifier