
Abstract. An Intrusion Detection System (IDS), the second line of defense, plays a major role in safeguarding network infrastructure from the threats posed by “black hat” attackers. The ever-advancing nature of cyber-attacks makes the design and development of an efficient IDS a complex task. Hence, this paper presents an intelligent IDS based on a Feed Forward Neural Network (FFNN) and a Genetic Algorithm (GA) for parameter optimization and for the classification of malicious and normal data. GA-FFNN was evaluated on the NSL-KDD dataset, and its performance was validated with classification accuracy, detection rate, and false alarm rate.

Keywords: Genetic Algorithm, Feed Forward Neural Network, Parameter optimization, IDS


1 Introduction


With the growth of the Internet and computer networks, digital transformation has led to the massive generation of sensitive information that can be compromised when intrusions or vulnerabilities occur in the network [1]. Recent security incidents such as the Nexus repository breach [2], ransomware [3], WannaCry [4], the password leak at Yahoo [5], and the data theft at Adobe [6] underline the importance of protecting sensitive information against intruders. Earlier, traditional security measures like antivirus, access control, and firewalls were used to protect networks from various threats. However, these mechanisms have become obsolete due to the dynamic nature of intrusions, which has motivated researchers to develop a more robust security mechanism, the Intrusion Detection System (IDS), to fight the ever-advancing intrusions [7]. According to NIST, “Intrusion detection is defined as an automated process which identifies any suspicious activities that compromise the Confidentiality, Integrity, and Availability (CIA) of the computer or network resources”. Based on the detection methodology, IDS are classified into two types: (i) misuse detection and (ii) anomaly detection. The former detects intrusions based on predefined patterns and produces fewer false positives, but it fails to identify new anomalies. The latter identifies both known and unknown attacks, but its false-positive rate is high [8]. Several researchers prefer misuse detection over anomaly detection to achieve high classification accuracy.

In general, intrusion detection is treated as a classification problem that discriminates between “normal” and “malicious” data [9]. This has led researchers to combine machine learning algorithms such as Artificial Neural Network (ANN), K-Nearest Neighbor, and Random Forest with IDS to achieve better classification accuracy and detection rate [10]. Among these, ANN has been significant in designing an intelligent IDS because it can handle imbalanced or incomplete datasets. The major problem with existing ANN-based IDS is that the ANN architecture becomes unstable on high-dimensional datasets, and training may get trapped in local minima [11]. To overcome this challenge, the GA-FFNN IDS is proposed, in which the hyperparameters of the FFNN (learning rate, number of hidden units, dropout, and penalty) are optimized using a genetic algorithm to improve the stability of the ANN-based IDS. The major contributions of this work are:

1. The proposed GA-FFNN was designed to classify normal and malicious data.


2. The hyperparameters of FFNN were optimized with GA to avoid premature convergence.

3. The effectiveness of the proposed algorithm was evaluated on the benchmark IDS dataset, NSL-KDD, and its performance was validated with accuracy and detection rate.


2 Materials and Methods:


2.1 Genetic Algorithm:


The Genetic Algorithm is an adaptive, meta-heuristic optimization approach inspired by Darwin’s theory of evolution, in which stronger individuals are selected in a competitive environment, as in a biological process [19]. GA treats a potential solution to a problem as an individual chromosome that can be expressed as a set of parameters. Because it searches over a large sample space, GA is well placed to approach the global optimum, although, as a stochastic heuristic, it does not strictly guarantee it. The working of the Genetic Algorithm is described in Algorithm 1.




Step 1: Begin the algorithm by initializing a random population
Step 2: At each step, GA uses the current individuals to generate the next population
Step 3: Compute the fitness value
Step 4: Select the best individuals in the current population
Step 5: Apply crossover and mutation operations
Step 6: Replace the current population by crossover to create the next generation
Step 7: Terminate the algorithm when the stopping criterion is satisfied.
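The steps of Algorithm 1 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: tournament selection, uniform crossover, random-reset mutation, elitism, and a fixed number of generations as the stopping criterion are choices made here, not details given in the paper, and `run_ga` and `toy_fitness` are hypothetical names.

```python
import random


def run_ga(fitness, bounds, pop_size=20, generations=40,
           crossover_rate=0.9, mutation_rate=0.1, seed=0):
    """Maximise `fitness` over real-valued chromosomes kept inside `bounds`."""
    rng = random.Random(seed)

    def random_individual():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def tournament(scored, k=3):
        # Step 4: pick the fittest of k randomly chosen individuals.
        return max(rng.sample(scored, k), key=lambda pair: pair[0])[1]

    # Step 1: random initial population.
    population = [random_individual() for _ in range(pop_size)]
    best_score, best = float("-inf"), None

    # Steps 2 and 7: iterate until the generation budget is used up.
    for _ in range(generations):
        # Step 3: fitness of every individual.
        scored = [(fitness(ind), ind) for ind in population]
        gen_best_score, gen_best = max(scored, key=lambda pair: pair[0])
        if gen_best_score > best_score:
            best_score, best = gen_best_score, list(gen_best)

        # Steps 5 and 6: build the next generation (the best individual survives).
        next_population = [list(gen_best)]
        while len(next_population) < pop_size:
            p1, p2 = tournament(scored), tournament(scored)
            child = list(p1)
            if rng.random() < crossover_rate:
                # Uniform crossover: each gene comes from either parent.
                child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    # Mutation: resample the gene inside its bounds.
                    child[i] = rng.uniform(lo, hi)
            next_population.append(child)
        population = next_population

    return best, best_score


def toy_fitness(x):
    # Toy objective with its maximum (0) at x = (1, -2).
    return -((x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2)


best, score = run_ga(toy_fitness, bounds=[(-5, 5), (-5, 5)])
print(best, score)
```

The toy objective only demonstrates the loop; in the IDS setting the fitness would be the classification accuracy of the network.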

2.2 Feed Forward Neural Network:


FFNN is a deep learning model often called a Multilayer Perceptron (MLP). The FFNN architecture comprises an input layer, one or more hidden layers, and an output layer (Figure 1). Each neuron in one layer is connected to all the neurons of the next layer, so it is a fully connected network, and it learns through supervised algorithms. FFNN uses the ReLU (Rectified Linear Unit) activation function in the hidden layers [20].

The net function is given by Eqn. (1) and the ReLU activation by Eqn. (2).
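For reference, the net function and the ReLU activation are commonly written as follows. The symbols x_i (inputs), w_ij (weight from input i to neuron j), b_j (bias), and n (number of inputs) are assumed notation for illustration, not taken from the paper:

```latex
\begin{align}
net_j &= \sum_{i=1}^{n} w_{ij}\, x_i + b_j \tag{1} \\
f(net_j) &= \max\left(0,\; net_j\right) \tag{2}
\end{align}
```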


Fig. 1. Architecture of the Feed Forward Neural Network


Working of FFNN:


Step 1: Initialize the input as the number of samples and number of features in the dataset, and the output as the decision class
Step 2: Initialize the number of features in the input layer and compute the net function using Eqn. (1)
Step 3: Initialize epoch=100 and error >=0.01
Step 4: Use ReLU as the activation function for the neurons of the hidden layer (Eqn. 2)
Step 5: Compute the error.
Step 6: If the error is greater than or equal to 0.01, update the weights of the network and repeat the iteration (i.e. epoch=epoch+1)
Step 7: Else return the decision class.
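The working of the FFNN can be sketched with NumPy as below. This is a minimal sketch under assumptions not stated in the paper: a single hidden layer, a sigmoid output unit, mean squared error as the error measure, and plain gradient descent for the weight update. The function `train_ffnn` and the toy data are hypothetical.

```python
import numpy as np


def relu(z):
    """ReLU activation: max(0, z) applied element-wise."""
    return np.maximum(0.0, z)


def train_ffnn(X, y, hidden=8, lr=0.1, max_epochs=100, tol=0.01, seed=0):
    """Train a one-hidden-layer network; stop after max_epochs or when error < tol."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, size=(hidden, 1))
    b2 = np.zeros(1)
    target = np.asarray(y, dtype=float).reshape(-1, 1)

    def forward(inputs):
        net1 = inputs @ W1 + b1          # net function of the hidden layer
        h = relu(net1)                   # ReLU activation
        net2 = h @ W2 + b2               # net function of the output unit
        return net1, h, 1.0 / (1.0 + np.exp(-net2))  # sigmoid output (assumed)

    errors = []
    for _ in range(max_epochs):
        net1, h, p = forward(X)
        error = float(np.mean((p - target) ** 2))    # mean squared error (assumed)
        errors.append(error)
        if error < tol:
            break
        # Backpropagate the error and update the weights.
        d_net2 = 2.0 * (p - target) * p * (1.0 - p) / len(X)
        d_net1 = (d_net2 @ W2.T) * (net1 > 0)
        W2 -= lr * (h.T @ d_net2)
        b2 -= lr * d_net2.sum(axis=0)
        W1 -= lr * (X.T @ d_net1)
        b1 -= lr * d_net1.sum(axis=0)

    def predict(inputs):
        # Decision class: 1 if the output is at least 0.5, else 0.
        return (forward(inputs)[2] >= 0.5).astype(int).ravel()

    return predict, errors


# Toy two-class data standing in for "normal" (0) and "malicious" (1) records.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(-2, 1, (50, 2)), data_rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

predict, errors = train_ffnn(X, y)
print(len(errors), errors[-1], (predict(X) == y).mean())
```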
2.3 Proposed Methodology:


Step 1: Initialize the number of features as input and optimize the number of hidden neurons, learning rate (l), momentum (m), dropout (d), number of epochs, and batch size

Step 3: Optimize l, m, and d using Algorithm 1


Step 4: Compute the error using Eqn. (3)

Step 5: Calculate fitness=accuracy(best)


Step 6: Terminate when the optimal parameters are obtained or the maximum number of iterations is reached.


Step 7: Based on the best fitness value, update the position of the population.
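A rough sketch of how these steps might fit together is shown below. The search ranges in `SEARCH_SPACE`, the truncation selection, and all function names are assumptions made for illustration. In particular, `placeholder_accuracy` is only a stand-in for "train the FFNN with these settings and return its accuracy", so its values are not experimental results.

```python
import random

# Illustrative search ranges (assumed, not taken from the paper).
SEARCH_SPACE = {
    "learning_rate": (1e-4, 1e-1),   # l
    "momentum": (0.0, 0.99),         # m
    "dropout": (0.0, 0.5),           # d
}


def random_chromosome(rng):
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    # Each hyperparameter is inherited from one of the two parents.
    return {name: (a[name] if rng.random() < 0.5 else b[name]) for name in SEARCH_SPACE}


def mutate(chromosome, rng, rate=0.2):
    out = dict(chromosome)
    for name, (lo, hi) in SEARCH_SPACE.items():
        if rng.random() < rate:
            out[name] = rng.uniform(lo, hi)
    return out


def optimise(evaluate_accuracy, pop_size=10, max_iter=15, target=1.0, seed=0):
    """Search for hyperparameters whose fitness (accuracy) is highest."""
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(pop_size)]
    best, best_fit = None, -1.0
    for _ in range(max_iter):
        # Fitness = accuracy obtained with each candidate setting.
        scored = sorted(((evaluate_accuracy(c), c) for c in population),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_fit:
            best_fit, best = scored[0][0], dict(scored[0][1])
        if best_fit >= target:
            break  # optimal parameters obtained
        # Update the population from the fitter half, keeping the best so far.
        parents = [c for _, c in scored[: pop_size // 2]]
        population = [dict(best)] + [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - 1)
        ]
    return best, best_fit


def placeholder_accuracy(params):
    # Hypothetical surrogate; replace with real FFNN training and evaluation.
    penalty = (abs(params["learning_rate"] - 0.01) * 5
               + abs(params["momentum"] - 0.9) * 0.2
               + abs(params["dropout"] - 0.2) * 0.3)
    return max(0.0, 1.0 - penalty)


best_params, best_fitness = optimise(placeholder_accuracy)
print(best_params, best_fitness)
```

Passing the evaluation as a callable keeps the search loop independent of the network library, so the same loop could wrap any FFNN implementation.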


  1. Experimental Setup


To carry out the experiments of GA-FFNN, the NSL-KDD dataset was used. The GA-FFNN algorithm was implemented in Python 3.4 on an Intel® Core™ i5 processor @ 2.40 GHz with 8 GB RAM running the Windows 10 operating system. Further, the Weka tool was used for validation purposes. The entire set of experiments was divided into three phases: (i) data preprocessing, (ii) training and testing, and (iii) evaluating the performance of GA-FFNN based on classification accuracy.

2. Data Preprocessing:


NSL-KDD dataset:


Tavallaee et al. proposed NSL-KDD, an improved version of the KDD ’99 dataset, to remove uncertainties in KDD-CUP [21]. Unlike the KDD ’99 dataset, it has no duplicate records in the test and train sets. This dataset consists of approximately 1,074,992 single connection vectors, each of which contains a total of 41 features, including basic features, content-related features, time-related traffic features, and host-based traffic features. The attribute value types are nominal, binary, and numeric. Each connection vector can be categorized as either an attack or a normal type. Attack types are classified as DoS, U2R, R2L, and Probe. Data mapping and data normalization techniques were carried out as in our previous works [10].

3. Training and testing:


Subsequently, the entire dataset was partitioned into 80% for training (TrainNSL) and 20% for testing (TestNSL).
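An 80/20 partition of this kind could look like the sketch below. A simple shuffled split with a fixed seed is assumed here, since the paper does not say how the records were assigned; `split_80_20` and the toy arrays are hypothetical.

```python
import numpy as np


def split_80_20(X, y, test_fraction=0.2, seed=42):
    """Shuffle the row indices once, then cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Toy stand-in for the preprocessed feature matrix and labels.
X_all = np.arange(200).reshape(100, 2)
y_all = np.arange(100) % 2

X_tr, X_te, y_tr, y_te = split_80_20(X_all, y_all)
print(X_tr.shape, X_te.shape)
```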


4. Evaluate the performance of GA-FFNN based on classification accuracy:


The proposed GA-FFNN was designed to classify whether an incoming network traffic pattern is malicious or normal. It was evaluated and validated with the following metrics: classification accuracy, detection rate, and false alarm rate. The GA-FFNN architecture was designed with one input layer, two hidden layers, and an output layer. The “Adam” optimizer was used to optimize the hidden layers. Figure 3 visualizes the classification accuracy of the proposed GA-FFNN, which outperforms existing classifiers such as random forest, BayesNet, K-star, and BFFO-CNN. Table 2 compares the detection rate and false alarm rate of the different classifiers, where the proposed approach performs better than the existing approaches.
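The three metrics can be computed from the confusion-matrix counts as in the sketch below. The usual definitions are assumed (detection rate = TP / (TP + FN), false alarm rate = FP / (FP + TN)), with label 1 for malicious and 0 for normal; `ids_metrics` and the label lists are hypothetical examples, not results from the paper.

```python
def ids_metrics(y_true, y_pred):
    """Return accuracy, detection rate and false alarm rate for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # attacks missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal kept as normal
    return {
        "accuracy": (tp + tn) / len(pairs),
        "detection_rate": tp / (tp + fn),
        "false_alarm_rate": fp / (fp + tn),
    }


# Hypothetical labels: 3 attacks caught, 1 missed, 1 false alarm, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

metrics = ids_metrics(y_true, y_pred)
print(metrics)
# {'accuracy': 0.75, 'detection_rate': 0.75, 'false_alarm_rate': 0.25}
```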

Fig. 3. Classification Accuracy
Table 2. Performance Evaluation- Detection Rate and False Alarm Rate

Conclusions

This paper has presented a Genetic Algorithm based Feed Forward Neural Network, both for the parameter optimization of FFNN and for separating malicious samples from normal samples. The NSL-KDD dataset was used to evaluate the proposed GA-FFNN, and the results were validated with classification accuracy, detection rate, and false alarm rate. In the experiments, the proposed classification approach, GA-FFNN, provided better accuracy than the existing approaches. This work can be further extended to feature selection by varying the genetic operations used to optimize the parameters of FFNN.

Python package import snippet


    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    import matplotlib.colors
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, mean_squared_error
    from tqdm import tqdm_notebook
    from keras import regularizers  # to avoid overfitting
    from sklearn.preprocessing import OneHotEncoder  # encodes categories as 0/1 columns
    from sklearn.datasets import make_blobs
    from sklearn import preprocessing
Inside NSL KDD Dataset
KDD-Train and Test


    # The files have no header row, so columns are indexed 0, 1, 2, ...
    df_train = pd.read_csv('KDDTrain+.txt', header=None, index_col=None)
    df_test = pd.read_csv('KDDTest+.txt', header=None, index_col=None)
    df_train.head()

Each file has 43 columns (indices 0 to 42): the 41 features, the class label in column 41, and a difficulty score in column 42. We can remove the last column since we won’t be needing it.

    df_train.drop(42, axis=1, inplace=True)
    df_test.drop(42, axis=1, inplace=True)

    # Label every attack type as 1 and normal traffic as 0.
    df_train.loc[df_train[41] != 'normal', 41] = 1
    df_test.loc[df_test[41] != 'normal', 41] = 1
    df_train.loc[df_train[41] == 'normal', 41] = 0
    df_test.loc[df_test[41] == 'normal', 41] = 0
    df_train.groupby(41).count()

    # Separate the features from the label and check the resulting shapes.
    X_train = df_train.drop(41, axis=1)
    y_train = df_train.loc[:, [41]]
    X_test = df_test.drop(41, axis=1)
    y_test = df_test.loc[:, [41]]
    print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

We can apply one-hot encoding (categorical encoding) to the data to convert the nominal, string-valued features into numeric form.

    le = preprocessing.LabelEncoder()
    enc = OneHotEncoder()
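To show what one-hot encoding produces, here is a small self-contained sketch that uses pandas `get_dummies` in place of the encoder objects created above. The toy rows are made up for illustration, and treating columns 1, 2, and 3 as the nominal ones (protocol, service, flag) is an assumption about the column layout.

```python
import pandas as pd

# Hypothetical mini-sample: column 0 is numeric, columns 1-3 hold string categories.
toy = pd.DataFrame({
    0: [0, 2, 0, 5],
    1: ["tcp", "udp", "tcp", "icmp"],
    2: ["http", "domain_u", "ftp", "eco_i"],
    3: ["SF", "SF", "S0", "SF"],
})

# Each distinct category becomes its own 0/1 column named "<column>_<value>".
encoded = pd.get_dummies(toy, columns=[1, 2, 3], dtype=int)

print(encoded.shape)
print(list(encoded.columns))
```

When the train and test files are encoded separately, the test frame may be missing some categories, so its columns would need to be aligned with the training columns (for example with `reindex(columns=..., fill_value=0)`) before prediction.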

Guys, a complete version of the code can be found here.


I hope you guys understood it!!


