This blog post details the creation of an intrusion detection system (IDS) using decision trees and neural networks. The project was split into two parts. In part I, group 4 (Derek Chan and Olivia Gallucci, advised by Dr. Leon Reznik) prepared and preprocessed network traffic data and developed an IDS framework that operates on misuse and anomaly detection principles. Using decision trees and baseline metrics to set thresholds for identifying unusual network activity, the group’s IDS achieved accuracy rates of 98.9% for misuse detection and 99.1% for anomaly detection. In part II, group 4 designed, trained, and implemented methods that apply a neural network to misuse- and anomaly-based detection; here, the group’s IDS achieved accuracy rates of 94.4% for misuse detection and 99.0% for anomaly detection (view the project on GitHub).
🌸👋🏻 Join 10,000+ followers! Let’s take this to your inbox. You’ll receive occasional emails about whatever’s on my mind—offensive security, open source, academics, boats, software freedom, you get the idea.
IDS specification
Part I
This research aims to create a method for distinguishing between regular network connections and potential attacks within the data from the Canadian Institute for Cybersecurity (CIC) IDS 2017 Dataset. While the techniques described below could be generalized for application in an Intrusion Detection and Prevention System (IDPS), the primary focus in this report will be their application to the provided dataset.
The group’s approach to building an anomaly-based IDS involves utilizing the training data to establish a baseline for normal network traffic. To identify whether a network connection is anomalous, the program assesses the connection’s deviation from this calculated normal standard. The misuse-based IDS relies on a decision tree to assess the attributes present in the data, determining whether a connection is anomalous based on an inspection of the data itself.
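The report does not pin down the exact deviation measure, so as one illustrative way to formalize “deviation from the calculated normal standard,” a per-feature z-score baseline could look like the sketch below (the function names and the threshold of 3.0 are assumptions, not the group’s actual code):

```python
import numpy as np

def fit_baseline(benign):
    """Compute per-feature mean and std from benign training traffic."""
    benign = np.asarray(benign, dtype=float)
    mu = benign.mean(axis=0)
    sigma = benign.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid division by zero for constant features
    return mu, sigma

def anomaly_score(connection, mu, sigma):
    """Mean absolute z-score: how far a connection deviates from the baseline."""
    x = np.asarray(connection, dtype=float)
    return float(np.mean(np.abs((x - mu) / sigma)))

def is_anomalous(connection, mu, sigma, threshold=3.0):
    """Flag a connection whose deviation exceeds the chosen threshold."""
    return anomaly_score(connection, mu, sigma) > threshold
```

A connection near the baseline scores close to zero, while one far from every feature mean scores high and is flagged.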
To assess the effectiveness of the filtering, the program measures its accuracy and misclassification rates. Here, the research aims to maximize the true positive rate while minimizing the false negative rate.
Part II
The group trained a neural network using the CIC IDS 2017 Dataset. First, the group developed a misuse detection system that identifies five unique attack types independently: unlike part I, a model was generated for each attack type, resulting in five models, each specialized in detecting its respective attack. Then, the group used the neural network to detect multiple attack types with a single training set. Lastly, the group built an anomaly detection system to recognize benign and anomalous traffic patterns, with the goal of training it to recognize novel malicious traffic.
Methods and techniques
Data preparation
Design
The program’s design follows several key principles in its data preparation phase. First, it focuses on data quality by addressing inconsistent column names, duplicate columns, and missing values. It also performs data normalization, applying min-max scaling so that all features share the same scale. Additionally, it filters out two specific attack types, ‘Infiltration’ and ‘Heartbleed,’ which might lack sufficient data for effective classification; these attacks are provided as unseen data to the IDS and help gauge the IDS’s effectiveness at determining the threat of malicious attacks. This selective approach helps maintain the dataset’s integrity while improving the model’s accuracy.
Operation
The program’s operation begins by reading raw data from CSV files containing network traffic information from the CIC IDS 2017 Dataset. The program normalizes the data through min-max scaling, ensuring that all feature values fall within the same numerical range. Additionally, it preprocesses the data by removing leading spaces in column names, handling duplicate columns, filtering out specific attack types (Infiltration and Heartbleed) that did not have enough data, and replacing missing values with zeros. The labels are adjusted based on the provided ‘idsType,’ either treating all attacks as abnormal behavior or categorizing attacks into distinct labels. The resulting preprocessed data is then separated into training and testing datasets, ensuring a balanced representation of benign and attack data in the training set. Finally, the data is converted into the ARFF format, suitable for consumption by Weka.
Misuse IDS
Design
For the misuse IDS, the program creates five ‘seen’ attack labels and a ‘BENIGN’ label containing normal packet data and unseen attack packet data. Then, it groups each data point under the correct label.
For example, all the attacks originally labeled ‘Web Attack – Brute Force,’ ‘Web Attack – XSS,’ or ‘Web Attack – Sql Injection’ are grouped into the ‘WebAttack’ label. This approach is typical for misuse IDS, where the focus is on matching new network traffic to known attack datasets. By simplifying attack labels and distinguishing them from benign data, the program ensures that the resulting dataset is suitable for supervised machine learning, enabling the detection of well-defined attack patterns and better prediction of novel threats.
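This grouping can be sketched as a simple mapping applied to the label column (the mapping dictionary and the ‘Label’ column name are illustrative; the raw CIC IDS 2017 label strings differ slightly in punctuation):

```python
import pandas as pd

# Hypothetical mapping from raw dataset labels to grouped misuse labels
LABEL_GROUPS = {
    "Web Attack - Brute Force": "WebAttack",
    "Web Attack - XSS": "WebAttack",
    "Web Attack - Sql Injection": "WebAttack",
    "DoS Hulk": "DoSHulk",
    "DDoS": "DDoS",
}

def group_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse raw attack labels into grouped labels; leave others as-is."""
    df = df.copy()
    df["Label"] = df["Label"].apply(lambda label: LABEL_GROUPS.get(label, label))
    return df
```

Labels not in the mapping (such as ‘BENIGN’) pass through unchanged.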
Operation
The program identifies attack patterns by categorizing network traffic into predefined attack types. Labels for known attacks are adjusted to be uniform, making them easily distinguishable from benign traffic. This ensures that the machine learning model is trained to recognize and classify known attacks accurately. During operation, the trained model detects deviations from these predefined patterns (including those previously filtered out, which the group used as ‘unseen’ data), flagging any network traffic that matches the known attack signatures as malicious.
Anomaly IDS
Design
Regarding anomaly IDS, the program treats all attacks as abnormal behavior, categorizing them as ‘abnormal,’ while benign activities are labeled BENIGN. This design principle reflects the nature of anomaly IDS, which focuses on identifying deviations from expected, normal behavior. By consolidating various attack types into a single ‘anomaly’ category, the code simplifies the detection task, emphasizing the identification of unusual patterns or outliers in the data. This approach is beneficial for uncovering novel and previously unseen attacks that may not be well-defined in advance.
Operation
Here, the program considers any network behavior that deviates from established baselines as potentially malicious. In the preprocessing phase, it normalizes the data from a variety of attack types and handles missing values. This ensures that the model is exposed to a broader range of network behaviors, both benign and potentially anomalous. During operation, the anomaly-based IDS identifies unusual patterns that have not been seen during training, and flags them as potential threats. This design principle is valuable for detecting previously unknown threats since it does not rely on predefined attack patterns.
Neural network
The group used a multilayer perceptron (MLP) in Weka, which provides tools for training and using MLPs, to build a machine learning IDS.

An MLP is an artificial neural network used for supervised machine learning tasks. Specifically, an MLP is a feedforward neural network with multiple layers of nodes (neurons). It consists of an input layer, one or more hidden layers, and an output layer. Each node in one layer is connected to every node in the subsequent layer. These connections have associated weights, and the network learns to adjust these weights during training to make predictions based on input data.
Depending on the test and objective, the group’s MLP used various numbers of hidden layers and nodes per layer. Here is one of the configurations used within Weka:

```
weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a
```
Hyperparameters and options
The hyperparameters and options work as follows:
- `-L 0.3`: Sets the learning rate to 0.3. The learning rate controls the step size during the weight update process when training the neural network; it affects the convergence and stability of training.
- `-M 0.2`: Sets the momentum to 0.2. Momentum adds a fraction of the previous weight update to the current weight update, which can improve the convergence of training.
- `-N 500`: Sets the number of training epochs to 500. An epoch is one complete pass through the training data. Training for a fixed number of epochs can help prevent overfitting, as the model stops learning after a certain point.
- `-V 0`: Sets the validation set size to 0. The validation set is used for early stopping, which can prevent overfitting.
- `-S 0`: Sets the seed for the random number generator to 0. A fixed seed makes the training process reproducible, since the same seed produces the same random weight initialization and data shuffling.
- `-E 20`: Sets the number of consecutive validation errors allowed before the network terminates.
- `-H a`: Determines the number of neurons in the hidden layers. The `a` wildcard sets the number of neurons based on the number of attributes and the number of classes. The group embedded this information into each ARFF file.
In short, the command configures the MultilayerPerceptron classifier with a specific learning rate, momentum, number of training epochs, and a hidden layer architecture determined automatically by the algorithm. It uses a seed of 0 for reproducibility and no separate validation set. The algorithm determines the architecture of the hidden layers based on each dataset the group used.
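For reference, Weka’s `a` wildcard for `-H` sets the hidden layer size to (attributes + classes) / 2. A quick sketch of that rule (the 78-feature count below is an approximation for CIC IDS 2017 flow data, not an exact figure from the group’s ARFF files):

```python
def hidden_nodes_a(num_attributes: int, num_classes: int) -> int:
    """Weka's 'a' wildcard for -H: (attributes + classes) / 2, integer division."""
    return (num_attributes + num_classes) // 2

# ~78 flow features; 2 classes for anomaly, 6 for misuse (5 attacks + BENIGN)
anomaly_nodes = hidden_nodes_a(78, 2)
misuse_nodes = hidden_nodes_a(78, 6)
```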
Parameter summary
| Aspect | Effect | Impact on FP and FN rates |
|---|---|---|
| Number of Neurons in Hidden Layers | Increasing improves capacity but may cause overfitting | May reduce false negatives but increase false positives (overfitting risk) |
| Learning Rate | Determines step size in weight updates; affects convergence | Too high may lead to overshooting, increasing false positives/negatives; too low may cause slow convergence |
| Number of Hidden Layers | Deeper architectures capture hierarchical features; risk of overfitting | May reduce false negatives but increase false positives (overfitting risk) |
| Activation Functions | Choice influences information flow; ReLU mitigates vanishing gradients | Can influence the model’s ability to capture complex patterns |
| Regularization Techniques (e.g., Dropout) | Prevents overfitting by dropping out neurons | Can reduce overfitting, potentially improving generalization and reducing false positives |
| Batch Size | Determines training examples in each iteration; affects convergence | Small batches may introduce noise, affecting convergence and potentially increasing false positives/negatives; larger batches may provide stable updates but slower convergence |
| Weight Initialization | Proper initialization aids convergence and avoids vanishing/exploding gradients | Well-initialized weights contribute to stable training, potentially reducing the risk of false positives/negatives |
Table: MLP aspects and their corresponding effects on FP and FN rates.
Impact of MLP parameters on false positive and false negative ratios
Several parameters warrant consideration, each with distinct effects on the MLP’s results and performance. Firstly, the number of neurons in hidden layers affects the MLP. Increasing this number enhances the model’s capacity to discern complex patterns, potentially reducing false negatives. However, a concomitant risk of overfitting exists, which might elevate false positives.
Another critical parameter is the learning rate, which dictates the step size during weight updates. A learning rate that is too high may expedite convergence but lead to overshooting, subsequently increasing false positives and negatives. Conversely, a rate that is too low can impede convergence, adversely affecting overall model performance.
The number of hidden layers introduces an additional layer of complexity. Deeper architectures can capture hierarchical features, thereby reducing false negatives. However, the accompanying risks of vanishing or exploding gradients and of overfitting may elevate false positives.
Regularization techniques, exemplified by dropout, aim to prevent overfitting by randomly dropping neurons during training. Dropout can mitigate overfitting, improving generalization performance and potentially reducing false positives.
Batch size, another pivotal factor, determines the number of training examples per iteration. Small batch sizes may introduce noise, impacting convergence and potentially increasing false positives/negatives, while larger batch sizes offer stability but might result in slower convergence.
Finally, weight initialization plays a crucial role in the MLP’s outcomes. Well-initialized weights contribute to a stable training process and lead to more efficient convergence. Together, the interplay of these parameters highlights the need for a comprehensive understanding to optimize the MLP’s performance while minimizing false positive and false negative ratios.
Implementation
The group’s program also serves as a data preprocessing and conversion tool designed for machine learning tasks using the CIC IDS 2017 Dataset. Its structure comprises several functions organized for specific tasks, including data cleaning, normalization, and ARFF file generation. The program utilizes Python libraries such as pandas, numpy, sklearn, and concurrent.futures for data manipulation and parallel processing.
The program handles different aspects of the data preprocessing pipeline, from loading CSV files to generating ARFF files suitable for machine learning with Weka. It includes functions to normalize data, separate attacks from benign traffic, and efficiently process multiple files in parallel.
The first thing the program does is concurrently read in raw data from multiple CSV files using the pandas read_csv function, which parses the first row of each CSV as column names and the remaining rows as data corresponding to those columns. The result is a pandas DataFrame representing the data in the CSV.

The next step is the filtering and normalization of the data. Using the pandas rename function, the program removes extra whitespace from the column names. Then, using the pandas drop function, it removes duplicate and unnecessary data features. Using the pandas min and max functions, the program normalizes the data to values between 0 and 1 based on the minimum and maximum value of each column. Min-max normalization produces NaNs when a column’s minimum and maximum values are equal; to fix this, the program uses the pandas fillna function, which replaces any NaNs in the data with 0.
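A minimal sketch of this cleaning and scaling step, assuming the numeric feature columns of a pandas DataFrame (the function name is illustrative, not the group’s actual code):

```python
import pandas as pd

def normalize_min_max(df: pd.DataFrame) -> pd.DataFrame:
    """Strip column names and min-max scale numeric columns to [0, 1].

    Columns whose min equals their max produce NaN (0/0), which is
    replaced with 0, mirroring the fillna step described above.
    """
    df = df.copy()
    df.columns = df.columns.str.strip()              # drop stray whitespace
    num = df.select_dtypes(include="number")
    scaled = (num - num.min()) / (num.max() - num.min())
    df[num.columns] = scaled.fillna(0)               # NaN when min == max
    return df
```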
Data grouping
After the filtering and normalization of the data, the way the program groups the data is based on whether it is creating an anomaly dataset or a misuse dataset. In either case, the program employs the pandas apply function. The difference comes when the program marks all attacks as abnormal in the anomaly case or groups similar attacks together in the misuse case. After that, the program creates a dictionary that groups by type of attack label, including benign. This is done through the pandas groupby function, which automatically groups data points by different values given a column name.
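The groupby step can be sketched as follows (the ‘Label’ column name is an assumption about the cleaned dataset):

```python
import pandas as pd

def group_by_label(df: pd.DataFrame) -> dict:
    """Return a {label: DataFrame} dictionary, one entry per distinct Label value."""
    return {label: group for label, group in df.groupby("Label")}
```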
With this dictionary of grouped data points, the program utilizes sklearn’s train_test_split function to randomly select data points without replacement. The number of data points selected is provided by the user; in Group 4’s program, eight thousand benign data points and two hundred of each attack type were specified. The program also creates a testing set one quarter the size of the training set, giving an 80/20 ratio of training to testing data. The data points selected for training and testing are mutually exclusive.
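A hedged sketch of this sampling (sample_split is a hypothetical helper; the group’s program takes the quantities from command-line arguments):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def sample_split(group: pd.DataFrame, train_qty: int, seed: int = 0):
    """Draw train_qty training rows plus train_qty // 4 testing rows,
    mutually exclusive and without replacement, approximating the
    80/20 train/test ratio described above."""
    test_qty = train_qty // 4
    pool = group.sample(n=train_qty + test_qty, random_state=seed)
    return train_test_split(pool, test_size=test_qty, random_state=seed)
```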
ARFF generation
Finally, the program generates the testing and training ARFF files. First, the headers are written with the specifications provided by the University of Waikato. Then the data is shuffled using pandas sample function and written to a file using numpy’s savetxt function. Each line of comma-separated values represents an individual data point.
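A simplified sketch of the ARFF writer (write_arff is a hypothetical helper; a real header would follow the full University of Waikato ARFF specification for every attribute):

```python
import numpy as np
import pandas as pd

def write_arff(df: pd.DataFrame, path: str, relation: str = "cicids2017"):
    """Write a DataFrame of numeric features plus a nominal Label column to ARFF."""
    classes = ",".join(sorted(df["Label"].unique()))
    with open(path, "w") as f:
        f.write(f"@RELATION {relation}\n\n")
        for col in df.columns:
            if col == "Label":
                f.write(f"@ATTRIBUTE Label {{{classes}}}\n")
            else:
                f.write(f"@ATTRIBUTE {col} NUMERIC\n")
        f.write("\n@DATA\n")
        shuffled = df.sample(frac=1, random_state=0)   # shuffle rows
        np.savetxt(f, shuffled.to_numpy(dtype=str), fmt="%s", delimiter=",")
```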
The program lacks a user interface; it is intended to be run from the command line, taking command-line arguments for customization. Users need to specify parameters such as the type of intrusion detection system (anomaly or misuse), the maximum number of threads for parallel processing, the directory containing CSV files, and quantities for training data.
To run the script, navigate to the location of the script and run:
```
python3 preprocessing.py [--idsType IDSTYPE] [--attackType [ATTACKTYPE]] [--unprocessedDataPath UNPROCESSEDDATAPATH] [--benignTrainingQty BENIGNTRAININGQTY] [--eachAttackTrainingQty EACHATTACKTRAININGQTY] [--maxThreads MAXTHREADS]
```
The attackType parameter is optional, and multiple attacks can be specified by joining them with underscores (e.g., attack1_attack2_attack3). For example, the results for

```
python preprocessing.py -idst misuse -a Bot_DDoS_DoSHulk_WebAttack_PortScan -t 8 -dp ./data/unprocessedCSVs -bq 8000 -eaq 200
```

will be in data/processedCSVs, with the base directory the same as the directory of the script.
Limitations of the software include its command-line interface, which might not be user-friendly for non-technical users. It also assumes a basic understanding of the dataset structure and preprocessing requirements. Additionally, it does not provide extensive error handling for various edge cases.
Software and hardware requirements include a Python environment with the required libraries installed, and access to the CIC IDS 2017 Dataset in CSV format.
Weka user guide
Data analysis
Additional module in part I
- Loading data
  - Click on the “Explorer” button in the Weka GUI Chooser.
  - In the “Preprocess” tab, click the “Open file” button to load your dataset. Weka supports various data formats, including ARFF, CSV, and more.
- Data preprocessing
  - After loading your data, you can preprocess it using the “Preprocess” tab.
  - Explore options for data cleaning, transformation, and attribute selection.
- Building a machine learning model
  - Go to the “Classify” tab to build a machine learning model.
  - Select a classification algorithm from the left panel (e.g., J48, Random Forest, Naive Bayes).
  - Configure algorithm parameters and choose the target class attribute.
  - Click the “Start” button to build the model.
- Model evaluation
  - After the model is built, evaluate its performance on your dataset.
  - Go to the “Classify” tab and use the “Supplied test set” option to load your test data (if available).
  - Click the “Start” button to evaluate the model’s accuracy and other metrics.
Artificial neural networks
Additional module in part II.
- Loading data
  - Click on the “Explorer” button in the Weka GUI Chooser.
  - In the “Preprocess” tab, click the “Open file” button to load your dataset. Weka supports various data formats, including ARFF, CSV, and more.
- Attribute selection
  - After loading your data, you can preprocess it using the “Preprocess” tab.
  - Explore options for data cleaning, transformation, and attribute selection. If your dataset contains many attributes, you may want to perform attribute selection or dimensionality reduction before training your multilayer perceptron.
- Select, train, and test the multilayer perceptron algorithm and model
  - In the Weka Explorer, go to the “Classify” tab. (You may also use the classifier from Java code if you are using Weka programmatically.)
  - Select the MultilayerPerceptron classifier for training a multilayer perceptron.
  - Configure the hyperparameters, such as the number of hidden layers, the number of nodes in each hidden layer, the learning rate, and the momentum.
  - Click the “Supplied test set” button and select your testing data.
  - Click the “Start” button to build, train, and test the model.
Tests
The group used the CIC IDS 2017 Dataset to test the IDS systems. The threshold for the anomaly-based detection system was determined through trial and error using an ROC curve, which plots the true positive rate against the false positive rate.

The group generated a receiver operating characteristic curve for the anomaly detection system to determine the optimal threshold. Lastly, the group used a separate dataset to validate and test the MLP neural network: it first trained models to distinguish a specific attack from other traffic and computed a confusion matrix, then repeated this procedure for each of the five attacks.

For every step of testing, the group cross-checked the program’s results against a test set of data unseen by the classifier. The sum of false negatives and true positives matched the number of attacks in the test set; in other words, the program’s results matched the expected results.
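The ROC-based threshold search can be formalized with scikit-learn. The report describes choosing the threshold by inspecting the curve; Youden’s J statistic (TPR minus FPR) is one common way to make that choice automatic, used here as an illustrative assumption:

```python
import numpy as np
from sklearn.metrics import roc_curve

def best_threshold(y_true, scores):
    """Pick the anomaly-score threshold maximizing Youden's J = TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    return thresholds[np.argmax(tpr - fpr)]
```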
DDoS
| ( n = 2050 ) | Predicted Normal | Predicted Anomalous |
|---|---|---|
| Actually Normal | 1995 | 5 |
| Actually Anomalous | 1 | 49 |
- Model Accuracy: 99.71%
- Average Training Time: 250.24 seconds
- Memory Footprint: 2.22 GB
DoSHulk
| ( n = 2050 ) | Predicted Normal | Predicted Anomalous |
|---|---|---|
| Actually Normal | 2000 | 0 |
| Actually Anomalous | 22 | 28 |
- Model Accuracy: 98.93%
- Average Training Time: 248.57 seconds
- Memory Footprint: 2.34 GB
Web Attack
| ( n = 2050 ) | Predicted Normal | Predicted Anomalous |
|---|---|---|
| Actually Normal | 1997 | 3 |
| Actually Anomalous | 10 | 40 |
- Model Accuracy: 99.37%
- Average Training Time: 245.68 seconds
- Memory Footprint: 2.15 GB
FTP-Patator
| ( n = 2050 ) | Predicted Normal | Predicted Anomalous |
|---|---|---|
| Actually Normal | 1983 | 17 |
| Actually Anomalous | 1 | 49 |
- Model Accuracy: 99.12%
- Average Training Time: 251.13 seconds
- Memory Footprint: 2.21 GB
Port Scan
| ( n = 2050 ) | Predicted Normal | Predicted Anomalous |
|---|---|---|
| Actually Normal | 1974 | 26 |
| Actually Anomalous | 0 | 50 |
- Model Accuracy: 98.73%
- Average Training Time: 248.83 seconds
- Memory Footprint: 2.24 GB
Parameter effects using group 4’s anomalous dataset
The group also experimented with changing the layers within the MLP neural network.
| Layers | False Negative Rate | False Positive Rate |
|---|---|---|
| 20 | 0.500 | 0.000 |
| 50 | 0.380 | 0.001 |
| 100 | 0.380 | 0.001 |
Table 1: Number of layers versus false negative and false positive rates.
IDS results
Part I
The program uses normalization, preprocessing, and balanced dataset creation. Its misuse IDS design focuses on identifying attack patterns and adjusting labels accordingly. Its anomaly IDS design aims to detect deviations from established baselines, allowing it to identify unseen threats. These principles collectively enhance the program’s ability to prepare data for intrusion detection and design IDS models that can effectively identify both known and unknown network threats.
Taking anomalous traffic as the positive class, the summarized results below give:

Accuracy = (TP + TN) / n = (42 + 1990) / 2050 ≈ 99.1%

Misclassification Rate = (FP + FN) / n = (10 + 8) / 2050 ≈ 0.9%

Precision = TP / (TP + FP) = 42 / 52 ≈ 80.8%

True Positive Rate = TP / (TP + FN) = 42 / 50 = 84.0%

False Positive Rate = FP / (FP + TN) = 10 / 2000 = 0.5%
Summarized results
| ( n = 2050 ) | Predicted Normal | Predicted Anomalous |
|---|---|---|
| Actually Normal | 1990 | 10 |
| Actually Anomalous | 8 | 42 |
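Treating anomalous traffic as the positive class, the part I figures can be checked directly from the confusion matrix above:

```python
def ids_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard IDS metrics computed from a binary confusion matrix."""
    n = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / n,
        "misclassification": (fp + fn) / n,
        "precision": tp / (tp + fp),
        "tpr": tp / (tp + fn),   # true positive rate (recall)
        "fpr": fp / (fp + tn),   # false positive rate
    }

# Part I anomaly IDS: TP = 42, TN = 1990, FP = 10, FN = 8
part1 = ids_metrics(tp=42, tn=1990, fp=10, fn=8)
```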
Part II
Using the MLP neural network and the methods described above, the program’s updated statistics are as follows:
Accuracy = (TP + TN) / n = (31 + 1998) / 2050 ≈ 99.0%

Misclassification Rate = (FP + FN) / n = (2 + 19) / 2050 ≈ 1.0%

Precision = TP / (TP + FP) = 31 / 33 ≈ 93.9%

True Positive Rate = TP / (TP + FN) = 31 / 50 = 62.0%

False Positive Rate = FP / (FP + TN) = 2 / 2000 = 0.1%
Summarized results
| ( n = 2050 ) | Predicted Normal | Predicted Anomalous |
|---|---|---|
| Actually Normal | 1998 | 2 |
| Actually Anomalous | 19 | 31 |
Work distribution
Throughout the project, the group effectively divided responsibilities to streamline the research, Python programming tasks (including data preprocessing, misuse IDS, and anomaly IDS), and documentation. The team employed collaborative coding practices, working together during debugging to identify and resolve issues promptly. Additionally, the group established a routine of regular check-ins every three to four days during the development process to discuss progress and challenges. To facilitate seamless collaboration and version control, the team utilized GitHub and adhered to GitHub rules, ensuring easy merging of code changes. They also implemented repository settings to require Pull Requests, preventing accidental pushes to the main branch, and allowing the group to maintain a clean and organized codebase throughout the project’s development lifecycle. This collaborative and structured approach contributed to the project’s overall success and efficiency.
In total, it took the group around fifteen hours for research, forty-two hours for development, and twenty-two hours for documentation.
Notes
Note that this post was made for the Introduction to Intelligent Security Systems (CSCI 532) course at the Rochester Institute of Technology.
The course introduces students to the current state of applications of intelligent methodologies in computer security and information assurance systems design. It reviews application areas such as intrusion detection and monitoring systems, access control and biometric authentication, and firewall structure and design. Students are required to implement a course project on the design of a particular security tool applying an artificial intelligence methodology and to undertake its performance analysis.
If you enjoyed this post on IDS security, consider reading Fine-Tuning LLMs: Pre-trained Transformers with Python or my other research.