A feature taxonomy for network traffic

Date
2019
Publisher
University of Delaware
Abstract
Even though Artificial Intelligence (AI) is still a black box to many people, experts and non-experts alike, it has become an important tool in current and future technology. Every day, we trust AI with our lives, from driving cars to using medical devices. One such important part of life is the Internet, which is essentially a worldwide exchange of data packets at a very high rate. Security is an integral part of this exchange, which encourages enterprises to use Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS) to detect and prevent anomalous activities. As much as we want to use AI to ease the task of anomaly detection, we also want to win the trade-off between correct decisions (true positives and true negatives) and errors (false positives and false negatives), and ultimately achieve true AI.

Among the various applications of AI, machine learning (ML) is the best known. Conceptually, ML is the discipline of discovering probabilistic models that use algorithms to learn patterns, behaviors, and decision-making capabilities from data. The higher the accuracy of a model, the better these patterns have been learned. These models are developed by training on a large set of data and then measuring their accuracy on a separate test dataset. Therefore, we can safely say that the driving engine behind ML is data. If we want ML to make decisions like a human brain, we need to train it on the best possible version of the data we have.

Trusting a probabilistic mechanism to make the right decision might be mathematically acceptable, but that is not the case in a dynamic environment like a network, where multiple devices such as computers, routers, switches, and servers communicate, and thousands of packets are exchanged every minute while passing through multiple devices. The volume of data seen or collected at a node in a network is enormous, even in a short span of time.

Our goal in this thesis is to collect a dataset of the different types of packets that arrive at a node over long durations of time, in order to facilitate the identification of unusual traffic that might indicate a system error or a possible attack. Not all the data in a packet contributes to the identification of an anomaly, so we do not feed the raw captured data directly into a machine learning model. We study the collected data and filter it so that useful information is retained while noise and repetitive data are removed. We also perform feature engineering on the traffic: we apply statistical computations such as mean and variance to the numeric data and to quantitative flow records[8,9] to obtain derived characteristics that might play an important role in spotting specific behaviors. As a result, we are able to develop an extensive taxonomy of packet data which may be useful in our goal of detecting anomalies.

This taxonomy is a collection of information both taken directly from the packet files and derived from the information collected. It defines a tree-like structure that represents all the features of a packet as well as of the traffic flow. The branches of the tree are the categories and subcategories, and the leaves of the tree represent the final features, which can be used directly as estimators of a machine learning model.
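As a rough illustration of the tree-like structure described in the abstract, the following Python sketch builds a tiny taxonomy whose leaves are features and flattens it into a feature dictionary. It is a minimal sketch under stated assumptions, not the thesis's actual taxonomy or code: the category names (`packet`, `flow`, `header`, `statistics`), the helper names `derived_features` and `flatten`, and the sample values are all hypothetical.

```python
from statistics import mean, pvariance


def derived_features(packet_sizes, arrival_times):
    """Compute mean/variance features for one flow (hypothetical helper).

    packet_sizes: packet lengths in bytes; arrival_times: timestamps in seconds.
    """
    # Inter-arrival gaps between consecutive packets of the flow.
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    return {
        "size": {"mean": mean(packet_sizes), "variance": pvariance(packet_sizes)},
        "inter_arrival": {"mean": mean(gaps), "variance": pvariance(gaps)},
    }


def flatten(tree, prefix=""):
    """Walk the tree and return {dotted.path: value} for every leaf."""
    leaves = {}
    for name, node in tree.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(node, dict):
            # A branch: a category or subcategory of the taxonomy.
            leaves.update(flatten(node, path))
        else:
            # A leaf: a final feature usable as model input.
            leaves[path] = node
    return leaves


# Illustrative taxonomy: one branch taken directly from packet fields,
# one branch derived from flow-level statistics. All values are made up.
taxonomy = {
    "packet": {"header": {"protocol": 6, "ttl": 64}},
    "flow": {
        "statistics": derived_features(
            [60, 1500, 60, 1500], [0.0, 0.5, 1.0, 2.0]
        )
    },
}

features = flatten(taxonomy)
for name, value in sorted(features.items()):
    print(name, value)
```

The dotted leaf names (for example `flow.statistics.size.mean`) keep the category path visible, so a feature can be traced back to its branch in the tree. Population variance (`pvariance`) is used here as an arbitrary choice; the abstract does not say which variance estimator the thesis uses.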