Detection of anomalous process creation chains using word vectorization, normalization, and an autoencoder
Anomaly detection methods are widely used in cybersecurity because they are effective at detecting unknown and sophisticated attacks. In general, these methods model the normal or benign behavior of computing devices, networks and systems in the absence of attacks, and then use the trained models to detect attacks as deviations from that behavior. By design, anomaly detection-based security mechanisms do not need any prior knowledge about attacks; they only require examples of normal behavior. This approach is especially useful for countering determined and skilful attackers who invest in developing novel attack tactics and techniques. Many different anomaly detection methods exist in the realm of cybersecurity. In this article, we present one such method, which is based on the analysis of process creation chains.
Processes running on computer systems launch other processes. For instance, when you open an application (e.g. a web browser or word processor) on a Windows system, the user interface process (typically explorer.exe) launches it. By collecting data on which processes launch which other processes (parent-child relationships), we can construct process creation trees such as the one pictured below. We hypothesize that rare process creation chains may be indicative of suspicious or malicious activity.
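To make the idea of a process creation tree concrete, here is a minimal sketch of how parent-child launch events might be turned into a tree and then into root-to-leaf chains. The event list, the function names and the use of bare executable names are all illustrative assumptions, not the data or code used in this research.

```python
from collections import defaultdict

# Hypothetical (parent, child) launch events; real events would come from logs.
events = [
    ("explorer.exe", "chrome.exe"),
    ("explorer.exe", "winword.exe"),
    ("winword.exe", "cmd.exe"),
    ("cmd.exe", "powershell.exe"),
]

def build_tree(pairs):
    """Map each parent to the sorted list of distinct children it launched."""
    children = defaultdict(set)
    for parent, child in pairs:
        children[parent].add(child)
    return {parent: sorted(kids) for parent, kids in children.items()}

def chains_from(tree, root, prefix=()):
    """Yield every root-to-leaf creation chain as a tuple of process names."""
    path = prefix + (root,)
    # Skip children already on the path so that cyclic data cannot recurse forever.
    kids = [kid for kid in tree.get(root, []) if kid not in path]
    if not kids:
        yield path
    for kid in kids:
        yield from chains_from(tree, kid, path)

tree = build_tree(events)
chains = list(chains_from(tree, "explorer.exe"))
for chain in chains:
    print(" > ".join(chain))
```

Counting how often each chain occurs across many machines would then give a first, crude notion of which chains are rare.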
Process creation chains can be assembled by analyzing log data collected from a computer system. This data is usually available as text, and process launches are usually denoted by the full path to both parent and child executables. A full path may look something like this:
In the above example, the path to java.exe contains a directory denoting its version of the runtime environment (in this case, jre1.7.0_127). When the version of this software changes, so does the path:
An easy way to construct parent-child process creation pairs would be to strip away the path, leaving only the executable name. However, this leaves a loophole open to attackers: they can simply rename a malicious executable in order to bypass such a mechanism. Therefore, we’d like to recognize all java.exe executables installed under similar directory structures as similar, but different from a java.exe installed under a substantially different path (for instance in the Recycle Bin, Downloads directory, temp directory, or web cache). To do this, we decided to treat these strings as a natural language processing (NLP) problem, in which paths are sentences and directory and file names are words. We then encode these words into vectors using a word2vec-style model. Put simply, word vectors are numerical representations that a machine learning model assigns to words while it is trained on tasks such as predicting missing words in a sentence or guessing which words precede or follow other words. Machine learning models of this type generate representations that capture associations between words like the ones shown below.
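The "paths as sentences" idea can be sketched in a few lines. The example below splits a path into words and shows that two installations differing only in version differ in a single word. Both paths are hypothetical (only java.exe and jre1.7.0_127 appear in this article), and lowercasing the words is our own assumption, made because Windows paths are case-insensitive.

```python
import re

def tokenize_path(path):
    """Split a Windows path into lowercase 'words': drive, directories, file name."""
    parts = re.split(r"[\\/]+", path.strip())
    return [part.lower() for part in parts if part]

# Hypothetical example paths that differ only in the runtime version directory.
old = tokenize_path(r"C:\Program Files\Java\jre1.7.0_127\bin\java.exe")
new = tokenize_path(r"C:\Program Files\Java\jre1.7.0_131\bin\java.exe")

# Words that differ position by position between the two "sentences".
differing = [(a, b) for a, b in zip(old, new) if a != b]

print(old)
print(differing)
```

A word-vector model trained on such sentences sees both version directories in identical contexts, which is what allows it to place them close together.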
In order to prototype our method, we collected approximately 22 million raw log entries pertaining to parent-child process launch events from a network of computers. Our dataset contained 745,379 unique “sentences” (file paths). After tokenization, we obtained 532,717 unique “words” (directory and file names), meaning that almost every sentence contained a unique string. This large vocabulary size was to be expected, because file paths contain many variable parts, some of which change over time. In addition to the version-based file paths already mentioned, we encountered date and time strings (e.g. 2019-12-12), language strings (e.g. en-US, pl-PL), and many randomly generated strings such as UUIDs, temporary filenames (install_14123412.exe), temporary directory names, and so on.
We trained a FastText model (with 64-dimensional output vectors) on our tokenized data and analyzed the resulting word vectors. The model, which took between 5 and 20 minutes to train over 20 epochs on an AWS m4.xlarge instance, adequately captured similarity between related file names and directory names, as shown in the example below.
We clustered the resulting word vectors by cosine distance using the DBSCAN algorithm, and observed that many of the clusters had captured common patterns in the naming of files or directories. Below is a t-SNE projection of the clusters, with labels denoting common patterns found.
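For readers unfamiliar with DBSCAN, the following is a simplified pure-Python sketch of density-based clustering under cosine distance. It is not the implementation used in this research (a library implementation such as scikit-learn's DBSCAN would normally be used). The two-dimensional vectors, the eps value and min_samples are made-up assumptions chosen only to keep the example readable.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    n = len(points)
    # Brute-force neighbourhoods; each point counts as its own neighbour.
    neighbors = [
        [j for j in range(n) if cosine_distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep growing the cluster
    return labels

# Toy 2-D "word vectors" (hypothetical values; the real vectors had 64 dimensions).
words = ["jre1.7.0_127", "jre1.7.0_131", "jre1.8.0_45", "en-us", "pl-pl", "system32"]
vectors = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0]]

labels = dbscan(vectors, eps=0.05, min_samples=2)
print(dict(zip(words, labels)))
```

In this toy setup the version-like directory names fall into one cluster, the language strings into another, and the unrelated word is left as noise.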
Many of the clusters contained simple patterns that could be easily described with short regular expressions. We wrote a script to automatically generate those regular expressions across all of the identified clusters.
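As an illustration of how a regular expression might be derived from a cluster, the sketch below generalises every run of digits to a digit pattern and accepts the result only if all members of the cluster agree on it. This is a deliberately simple heuristic of our own for this example, not the script described above. Apart from install_14123412.exe, the file names are hypothetical.

```python
import re

def generalise(token):
    """Replace each run of digits with a digit-run pattern and escape the rest."""
    # With a capturing group, re.split puts the digit runs at the odd indices.
    parts = re.split(r"(\d+)", token)
    return "".join(r"\d+" if i % 2 else re.escape(p) for i, p in enumerate(parts))

def cluster_regex(tokens):
    """Return a regex shared by every token in the cluster, or None if they disagree."""
    patterns = {generalise(token) for token in tokens}
    return patterns.pop() if len(patterns) == 1 else None

cluster = ["install_14123412.exe", "install_99120034.exe", "install_5.exe"]
pattern = cluster_regex(cluster)
print(pattern)
```

Clusters whose members do not reduce to a single pattern (the function returns None) would need a richer generalisation step or manual review.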
This process created about 30 regular expressions, which we then used to normalize and reduce our initial dataset. This resulted in 169,161 unique sentences (down from 745,379, a more than four-fold reduction) and 75,314 unique words (down from 532,717, about a seven-fold reduction). We then trained a FastText model using these new, normalized paths.
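The normalization step can be sketched as follows. The four rules and the placeholder tokens are hypothetical stand-ins for the roughly 30 generated expressions, and replacing a matching word with a placeholder is our assumption about how such rules could be applied.

```python
import re

# Hypothetical rules: (pattern matching a whole word, placeholder token).
RULES = [
    (re.compile(r"jre\d+\.\d+\.\d+_\d+"), "<jre_version>"),
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "<date>"),
    (re.compile(r"[a-z]{2}-[a-z]{2}"), "<locale>"),
    (re.compile(r"install_\d+\.exe"), "<install_exe>"),
]

def normalise_word(word):
    """Replace a word with its placeholder if any rule matches it entirely."""
    for pattern, placeholder in RULES:
        if pattern.fullmatch(word):
            return placeholder
    return word

def normalise_path(path):
    """Tokenize a Windows path into lowercase words and normalise each word."""
    words = [w.lower() for w in re.split(r"[\\/]+", path) if w]
    return [normalise_word(w) for w in words]

# Hypothetical paths: two legitimate-looking installs and one in Downloads.
a = normalise_path(r"C:\Program Files\Java\jre1.7.0_127\bin\java.exe")
b = normalise_path(r"C:\Program Files\Java\jre1.8.0_45\bin\java.exe")
c = normalise_path(r"C:\Users\bob\Downloads\java.exe")

print(a)
print(a == b, a == c)
```

After normalization the two versioned installs collapse into the same sentence, while the copy under Downloads stays distinct, which is the behaviour the method relies on.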
The final part of the puzzle was to build a model able to separate anomalous process creation chains from benign or common ones. We opted to use an autoencoder: a machine learning model designed to replicate its input via a latent (compressed) representation. Our hypothesis was that the model’s reconstruction error should be lower for common process chains than for rare ones. An autoencoder learns to reproduce inputs it has been trained on, as in this example:
However, when faced with inputs it hasn’t seen, it is more likely to output garbage like this:
The reconstruction error is calculated from the similarity between the model’s input and its output: the less similar the two are, the higher the error.
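One common way to quantify this is the mean squared error between the input vectors and the reconstructed vectors. The sketch below assumes that metric purely for illustration, since the exact error function is not specified here. The tiny two-dimensional vectors are made up.

```python
def reconstruction_error(inputs, outputs):
    """Mean squared error between two equally shaped sequences of vectors."""
    if len(inputs) != len(outputs):
        raise ValueError("input and output sequences must have the same length")
    total = 0.0
    count = 0
    for x, y in zip(inputs, outputs):
        if len(x) != len(y):
            raise ValueError("vectors must have the same dimension")
        for a, b in zip(x, y):
            total += (a - b) ** 2
            count += 1
    return total / count

# Hypothetical process chain encoded as two 2-D word vectors.
chain = [[0.0, 1.0], [1.0, 0.0]]
good_output = [[0.0, 1.0], [1.0, 0.0]]  # a perfect reconstruction
bad_output = [[1.0, 0.0], [0.0, 1.0]]   # a poor reconstruction

good_error = reconstruction_error(chain, good_output)
bad_error = reconstruction_error(chain, bad_output)
print(good_error, bad_error)
```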
Using the vectors obtained from our previous step, we trained a bidirectional RNN autoencoder on process chain sequences observed in our collected data.
The model was trained for 2000 epochs on an AWS p2.xlarge instance using a batch size of 64, which took between 5 and 10 minutes. After training, we tested its ability to detect anomalous process chains by randomly shifting the order of executables in our training sequences and observing the model’s reconstruction error. For process chains typically seen on computer systems, the autoencoder’s reconstruction error was below 0.004. For randomly shifted sequences, the reconstruction error was high, above 0.004.
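The perturbation used in this test can be sketched as below. We read "randomly shifting the order" as a random reordering of the executables in a chain. That reading, the function name and the example chain are assumptions made for illustration.

```python
import random

def shuffled_variant(chain, rng):
    """Return a random reordering of chain that differs from the original order."""
    if len(set(chain)) < 2:
        return list(chain)  # nothing to reorder
    variant = list(chain)
    while variant == list(chain):
        rng.shuffle(variant)
    return variant

rng = random.Random(0)  # fixed seed so the sketch is repeatable
chain = ["explorer.exe", "winword.exe", "cmd.exe", "powershell.exe"]
variant = shuffled_variant(chain, rng)
print(variant)
```

Each variant would then be encoded and passed through the trained autoencoder, and its reconstruction error compared with that of the unmodified chain.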
As illustrated above, it is easy to set a threshold value above which a process chain can be considered anomalous. While not all rare process chains are suspicious or malicious, the output of this model can be combined with the output of other anomaly detection mechanisms in order to flag activity for closer inspection by an analyst or deliver a verdict.
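A thresholding step of this kind might look like the following sketch. The error values are invented for illustration and are not measurements from our model. Placing the threshold a fixed margin above the largest benign error is just one simple assumption among many possible choices.

```python
def pick_threshold(benign_errors, margin=1.5):
    """Place the threshold a safety margin above the largest benign error."""
    return max(benign_errors) * margin

def flag_anomalies(errors_by_chain, threshold):
    """Return, in sorted order, the chains whose error exceeds the threshold."""
    return sorted(c for c, err in errors_by_chain.items() if err > threshold)

# Hypothetical reconstruction errors for chains known to be benign.
benign_errors = [0.0008, 0.0012, 0.0021]
threshold = pick_threshold(benign_errors)

# Hypothetical errors for newly observed chains.
observed = {
    "explorer.exe > chrome.exe": 0.0010,
    "winword.exe > cmd.exe > powershell.exe": 0.0190,
}
flagged = flag_anomalies(observed, threshold)
print(threshold, flagged)
```

Flagged chains are candidates for closer inspection rather than verdicts in themselves, in line with the caveat above that not every rare chain is malicious.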
This research was presented at PyData Warsaw 2019. A video of the presentation can be found here: https://www.youtube.com/watch?v=I0M6Qb-B8nU