The full promise of what artificial intelligence can achieve always seems to be just around the corner. In December of 2019, F-Secure announced Project Blackfin, a research effort with the goal of applying collective intelligence techniques to the cyber security domain. We’re excited to share new developments in this effort.
Project Blackfin research is conducted by F-Secure and involves collaboration between our in-house engineers, researchers, and data scientists, and academic partners. Collective intelligence techniques leverage collaboration between multiple agents or entities. Such collaboration can lead to emergent behaviours capable of performing tasks or developing insights that would not be possible with a single model or system. To develop collective intelligence techniques, we’ve created a simulation platform for studying the interactions between multiple agents in a system. This article describes that platform.
Development of distributed machine learning solutions is a non-trivial task. Most current machine learning solutions are centralized: models are developed, trained, and used in systems that run in data centers or in the cloud. Compared to distributed solutions, development of centralized solutions is much easier, since local data is readily available, the feedback loop during development is short, and machine learning tools are largely designed for such environments.
How the Blackfin Simulation Platform works
The Blackfin Simulation Platform is a development environment specifically designed for prototyping and studying distributed machine learning mechanisms. It allows developers to quickly and easily simulate the behaviour of thousands of independent, interconnected agents with relatively low overhead. The platform also supports distributing data sets and simulating real-life behaviour, and simulations can be easily repeated to fine-tune processes.
The simulation platform itself is built on top of the Amazon Web Services (AWS) public cloud and has been designed to support development of three types of distributed machine learning solutions:
- Local Learning – each agent trains a local model with the locally available data and uses the model for local inference.
- Federated Learning – a group of agents collaborate to train a model. Each agent trains a local model with local data and sends model updates to a central parameter server. The parameter server combines updates from individual agents into a global model and delivers the updated model back to the agents.
- Peer-to-Peer Learning – a group of agents collaborate to train a globally optimized model, without the help of any centralized coordination service. Similar to federated learning, agents use local data to train local models. The agents share these local model updates with one another and apply the received updates to their local models to build a globally optimized model.
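To make the federated setup described above concrete, here is a minimal FedAvg-style sketch: each agent takes one gradient step on its private data, and a parameter server averages the resulting models. The function names, the least-squares task, and the use of plain NumPy arrays as "models" are illustrative assumptions, not the platform's actual API.

```python
import numpy as np

def local_update(global_model, local_data, lr=0.1):
    """Agent side: one gradient step of least-squares regression on
    this agent's private data (illustrative local training round)."""
    X, y = local_data
    grad = X.T @ (X @ global_model - y) / len(y)
    return global_model - lr * grad

def federated_average(updates):
    """Parameter server side: combine per-agent models by simple averaging."""
    return np.mean(updates, axis=0)

# Simulate federated rounds over three agents, each holding private data
# drawn from the same underlying linear relationship.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
agents = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    agents.append((X, X @ true_w))

global_model = np.zeros(2)
for _ in range(100):
    updates = [local_update(global_model, data) for data in agents]
    global_model = federated_average(updates)
# global_model now approximates true_w without any agent sharing raw data.
```

Note that only model parameters cross the network; each agent's raw data stays local, which is the core privacy property of the federated pattern.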
Three key ingredients are required to run a simulation in the Blackfin Simulation Platform: agent logic, input data and compute capacity.
Agent logic defines how each simulated agent behaves during a simulation, and can include directives on how the agent processes the data it has available and how it communicates with other agents or systems in the simulation. By having agents share information with one another, we can simulate the impact of one agent’s actions and observations on the entire swarm. This is crucial for extending beyond single agents and experimenting with true collaborative swarm behaviour.
Agent logic is implemented in Python, thus allowing developers to use the most popular and powerful machine learning tools available. The code required to run simulations is very simple and straightforward.
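As a rough illustration of what such Python agent logic might look like, the sketch below defines an agent that scores its local events and shares a summary with a peer. The `Agent` class, its method names, and the message format are hypothetical, not the platform's real interface.

```python
class Agent:
    """A hypothetical simulated agent: processes its local event stream
    and exchanges summary messages with peers (interface is illustrative)."""

    def __init__(self, agent_id):
        self.agent_id = agent_id
        self.inbox = []  # messages received from peers

    def process(self, events):
        # Toy local inference: score each event by how far its value
        # deviates from this agent's local mean, and report the worst one.
        values = [e["value"] for e in events]
        mean = sum(values) / len(values)
        return max(abs(v - mean) for v in values)

    def send(self, peer, message):
        # Communication directive: deliver a message to another agent.
        peer.inbox.append((self.agent_id, message))

# Two agents score their local data, then share their top anomaly scores.
a, b = Agent("a"), Agent("b")
score_a = a.process([{"value": 1.0}, {"value": 1.2}, {"value": 9.0}])
score_b = b.process([{"value": 0.9}, {"value": 1.1}, {"value": 1.0}])
a.send(b, {"top_score": score_a})
b.send(a, {"top_score": score_b})
```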
Input data can take many forms, such as real-world data collected from red-teaming exercises or attack simulations, the output of previous experiments, or synthetic data generated by logic in the agents themselves. Data is stored in Amazon Simple Storage Service (Amazon S3) in a compressed JSON format and partitioned into per-agent input objects. Each simulated agent reads its own data directly from Amazon S3 and processes it according to the logic defined in the agent’s code.
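A per-agent fetch of compressed JSON from Amazon S3 might be sketched as follows. The bucket layout (`inputs/<agent_id>.json.gz`) and the one-record-per-line encoding are guessed conventions, not the platform's documented format.

```python
import gzip
import io
import json

def parse_agent_input(blob: bytes):
    """Decode one agent's input object: gzip-compressed JSON, one record
    per line (an assumed layout)."""
    with gzip.open(io.BytesIO(blob), "rt") as fh:
        return [json.loads(line) for line in fh if line.strip()]

def load_agent_input(bucket: str, agent_id: str):
    """Fetch this agent's partition directly from Amazon S3 and decode it.
    The key naming scheme is a hypothetical convention."""
    import boto3  # imported lazily so parsing stays usable without AWS access
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=f"inputs/{agent_id}.json.gz")
    return parse_agent_input(obj["Body"].read())
```

Partitioning the data into one object per agent means each agent's read is independent, so thousands of agents can pull their inputs from S3 in parallel without coordinating.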
Finally, the platform itself requires adequate compute capacity to simulate thousands of agents. Simulations are performed on a cluster of Amazon EC2 instances, with the Ray distributed execution framework orchestrating and distributing agents across the cluster. Ray automatically provisions the required compute capacity and scales it to meet the needs of a simulation.
With all the ingredients in place, a researcher can spin up a cluster of Amazon EC2 instances, distribute agents onto the cluster, allow them to execute agent logic against input data stored in Amazon S3, and evaluate the results of the simulation. Agent logic can then be tweaked, and the simulation re-run, in order to determine whether results improved or not. This is one of the most powerful capabilities of the Blackfin Simulation Platform – it allows developers to quickly prototype ideas in a reproducible manner, iterate on promising ones, and validate solutions at scale before new models or logic are released to real agents in the field. The Blackfin Simulation Platform is also being used to test the robustness of our models and pipelines against adversarial tactics, such as model poisoning, inference, and availability attacks. The topic of adversarial testing will be discussed further in an upcoming blog post.
How we intend to utilize the Blackfin Simulation Platform
Detecting signs of lateral movement is incredibly tricky, since it requires knowledge of events occurring on or between multiple hosts on a network over a period of time. Centralized breach detection solutions aren’t well suited to such tasks, since these tasks often involve working backwards through historical data in search of a chain of events in reverse chronological order.
As part of Project Blackfin, we’re looking to develop methods to detect lateral movement. We plan to do this by building logic that runs on each host and gathers readings from other hosts on the network. This logic then feeds those readings into “second order” machine learning models that are conditioned on the output of preceding models and logic.
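A toy sketch of that "second order" idea: a model whose inputs are the outputs of per-host anomaly models rather than raw events. Everything here, including the threshold, the weighting heuristic, and the feature layout, is illustrative rather than an actual Blackfin model.

```python
def first_order_score(host_events):
    """Per-host anomaly model (stand-in): fraction of events the host's
    own logic flagged as suspicious."""
    flagged = sum(1 for e in host_events if e.get("suspicious"))
    return flagged / max(len(host_events), 1)

def second_order_score(local_score, neighbour_scores):
    """'Second order' model: conditioned on the outputs of the per-host
    models. A chain of elevated scores across neighbouring hosts is
    weighted more heavily than an isolated local anomaly (heuristic)."""
    if not neighbour_scores:
        return local_score
    chained = sum(s for s in neighbour_scores if s > 0.5)
    return local_score + chained / len(neighbour_scores)

# A host with a modest local score but several anomalous neighbours ends
# up scored higher than the same host with quiet neighbours.
local = first_order_score([{"suspicious": True}, {}, {}, {}])
with_chain = second_order_score(local, [0.8, 0.9, 0.1])
isolated = second_order_score(local, [0.1, 0.1, 0.1])
```

The point of the composition is that no single host needs global visibility: each host contributes a cheap local score, and the second-order layer turns those scores into a network-wide signal.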
The Blackfin Simulation Platform allows us to prototype this process. By spinning up a large number of hosts, and simulating sequences of events and the output of anomaly detection models on those hosts, we can experiment with mechanisms that selectively gather observations from neighbouring nodes and then feed those observations into models designed to detect chains of events representative of lateral movement. We can then observe the resulting output and tweak mechanisms accordingly. The output of these models can then be used to precipitate further actions, such as gathering of more data, or generation of alerts or reports. We’ll leave the details on what those experiments look like to a future post. If you’re interested in learning more about Project Blackfin, and some of our proposed detection methodologies, take a look at our recent whitepaper.