Threat analysis of a reinforcement learning trading algorithm

Andrew Patel

19.04.21 12 min. read

Continued innovation in the machine learning space is leading to the deployment of new models and systems with novel and interesting functionality. While the study of how the infrastructure surrounding these models might be attacked (and thus secured) is of interest to the cyber security community, so too is the study of the attack surface of the machine learning models themselves. We were recently given the opportunity to study and perform a threat analysis on a prototype energy trading marketplace, known as FLEXIMAREX, that utilizes reinforcement learning agents to perform trading operations. In this article, we present a description of the system and our analysis of its attack surface from a machine learning perspective.

FLEXIMAREX marketplace and smartbot trading agents

FLEXIMAREX is a prototype marketplace for trading energy. It is designed to provide a platform for buying and selling batches of energy flexibility. A batch of energy flexibility is defined as an amount of energy that can be consumed or provided over a pre-defined period of time. By creating a contract in the marketplace, an energy consumer agrees to stop, increase, or decrease their own energy consumption at a certain point in time. The promise to perform these actions is traded on the marketplace in a fashion akin to “short-selling” on stock markets. FLEXIMAREX follows free market principles – buy and sell offers are matched in order of arrival (first come, first served). Orders include three criteria: timing, wattage, and price. If a buyer is willing to pay at least the seller’s ask price, the transaction is executed. FLEXIMAREX is intended to serve a large audience ranging from individual households, to industry, to energy providers themselves. Detailed information on the FLEXIMAREX marketplace is available online at
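The matching rule described above can be sketched in a few lines of Python. This is only an illustration, not the FLEXIMAREX implementation: the `Order` fields, the requirement that timing and wattage match exactly, and the choice to execute at the seller's ask price are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Order:
    participant: str
    slot: int      # timing: index of the delivery period (assumed encoding)
    watts: float   # wattage of the flexibility batch
    price: float   # bid price for a buy order, ask price for a sell order


def match_orders(buys, sells):
    """Match orders first come, first served and return the executed trades.

    Both lists are assumed to be in order of arrival. A trade executes when
    timing and wattage agree and the buyer's bid is at least the ask price.
    """
    pending_sells = list(sells)
    trades = []
    for buy in buys:
        for sell in pending_sells:  # the earliest compatible sell order wins
            same_batch = sell.slot == buy.slot and sell.watts == buy.watts
            if same_batch and buy.price >= sell.price:
                # Assumption: the trade settles at the seller's ask price.
                trades.append((buy.participant, sell.participant, sell.price))
                pending_sells.remove(sell)
                break
    return trades


buys = [Order("b1", 1, 100.0, 5.0), Order("b2", 1, 100.0, 9.0)]
sells = [Order("s1", 1, 100.0, 8.0), Order("s2", 1, 100.0, 4.0)]
print(match_orders(buys, sells))
```

In this example the first buyer's bid is below the first ask, so it is matched with the later, cheaper sell order, and the second buyer then takes the remaining one.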

To facilitate energy trading on the marketplace, participants will be provided with smartbots. Each smartbot is designed to communicate with both the marketplace and a local environment control module (that provides information about energy production, consumption, and storage). Smartbot decision logic will be implemented using reinforcement learning algorithms, and can perform the following actions:

  • Place a trade order to buy an energy flexibility batch
  • Place a trade order to sell an energy flexibility batch
  • Hold onto excess energy

Different smartbots have been proposed, each of which is designed to execute a different trading strategy, such as:

  • Buy energy at a low price for own consumption (consumer)
  • Sell produced energy excess at a high price (individual household producer)
  • Balance energy consumption and production over time (global energy provider)
  • Generate profit by trading energy, e.g., buying, holding and selling later at a higher price (energy trader)

Reinforcement learning agents need to take a large number of initial exploratory actions before they start to learn an efficient policy in their operating environment. For an agent designed to trade on a marketplace, this means placing many orders, most of which are likely to result in poor outcomes. Since this initial training phase is costly if done against a live marketplace (the owner of that agent will suffer many trading losses), agents will be pre-trained in simulated environments and against historical data. These pre-trained smartbots will then be provided to customers as a subscription service that can be run locally (in participants’ households) or remotely (in the cloud). FLEXIMAREX smartbots will be implemented using public machine learning libraries, including Tensorforce and TensorFlow.
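To make the cost of exploration concrete, the following is a minimal sketch of epsilon-greedy value learning against a toy simulated market. It uses only the Python standard library rather than Tensorforce or TensorFlow, and the two price levels, the reward rule, and all hyperparameters are assumptions chosen for illustration. A real smartbot would observe far richer state.

```python
import random

ACTIONS = ("buy", "hold", "sell")


def simulated_reward(price_level, action):
    """Toy market: buying low or selling high pays off, the reverse loses."""
    if action == "hold":
        return 0.0
    good = (price_level, action) in {("low", "buy"), ("high", "sell")}
    return 1.0 if good else -1.0


def greedy_action(q, price_level):
    """Pick the action with the highest learned value for this price level."""
    return max(ACTIONS, key=lambda a: q[(price_level, a)])


def pretrain(steps=2000, epsilon=0.2, alpha=0.5, seed=0):
    """Learn action values in simulation and count the losing trades made."""
    rng = random.Random(seed)
    q = {(p, a): 0.0 for p in ("low", "high") for a in ACTIONS}
    losing_trades = 0
    for _ in range(steps):
        price_level = rng.choice(("low", "high"))
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)            # explore
        else:
            action = greedy_action(q, price_level)  # exploit
        reward = simulated_reward(price_level, action)
        if reward < 0:
            losing_trades += 1
        # Move the value estimate a fraction alpha towards the observed reward.
        q[(price_level, action)] += alpha * (reward - q[(price_level, action)])
    return q, losing_trades


q, losses = pretrain()
print(greedy_action(q, "low"), greedy_action(q, "high"), losses)
```

The agent ends up buying low and selling high, but only after placing a number of losing orders along the way, which is why this phase is run in simulation rather than on the live marketplace.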

Threats to the FLEXIMAREX marketplace

The use of reinforcement learning in trading agents brings many benefits to the participants of the FLEXIMAREX marketplace. Since agents are constantly trained against their previous actions, they will be able to adapt to changes in the marketplace with limited participant involvement. However, the fact that participants may rely on these agents to act autonomously exposes the models themselves to attack. It may be possible for adversaries to manipulate smartbots into making decisions that cause losses for the victim, jeopardize market stability, or even disrupt the energy grid. Trading algorithms also represent a competitive advantage in this marketplace – better algorithms will make better trades for their owners, and thus may be the target of theft.

Risk assessment and attack goals

Using an approach inspired by the DREAD risk assessment model, we have identified a number of potential attacks against the smartbots proposed for use in the FLEXIMAREX marketplace. In this article, we focus on adversarial attacks against machine learning models, forgoing analysis of attacks that relate to software, hardware or communications security. The goals and motivations for adversarial attacks targeting these reinforcement learning models may include:

  • Goal 1: Manipulate energy prices for profit. An attacker compromises a model in order to trick a number of smartbots into placing incorrect trading orders.
    • An attacker wanting to sell energy at a high price tricks smartbots into placing buy orders at a high price, thereby trending energy prices in the marketplace upwards.
    • An attacker wanting to buy energy at a low price tricks smartbots into placing sell orders at a low price, thereby trending energy prices in the marketplace downwards.
  • Goal 2: Targeted trade disruption for ransom. An attacker compromises a specific participant’s smartbot, causing it to, for instance, hold energy rather than sell it. This causes the participant’s batteries to reach maximum capacity and newly produced energy to be wasted. The adversary asks for a ransom in order to remedy the situation.
  • Goal 3: Threaten the business of an energy provider. An attacker tricks smartbots belonging to a targeted energy provider into increasing their energy demand by placing many buy orders simultaneously. The targeted energy provider is unable to generate enough energy to fulfil demand from buyers. This can have two impacts: a) the reputation of the targeted energy provider is harmed and b) the targeted energy provider is forced to buy energy from its competitors (possibly at a huge loss if competitors raise their prices).
  • Goal 4: Global disruption of the energy grid (terror attack). As in goal 3, an attacker tricks energy consumer smartbots (for instance, in a given region) into increasing their energy demand. Energy providers are unable to fulfil demand from buyers, causing a complete grid failure and the shutdown of all electric appliances in the targeted region.
  • Goal 5: Steal someone else’s trading algorithm. As noted earlier, trading algorithms can be provided to participants via a subscription model. An adversary looking to use one of these algorithms without paying for it, or to set up a rival subscription service, may be tempted to steal it. Stealing a model is achieved by providing the target model with inputs and examining outputs in a sufficient quantity such that a new model that behaves in a similar manner can be trained from scratch. This querying method can also allow an adversary to learn the victim’s automated trading strategy, thus enabling the adversary to design an algorithm that can efficiently exploit that strategy.

All of the goals outlined above can be achieved by compromising the reinforcement learning trading algorithm contained within a smartbot or a number of smartbots. The first four adversarial goals trick victim smartbots into placing trading orders that are counter to their intended objectives. Adversarial goals 1 and 2 can be achieved by manipulating as few as one smartbot, while adversarial goals 3 and 4 require many smartbots to be compromised. The fifth adversarial goal (model stealing) does not require smartbot compromise – it can be executed without affecting the decisions of the targeted agent. Table 1 summarizes the number of compromised smartbots required to achieve each adversarial goal, the victim(s) of each attack, and the damage they may suffer because of the attack.

Table 1 illustrates a few key points. Victims of most attacks are the same parties whose trading algorithm was compromised, with the exception of goal 4, which affects all parties on a regional energy grid regardless of whether their trading algorithm was compromised. Goal 4 is the attack with the largest amplification effect and the largest overall impact. In general, the number of victims and the severity of damage caused are both linked to the number of compromised smartbots required to execute the attack. Attacks with the highest number of victims and the greatest damage require the most compromised smartbots. Although attack goals 1 and 3 only target a single smartbot or a small number of smartbots, other participants suffer collateral damage.

Attack surface

Attacks against machine learning models can be launched at two different times in their lifecycle: during training and re-training or at runtime. In this section we discuss the attack surface of the reinforcement learning trading algorithm and describe potential attacks that can be executed in order to achieve the adversarial goals identified in the previous section.

Runtime attacks

Figure 2 depicts the information used by the trading algorithm to render its decisions (buy/hold/sell), and the parties providing this information.

Adversarial goals 1-4 can be reached by performing an evasion attack. This attack consists of modifying the data that serves as input to the smartbot. Since goals 1 and 2 require only one smartbot to be tricked, an attacker may choose to modify demand-supply information from the victim’s local environment control module in order to achieve their goals. For this attack to work, the environment control module, or the communication channel between the environment control module and the smartbot, must be compromised. In goals 3 and 4 the adversary must trick multiple smartbots. Considering that it would be difficult to compromise multiple separate environment control modules or their communication paths, a more effective approach may involve tampering with the FLEXIMAREX marketplace, or with communication paths between the marketplace and victim smartbots. Such an attack may be more challenging if the marketplace is well secured, or if a man-in-the-middle attack is not possible. Either way, the attacker must also ensure that adversarial modification of information flowing between the marketplace and victim systems is subtle enough to avoid detection. Adversarial modifications can be generated algorithmically so that they remain unnoticeable to humans while still fooling the target models for an extended period of time.
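As a rough illustration of the evasion idea, the sketch below uses a hypothetical linear scoring rule as a stand-in for a smartbot's learned policy. The real agents are reinforcement learning models, so the weights, the two input features, and the `budget` parameter are all assumptions made for this example. It also assumes the attacker knows the weights, which is the easiest case. Each input is shifted by at most `budget` in the direction that lowers the score, in the spirit of sign-based adversarial perturbations.

```python
def decide(features, weights, bias=0.0):
    """Stand-in policy: sell when the linear score is positive, else hold."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return "sell" if score > 0 else "hold"


def sign(value):
    return (value > 0) - (value < 0)


def evade(features, weights, budget):
    """Shift every input by at most `budget` so that the score decreases."""
    return [x - budget * sign(w) for x, w in zip(features, weights)]


weights = [0.8, 0.5]   # hypothetical weights: price signal, local surplus
honest = [0.3, 0.2]    # readings from the environment control module

print(decide(honest, weights))                       # decision on clean data
print(decide(evade(honest, weights, 0.3), weights))  # decision after tampering
```

With these numbers the clean inputs lead to a sell decision, while the tampered inputs lead the same policy to hold, which corresponds to the targeted disruption described in goal 2. A smaller budget (for example 0.1) is not enough to flip the decision here, which reflects the trade-off between subtlety and effect.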

Adversarial goal 5 is typically achieved at runtime using model extraction (model stealing) attacks. These attacks are performed by submitting a sufficiently large number of queries to a trading algorithm and recording its outputs. The recorded queries and outputs can then be used to train a surrogate model that mimics the victim’s policy.
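The query-and-record loop can be sketched as follows. The victim here is a hypothetical threshold policy over a single price input, and the surrogate is a simple nearest-neighbour lookup; both are assumptions chosen to keep the example short. Extracting a real reinforcement learning policy with many inputs would need far more queries and a trained surrogate model.

```python
import bisect


def victim_policy(price):
    """Hypothetical black box: the attacker only ever sees its outputs."""
    if price < 30.0:
        return "buy"
    if price > 70.0:
        return "sell"
    return "hold"


def extract(query_fn, prices):
    """Query the black box and record (input, output) pairs, sorted by input."""
    return sorted((p, query_fn(p)) for p in prices)


def make_surrogate(records):
    """Build a 1-nearest-neighbour surrogate purely from the recorded pairs."""
    xs = [p for p, _ in records]

    def surrogate(price):
        i = bisect.bisect_left(xs, price)
        candidates = records[max(i - 1, 0): i + 1]  # neighbours on either side
        return min(candidates, key=lambda r: abs(r[0] - price))[1]

    return surrogate


records = extract(victim_policy, [float(p) for p in range(0, 101)])
surrogate = make_surrogate(records)

# Measure agreement on inputs that were never queried.
held_out = [p + 0.25 for p in range(0, 100)]
agreement = sum(surrogate(p) == victim_policy(p) for p in held_out) / len(held_out)
print(agreement)
```

Disagreements are confined to inputs close to the victim's decision thresholds, which is also where an adversary would concentrate further queries to refine the copy.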

Training attacks

Since goals 1-4 modify the decision logic of the trading algorithm, attacks that compromise the model’s training process itself are also viable for achieving adversarial intent. Such attacks can be carried out in two different ways – either by attacking the offline training environment used to initially train these models, or by attacking the continual retraining process used by models already in production.

During offline training, both the marketplace and environment control module are simulated, as shown in Figure 3. Open frameworks such as OpenAI Gym are typically used to build these simulation environments. In order to attack models trained in such an environment, an adversary may be able to influence the training process by:

  • Modifying public libraries used by the training code (such as scikit-learn, TensorFlow, PyTorch, etc.). This allows the adversary to compromise training logic.
  • Modifying publicly available simulation environments (such as those available in OpenAI Gym). This allows the adversary to game information used as inputs to the model, or to trick the agent’s reward computation.
  • Compromising the infrastructure used for training and running the simulation. This allows the adversary to achieve any of the above outcomes.

Note that a compromised simulation environment can also lead to a data poisoning attack if it generates data used for further training. If an adversary gains access to an offline training environment, they can also achieve goal 5 rather easily, since they can simply exfiltrate the agent’s model code and learned weights.
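The effect of a tampered reward computation can be illustrated with a toy example. The sketch below is not based on OpenAI Gym or on the FLEXIMAREX simulator: the two price levels, the reward functions, and the training loop are assumptions made for illustration. The same simple learner is trained once against an honest simulated reward and once against a version in which an attacker has made selling always look unprofitable.

```python
import random

ACTIONS = ("buy", "hold", "sell")
PRICE_LEVELS = ("low", "high")


def honest_reward(price_level, action):
    """Toy simulator: buying low or selling high pays off, the reverse loses."""
    if action == "hold":
        return 0.0
    good = (price_level, action) in {("low", "buy"), ("high", "sell")}
    return 1.0 if good else -1.0


def tampered_reward(price_level, action):
    """Attacker-modified simulator: selling is always penalised."""
    if action == "sell":
        return -1.0
    return honest_reward(price_level, action)


def train(reward_fn, steps=2000, epsilon=0.2, alpha=0.5, seed=0):
    """Epsilon-greedy value learning; returns the greedy policy per price level."""
    rng = random.Random(seed)
    q = {(p, a): 0.0 for p in PRICE_LEVELS for a in ACTIONS}
    for _ in range(steps):
        p = rng.choice(PRICE_LEVELS)
        if rng.random() < epsilon:
            a = rng.choice(ACTIONS)                       # explore
        else:
            a = max(ACTIONS, key=lambda x: q[(p, x)])     # exploit
        q[(p, a)] += alpha * (reward_fn(p, a) - q[(p, a)])
    return {p: max(ACTIONS, key=lambda x: q[(p, x)]) for p in PRICE_LEVELS}


print(train(honest_reward))
print(train(tampered_reward))
```

The training code is identical in both runs. Only the simulator differs, yet the agent trained against the tampered reward learns to hold when prices are high instead of selling, which is the behaviour an attacker pursuing goal 2 would want.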

The second scenario involves attacks against the re-training process in deployed models. The attacks described above all apply to re-training in deployment in the same way.


Using smartbots to automate energy trading has both upsides and downsides for customers of the FLEXIMAREX marketplace. Such smartbots can be used to automatically and efficiently trade energy batches based on a user’s preferences and can adapt to evolution and changes in the marketplace. However, these smartbots can also be attacked in a variety of ways. An attacker can manipulate the input information to the algorithm, its training platform, its training libraries, or the simulation platform used to create new agents, in order to achieve various attack goals that have detrimental effects for any participant in the marketplace. Generally speaking, the overall impact of an attack will be relative to the number of compromised smartbots. However, large-scale attack scenarios that would be of interest to terrorists are possible. Due to the automated nature of smartbots, it is highly likely that any attack designed to subtly manipulate prices in the marketplace would go unnoticed for a long time.

In order to mitigate the risk of attacks, assets that are shared by most trading algorithms should be properly secured. This includes marketplace data, training libraries, initial training simulation environment, and systems used to run the smartbots. Methods for detecting compromised input data should also be deployed to detect and prevent data poisoning and evasion attacks at any scale.
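One simple way to screen input data, shown here only as a sketch, is to compare each new reading against a robust summary of recent readings. The function name, the threshold `k`, and the sample price values below are assumptions made for illustration, and the check is not a method proposed for FLEXIMAREX. A filter of this kind catches crude tampering, but carefully crafted perturbations that stay within the normal range would pass it, so it could only be one layer among several.

```python
from statistics import median


def is_suspicious(history, new_value, k=5.0):
    """Flag new_value if it lies more than k scaled MADs from the recent median."""
    med = median(history)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(x - med) for x in history)
    if mad == 0:
        return new_value != med
    # 1.4826 scales the MAD to match the standard deviation of normal data.
    return abs(new_value - med) > k * 1.4826 * mad


recent_prices = [40.0, 41.0, 39.5, 40.5, 40.2, 39.8, 40.1]  # made-up readings
print(is_suspicious(recent_prices, 40.6))
print(is_suspicious(recent_prices, 55.0))
```

Here the first reading is within the normal spread and is accepted, while the second is far outside it and is flagged.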

Finally, one has to consider the fact that well-trained smartbots can become the target of theft. As such, methods to detect the types of queries used for model stealing should also be deployed where needed.

This threat analysis was conducted by Samuel Marchal (F-Secure Corporation) as part of FLEXIMAR and F-Secure’s Project Blackfin. FLEXIMAR, a project supported by Business Finland, aims to develop a new real-time marketplace for demand-side management of electricity. F-Secure’s Project Blackfin is a multi-year research effort with the goal of applying collective intelligence techniques to the cyber security domain.
