In [119]:
from adsy.display import extended_styles
extended_styles()

Event-Driven Cyber

Ultimately, we would like to model a within-perimeter attack and detection scenario on a large cyber network. We will use recurrent (LSTM) neural networks to solve for both the attacker and defender strategies. As we know, computational "gotchas" can arise in many parts of this model, so we are going to build the model piece by piece. The plan of attack is as follows:

1. In a two node network, solve for the optimal defender strategy for a fixed attacker strategy. (Done)

2. In a two node network, solve for optimal attacker strategy for a fixed defender strategy. (Done)

3. Using fictitious play, solve for equilibrium in a two node network. (Done)

4. In a multi-node network, solve for optimal defender strategy for a fixed attacker strategy

5. In a multi-node network, solve for optimal attacker strategy for a fixed defender strategy

6. In a multi-node network, solve for equilibrium using fictitious play or some other learning algorithm

7. Scale, scale, scale, scale

Two Node Network

In this model, there are only two nodes; call them A and B. Node A sends benign messages to B, where inter-message times are distributed exponentially with rate $\lambda_b$. The scenario starts at time $T=0$ without an attacker present in the network. An attacker enters the network at a time determined by an exponential random variable with rate $\lambda_e$. Once the attacker enters the network, it sends malicious messages at rate $\lambda_m$.

The defender continually accumulates a reward of $r$ per unit time the network is running and incurs a cost of $c_m$ every time the attacker sends a message. All costs and rewards are discounted at a rate of $\delta$.
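To make this concrete, here is one way to write the defender's realized payoff (a sketch, assuming the alarm is sounded at time $\tau$, malicious messages arrive at times $t_1, t_2, \dots$, and discounting is exponential at rate $\delta$):

$$ U_D(\tau) \;=\; \int_0^{\tau} r\, e^{-\delta t}\, dt \;-\; \sum_{i:\, t_i \le \tau} c_m\, e^{-\delta t_i} \;=\; \frac{r}{\delta}\left(1 - e^{-\delta \tau}\right) \;-\; \sum_{i:\, t_i \le \tau} c_m\, e^{-\delta t_i}. $$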

Every time a message goes from A to B, the defender chooses either to keep the network running or to sound an alarm and shut down the network. Effectively, the defender wants to keep the network running as long as possible (so that it accumulates $r$) but shut it down as soon as the attacker enters.
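To fix ideas, here is a minimal Monte Carlo sketch of this process for a defender that ignores its observations and sounds the alarm at a fixed time. The function and parameter names here are hypothetical (not the `two_node_proto` API), and benign traffic is omitted because it does not affect the payoff once the alarm time is fixed:

In [ ]:
import numpy as np

rng = np.random.default_rng(0)

def fixed_alarm_payoff(le=0.5, lm=1.25, r=1.0, cm=1.2, delta=0.1, tau=10.0):
    """Discounted payoff of one episode when the alarm is sounded at time tau."""
    # discounted flow reward from running the network until tau
    payoff = r * (1.0 - np.exp(-delta * tau)) / delta
    # attacker enters at an exponential time, then sends messages at rate lm
    t = rng.exponential(1.0 / le)
    while True:
        t += rng.exponential(1.0 / lm)      # time of the next malicious message
        if t >= tau:
            break
        payoff -= cm * np.exp(-delta * t)   # discounted cost of that message
    return payoff

# crude estimate of the expected payoff of always stopping at tau = 10
print(np.mean([fixed_alarm_payoff() for _ in range(5000)]))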

1. Optimal Defender Strategy for Fixed Attacker Strategy

First, let's look at the defender's performance as we sweep through various parameter values.

In [1]:
import plot_tools as pt
%matplotlib inline
r = pt.Results()
fig, ax = r.plot_convergence()

Before looking at the results, note that in this experiment I varied $\lambda_m$ and $c_m$ while holding the product $\lambda_m c_m$ constant. Intuitively, this sweeps between an attacker that sends relatively harmless messages very quickly and one that sends very harmful messages very slowly.
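For reference, a sweep with a constant product could be set up like the sketch below (hypothetical; only the $(1.25, 1.2)$ and $(0.3125, 4.8)$ pairs are used later in this notebook, and the middle rate is purely illustrative):

In [ ]:
import numpy as np

product = 1.5                                # lambda_m * c_m held fixed
lambda_ms = np.array([1.25, 0.625, 0.3125])  # illustrative message rates
c_ms = product / lambda_ms                   # cost per message rises as the rate falls
sweep = list(zip(lambda_ms, c_ms))           # [(1.25, 1.2), (0.625, 2.4), (0.3125, 4.8)]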

The figure shows that the algorithm learns rather quickly (convergence at around epoch 75) for each parameter specification. It is easy to verify that the top three lines are indeed at the optimum for their parameter specifications. Although this is not as clear for the other specifications, we can check some other plots to verify our intuition.

First, let's look at how the false alarm rate varies with the parameters.

In [2]:
fig, ax = r.plot_false_alarms()

This is exactly as expected. When the malicious messages are very frequent but essentially costless, the defender wants to be quite sure that the attacker is present in the network before sounding the alarm. On the other hand, when malicious messages are rare but relatively more costly, the defender is willing to sound an alarm with less certainty regarding the presence of an attacker.

For yet another sanity check, let's see how well the defender would do if it ignored the observations entirely and just shut down the network at a fixed time.

In [6]:
import two_node_proto as proto
import numpy as np
parameters = proto.parameters
parameters["lm"] = 1.25  # attacker's malicious-message rate
parameters["cm"] = 1.2   # cost per malicious message
defender = proto.Defender(parameters)
eus = []
# expected utility when the defender always sounds the alarm at time e
for e in np.arange(.25, 20, .5):
    eus.append(defender.expected_reward_at_time(e, 10000))
In [24]:
from matplotlib import pyplot as plt
fig, ax = plt.subplots(figsize=(12,8))
ax.plot(np.arange(.25,20,.5), eus, label="Expected Utility at Fixed Alarm Time")
ax.set_xlabel("Stopping Time")
ax.set_ylabel("Expected Utility")
ax.plot([0,20],[r.results[3].Rewards.values[-1]]*2, label="Expected Utility Under LSTM Policy")
ax.legend(loc=4);

In the above plot, $\lambda_m=1.25$ and $c_m=1.2$. We see that the LSTM network does better than simply optimizing over the prior; it is actually using the observations to improve its reward. Let's make the same comparison with an even slower rate of attack.

In [25]:
import two_node_proto as proto
import numpy as np
parameters = proto.parameters
parameters["lm"] = .3125
parameters["cm"] = 4.8
defender = proto.Defender(parameters)
eus2 = []
for e in np.arange(.25,20,.5):
    eus2.append(defender.expected_reward_at_time(e, 10000))
In [28]:
fig, ax = plt.subplots(figsize=(12,8))
ax.plot(np.arange(.25,20,.5), eus2, label="Expected Utility at Fixed Alarm Time")
ax.set_xlabel("Stopping Time")
ax.set_ylabel("Expected Utility")
ax.plot([0,20],[r.results[5].Rewards.values[-1]]*2, label="Expected Utility Under LSTM Policy")
ax.legend(loc=4);

In the above plot, $\lambda_m=.3125$ and $c_m=4.8$. We see that even with a very slow rate of attack, the LSTM policy does better than the fixed policy, even though the signal-to-noise ratio of the observations is relatively low.

I think this is convincing enough to move on to solving for the attacker's policy against a fixed defender policy. We can of course always run more parameter sweeps and comparisons. However, note that the parameter specifications used here are among the hardest: $\lambda_m \times c_m$ is relatively close to $r$, which means that the payoff from sounding an alarm and the payoff from letting the network run are very "close" for any given belief of the defender, yet the algorithm still performs well.
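As a quick check of that claim: once the attacker is present, malicious messages arrive as a Poisson process with rate $\lambda_m$, so the expected cost flow per unit time is $\lambda_m c_m$, and both specifications above share the same value,

$$ 1.25 \times 1.2 \;=\; 0.3125 \times 4.8 \;=\; 1.5 . $$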

2. Optimal Attacker Strategy for Fixed Defender Strategy

In this experiment, we will train an attacker against a fixed defender strategy. To do so, we first have to specify what that fixed strategy should be. Here, we fix the defender's strategy to be the best response to an attacker that sends a malicious message at every opportunity, with opportunities arriving at rate $\lambda_m=5$. The defender incurs a cost of $c_m=.5$ per malicious message.

Against this defender, the baseline attacker (sending at every opportunity) earns an average reward of $4$; that is, on average it sends 4 messages before it is detected. The goal is to see if we can now train an attacker to achieve a higher reward. Presumably, if the attacker staggered its messages instead of sending one at every opportunity, it would be able to "hide" in the normal network activity better. Let's see if that is the case.

In [27]:
import pandas as pd
df = pd.read_csv("./results/a1/attacker1.csv", index_col=0)
df.columns = ["Expected Reward", "Time Between Messages"]
ax1, ax2 = list(df.plot(subplots=True, figsize=(14,6)))
ax2.set_xlabel("Epoch")
ax1.set_ylabel("Expected Reward")
ax2.set_ylabel("Average Time Between Attacker Messages")
Out[27]:
<matplotlib.text.Text at 0x7f85542d1c10>

The attacker learns to wait! In the process, its expected utility goes to 10 and there are virtually no false alarms (I end the simulation after 100 possible attacker actions). Let's run the experiment one more time just to make sure we didn't get lucky.

In [29]:
df = pd.read_csv("./results/a2/attacker2.csv", index_col=0)
df.columns = ["Expected Reward", "Time Between Messages"]
ax1, ax2 = list(df.plot(subplots=True, figsize=(14,6)))
ax2.set_xlabel("Epoch")
ax1.set_ylabel("Expected Reward")
ax2.set_ylabel("Average Time Between Attacker Messages")
Out[29]:
<matplotlib.text.Text at 0x7f854faccc10>

In this case it learns much faster! So the example above was the worst-case scenario. There is no need to worry about the (slight) differences in expected reward and time between messages across the two experiments, because the testing set in each experiment was small (100 simulations of the POMDP).

In [2]:
#df = pd.read_csv("./results/a3/attacker3.csv", index_col=0)
#df.columns = ["Expected Reward", "Time Between Messages"]
#ax1, ax2 = list(df.plot(subplots=True, figsize=(14,6)))
#ax2.set_xlabel("Epoch")
#ax1.set_ylabel("Expected Reward")
#ax2.set_ylabel("Average Time Between Attacker Messages")

The expected reward is higher. However, this is likely due to the policy capturing more complex dependencies on past events.

3. Equilibrium

We will now use the vanilla version of fictitious play to (hopefully) find a Nash equilibrium of the game. Let's take a look at some of the results. To be sure of the convergence properties, we ran the simulations far longer than necessary: we performed 25 iterations of fictitious play (for each player) with a batch size of 300. To save time during training, we only tested policies every 20 epochs.
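As a quick reminder of what vanilla fictitious play does (each player best-responds to the empirical mixture of the opponent's past play), here is a toy sketch on matching pennies; it only illustrates the update rule and is not the LSTM-based procedure used in these experiments:

In [ ]:
import numpy as np

# payoff matrix for the row player in matching pennies; the column player gets the negative
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

row_counts = np.ones(2)  # empirical counts of the row player's past actions
col_counts = np.ones(2)  # empirical counts of the column player's past actions

for _ in range(10000):
    # each player best-responds to the opponent's empirical mixture
    col_mix = col_counts / col_counts.sum()
    row_action = np.argmax(A @ col_mix)
    row_mix = row_counts / row_counts.sum()
    col_action = np.argmin(row_mix @ A)   # column player minimizes the row player's payoff
    row_counts[row_action] += 1
    col_counts[col_action] += 1

# both empirical mixtures approach the (0.5, 0.5) equilibrium
print(row_counts / row_counts.sum(), col_counts / col_counts.sum())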

In [5]:
from results import *
res = ReadResults("./fpboth/results/exp2/")

First, we want to see whether the defender and attacker strategies converge. However, it is difficult to use the parameters of the LSTM network to determine this. Instead, we will track high-level properties of the strategies and see if those converge as fictitious play proceeds.

In [38]:
%matplotlib inline
ax = (res.falsealarms / 600.).plot()
ax.legend(ncol=4, loc=0)
ax.set_title("False Alarm Rate for various FP iterations")
ax.set_xlabel("Training Epoch (/20)")
ax.set_ylabel("False Alarm Rate");

This graph shows us two things. First, starting from a random policy, the defender's false alarm rate shrinks drastically, which indicates that learning is taking place. Second, the false alarm rate appears to be converging to the same value across fictitious play iterations; the remaining fluctuations are likely due to stochasticity in training.

In [34]:
from matplotlib import pyplot as plt
fig, ax = plt.subplots()
ax.plot(res.falsealarms.tail(1).mean().values.flatten()/600.)
ax.set_xlabel("Final Fictious Play Iteration")
ax.set_ylabel("False Alarm Rate");

It appears that the false alarm rate is stabilizing and that there is no systematic pattern in it as we continue iterating on fictitious play. We can confirm our intuition that the defender's strategy is converging by also looking at its expected utilities.

In [37]:
%matplotlib inline
ax = res.dus.plot()
ax.legend(ncol=4, loc=0)
ax.set_title("Defender Expected Utility for various FP iterations")
ax.set_xlabel("Training Epoch (/20)")
ax.set_ylabel("Expected Utility");

This plot shows that the initial iterations bring some gains in the defender's expected utility. However, after about 6 iterations, the expected utility always converges to about 7. This is enough to convince me that under fictitious play the defender's strategy and payoffs are converging, and we know that if strategies converge under fictitious play, they constitute an equilibrium. We also see that the LSTM parameters converge when we reset the strategies to the same starting values.

We can also look at the attacker.

In [63]:
ax = res.atimebetween.plot()
ax.legend(ncol=4, loc=0)
ax.set_title("Average Time Between Messages for various FP iterations")
ax.set_xlabel("Training Epoch (/20)")
ax.set_ylabel("Time Between Messages");

We see that starting from a random policy, the attacker learns to space out its messages. It has the opportunity to send out a message once every $.1$ time units but learns that the best thing for it to do is to send out a message about once every $.85$ time units.

In [83]:
fig, ax = plt.subplots()
ax.plot(res.atimebetween.tail(1).values.flatten())
ax.set_xlabel("Final Fictious Play Iteration")
ax.set_ylabel("Time Between Messages");

We see that the attacker's strategy settles at about $.825$ time units between messages; the oscillations are likely due to stochasticity in the training data. Lastly, we can look at the attacker's expected utility.

In [97]:
ax = res.aus.plot()
ax.legend(ncol=5, loc=4)
ax.set_title("Attacker's Expected Utility for various FP iterations")
ax.set_xlabel("Training Epoch (/20)")
ax.set_ylabel("Expected Utility");
fig = plt.gcf()
fig.set_size_inches(8.5, 6.5)