Acquisition and processing
When attending talks about APT, or when giving them, you sometimes hear sentences like “most threat actors are focused on information theft” or “Russia is one of the most active actors in the APT landscape”. But where do all those sentences come from? We have spent a whole night exploiting APT data for fun and (no) profit, in order to provide you with some curiosities, facts, data… that you can use from now on in your APT talks!! 🙂
Since 2019 the folks at ThaiCERT have published the free PDF book “Threat Group Cards: A Threat Actor Encyclopedia”, and they have an online portal (https://apt.thaicert.or.th/cgi-bin/aptgroups.cgi) with all the information regarding APT groups acquired from public sources. In this portal, apart from browsing threat groups and their tools, they present some statistics about threat group activities (source countries, target countries and sectors, most used tools…). Most of these threat groups are considered APT (at the time of this writing, 250 out of 329, with the last database change made on 20 October 2020). But what happens when you need specific statistics or correlations? You can download a JSON file and exploit it yourself:
$ curl -o out.json 'https://apt.thaicert.or.th/cgi-bin/getmisp.cgi?o=g'
But JSON is a modern thing and is hard to handle with awk, one of the Tools from the Gods (https://www.securityartwork.es/2018/04/12/the-tools-of-gods/); so we also download JSON.sh to convert it to a pipeable format:
$ curl -o JSON.sh https://raw.githubusercontent.com/dominictarr/JSON.sh/master/JSON.sh
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed
100 4809 100 4809 0 0 15512 0 --:--:-- --:--:-- --:--:-- 15512
$ chmod +x JSON.sh
$
Now, we parse the JSON file with JSON.sh:
$ cat out.json |./JSON.sh -l > work.txt
Et voilà, we have a file to feel comfortable with. But to feel more comfortable, we split the file into many files, one for each threat actor identified by ThaiCERT (in our main file, by the “values” key):
$ n=`awk -F, 'index($1,"values")>0 {print $2}' work.txt |grep -v value| sort -n|uniq|tail -1`
$ export n
$ for i in $(seq 1 $n);do grep "values\",$i," work.txt >$i.txt;done
$
Please don’t blame us for the efficiency of this one-liner; it will be executed only once. By the time you read this line, we have a single text file for each threat actor:
$ ls [0-9]*.txt |wc -l
327
$
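If the efficiency does bother you, the same split can be done in a single awk pass. This is only a sketch: the four-line work.txt below is made up so the block runs on its own, and it assumes the tab-separated `["values",N,...]` layout that JSON.sh -l produces. Unlike the loop above, it also writes 0.txt, because JSON.sh array indices start at 0.

```shell
# Sample input, made up for illustration (real data: ./JSON.sh -l < out.json).
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\t%s\n' \
    '["name"]' '"Threat actors"' \
    '["values",0,"value"]' '"Alpha"' \
    '["values",0,"meta","country"]' '"XX"' \
    '["values",1,"value"]' '"Beta"' > work.txt

# One pass: the second comma-separated field is the array index, i.e. the card number.
awk -F, '$1 == "[\"values\"" && $2 ~ /^[0-9]+$/ {
    f = $2 ".txt"
    print >> f    # append, so closing and reopening the file never truncates it
    close(f)      # keep the number of open files low with hundreds of cards
}' work.txt

ls [0-9]*.txt | wc -l
```

Because the files are opened in append mode, run it in an empty directory (or remove the old [0-9]*.txt files first).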
Each of the text files is composed of entries of the form “[key] value”; for example:
$ cat 98.txt
["values",98,"value"] "DustSquad, Golden Falcon"
["values",98,"description"] "(Kaspersky) For the last two years we have been monitoring a Russian-language cyberespionage
actor that focuses on Central Asian users and diplomatic entities. We named the actor DustSquad and have provided private
intelligence reports to our customers on four of their campaigns involving custom Android and Windows malware. In this
blogpost we cover a malicious program for Windows called Octopus that mostly targets diplomatic entities.\n\nThe name
was originally coined by ESET in 2017 after the 0ct0pus3.php script used by the actor on their old C2 servers. We also
started monitoring the malware and, using Kaspersky Attribution Engine based on similarity algorithms, discovered that
Octopus is related to DustSquad, something we reported in April 2018. In our telemetry we tracked this campaign back to
2014 in the former Soviet republics of Central Asia (still mostly Russian-speaking), plus Afghanistan."
["values",98,"meta","synonyms",0] "DustSquad"
["values",98,"meta","synonyms",1] "Golden Falcon"
["values",98,"meta","synonyms",2] "APT-C-34"
["values",98,"meta","synonyms",3] "Nomadic Octopus"
["values",98,"meta","attribution-confidence"] "50"
["values",98,"meta","country"] "RU"
["values",98,"meta","motivation",0] "Information theft and espionage"
["values",98,"meta","date"] "2014"
["values",98,"meta","cfr-target-category",0] "Defense"
["values",98,"meta","cfr-target-category",1] "Government"
["values",98,"meta","cfr-target-category",2] "Media"
["values",98,"meta","cfr-suspected-victims",0] "Afghanistan"
["values",98,"meta","cfr-suspected-victims",1] "Kazakhstan"
["values",98,"meta","refs",0] "https://apt.thaicert.or.th/cgi-bin/showcard.cgi?u=982ea477-0c28-490e-87d6-3f43da257cae"
["values",98,"meta","refs",1] "https://securelist.com/octopus-infested-seas-of-central-asia/88200/"
["values",98,"meta","refs",2] "https://www.zdnet.com/article/extensive-hacking-operation-discovered-in-kazakhstan/"
["values",98,"related",0,"dest-uuid"] "e74394ee-e4ab-4642-aca4-fa84d0dcabbf"
["values",98,"related",0,"tags",0] "estimative-language:likelihood-probability=\"almost-certain\""
["values",98,"related",0,"type"] "uses"
["values",98,"related",1,"dest-uuid"] "3d3bf55f-402e-4122-a52b-196aed8e6507"
["values",98,"related",1,"tags",0] "estimative-language:likelihood-probability=\"almost-certain\""
["values",98,"related",1,"type"] "uses"
["values",98,"related",2,"dest-uuid"] "7ff6da6a-d13a-42db-91ac-ac6c3915f3d0"
["values",98,"related",2,"tags",0] "estimative-language:likelihood-probability=\"almost-certain\""
["values",98,"related",2,"type"] "uses"
["values",98,"uuid"] "982ea477-0c28-490e-87d6-3f43da257cae"
$
Now everything is ready to start parsing the files and getting results. Let’s go!
Analysis: silly and simple questions
Once we have processed the gathered information, we can start our analysis by asking the silly and simple questions we so often wonder about. Let’s go.
Which groups have the most synonyms?
The silliest question I always wonder about is why we use so many names for the same actor. Which group has the most names? Let’s see:
$ for i in [0-9]*.txt; do c=`grep synonyms\", $i|grep -vi operation|wc -l `; echo $c $i;done |sort -n|tail -1
18 233.txt
$
The result is “233.txt”, which corresponds to APT28, with 18 synonyms; the second one in the ranking, with 16 names, is Turla. Coincidentally, both of them are from Russia (we’ll see some curiosities about Russia later).
Apart from that, a personal opinion: 18 names for the same group! Definitely, once again, we need a standard for threat actor names. This can be your first sentence when giving a talk about APT: where is an ISO committee when it’s needed?
Which groups are from my country?
Well, beyond the well-known actors… how many groups are from my country? The ISO 3166-1 country code for Spain is ES, so let’s look for Spanish threat actors with a simple command, as well as threat actors from other relevant countries:
$ grep \"country\" [0-9]*.txt|grep -w ES
$ grep \"country\" [0-9]*.txt|grep -w DE
$ grep \"country\" [0-9]*.txt|grep -w IL
183.txt:["values",183,"meta","country"] "US,IL"
$
No identified groups from Spain… well, I’m sure this has a technical explanation: Spanish groups are so stealthy that they are difficult to discover, and their OPSEC is so strong that, if they are discovered, attribution is impossible. For sure! But what about Germany? Where is your Project Rahab now? And what about Israel, with only a sad co-starring role alongside the US? Yes, it’s Stuxnet, but only a single appearance… I hope you are as good as the Spanish groups: nobody can discover you, and attribution is impossible 🙂 Another sentence for your APT talks: among the stealthiest countries we can find Germany, Israel… or Spain.
Does any threat group have a clear attribution?
The answer, exploiting our data, is simple: NO. All groups have a “50” attribution confidence.
$ grep attribution-confidence [0-9]*.txt|awk '{print $2}'|grep -v ^\"50\"
$
One moment… this is an error. How can the FBI show folks from Russia, China or Iran in their “most wanted” posters without a clear attribution? Attribution matters, and I personally think some groups (like APT28, my favorite one) should have a higher attribution confidence value.
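An empty result only tells us that nothing differs from “50”; a tally of every value makes the same point more explicitly. A minimal sketch, run here on two made-up cards in the JSON.sh layout so it is self-contained:

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
# Two made-up cards in the JSON.sh layout (key, tab, value).
printf '%s\t%s\n' '["values",1,"meta","attribution-confidence"]' '"50"' > 1.txt
printf '%s\t%s\n' '["values",2,"meta","attribution-confidence"]' '"50"' > 2.txt

# Count how many cards carry each attribution-confidence value.
tally=$(awk -F'\t' '/"attribution-confidence"\]/ {
    gsub(/"/, "", $2)      # strip the JSON quotes from the value
    n[$2]++
} END { for (v in n) print v, n[v] }' [0-9]*.txt)
echo "$tally"
```

On the real card files, a single output line starting with 50 would confirm that no group has any other confidence value.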
Simple questions
Once we have answered our silliest questions, it’s time to ask some simple ones… in the first place:
Which are the most targeted countries? And the most targeted sectors?
In this first case, no analysis has to be performed, as ThaiCERT directly shows these statistics in their portal. The most targeted country is the US, followed by the UK. Who could have imagined that? 🙂 And the most targeted sectors are Government and Defense. Also a big surprise…
Which are the most active countries?
As before, no analysis has to be performed. No surprises here: China, Russia and Iran are the most active countries, in this order, followed by North Korea.
Which is the most analyzed threat group?
A simple query gives you the answer:
$ for i in [0-9]*.txt; do c=`grep refs\", $i|wc -l `; echo $c $i;done |sort -n|tail -1
69 233.txt
$
The result is “233.txt”, which corresponds to APT28, with 69 references in the database (remember, APT28 was also the group with the most synonyms… both facts are obviously related); the second one in the ranking, with 58 references, is Lazarus.
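The synonyms and references rankings follow the same pattern, so they can share a small helper. The function name rank_by_key is hypothetical, and the two cards below are made up so the sketch runs on its own:

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
# Made-up cards: card 1 has two references, card 2 has three.
for n in 0 1; do
    printf '["values",1,"meta","refs",%s]\t"https://example.org/a%s"\n' "$n" "$n"
done > 1.txt
for n in 0 1 2; do
    printf '["values",2,"meta","refs",%s]\t"https://example.org/b%s"\n' "$n" "$n"
done > 2.txt

# rank_by_key KEY: print "count file" for every card, lowest count first.
rank_by_key() {
    for f in [0-9]*.txt; do
        printf '%s %s\n' "$(grep -c "\"$1\"," "$f")" "$f"
    done | sort -n
}

top=$(rank_by_key refs | tail -1)
echo "$top"
```

Using `tail -2` instead of `tail -1` would also show the runner-up. Note that the synonyms query above additionally filters out names containing “operation”, which this helper does not do.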
Which are the oldest threat groups? How are threat group discovery/activity dates distributed over time?
This is a more interesting question than previous ones… Let’s construct and print a simple associative array from our data:
$ grep \"meta\",\"date\"\] [0-9]*.txt|awk '{print $2}'|sed 's/\"//g'|awk '{a[$0]++}END{for(k in a){print k,a[k]}}' >years
$
Now we plot the file to see results:
gnuplot> set boxwidth 0.5
gnuplot> plot 'years' with boxes
We can see two clear outliers, one dated 1919 and the other dated 1947; the first one is UK GCHQ and the second one is US CIA, and both dates are when those services were established. As no other group is treated this way (for example, the Sofacy/APT28 “date” is not set to 1942, the GRU one), we could adjust those dates to more realistic ones; but as this is not an IEEE paper about anomaly detection, just a simple blog post, it’s faster to simply remove both txt files and re-run to get our results (and set xtics to 1 in gnuplot):
We can see the oldest APT group is dated 1996; looking at our txt files, this group is Turla, which started its activities 24 years ago. Five years later, in 2001, Equation Group officially started to operate (but we all suspect this is probably not true, and that they started earlier).
The number of identified groups operating since 2010 is growing fast; 2018 is the year when most groups are dated, a total of 33.
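If gnuplot is not at hand, a rough text histogram can be drawn with awk alone. This is a sketch on made-up “year count” pairs in the same format as the years file:

```shell
# Made-up "year count" pairs; on real data, replace the printf with: cat years
hist=$(printf '%s\n' '2016 3' '2014 1' '2018 5' | sort -n | awk '{
    bar = ""
    for (i = 0; i < $2; i++) bar = bar "#"   # one mark per group dated that year
    print $1, bar
}')
echo "$hist"
```

The `sort -n` is needed because awk's `for (k in a)` loop, used to build the years file, does not guarantee any particular order.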
Which are the main motivations for APT groups?
Again, associative arrays are our friends:
$ grep \"motivation\", [0-9]*.txt |awk -F\" '{print $8}'|sed 's/\"//g'|awk '{a[$0]++}END{for(k in a){print k,a[k]}}'
Financial gain 32
Information theft and espionage 216
Sabotage and destruction 14
Financial crime 50
$
No hacktivism, no surprise… As we suspected, most threat groups are focused on CNE operations rather than on CNA ones… We’ll focus later on the latter, those threat groups with destructive or manipulation capabilities…
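For a slide, percentages may read better than raw counts. The sketch below turns the four counts above into shares of the 312 motivation entries; keep in mind these are entries, not groups, since a card can list more than one motivation.

```shell
# Counts copied from the query output above, as "name|count".
counts='Financial gain|32
Information theft and espionage|216
Sabotage and destruction|14
Financial crime|50'

shares=$(printf '%s\n' "$counts" | awk -F'|' '
    { name[NR] = $1; n[NR] = $2; total += $2 }
    END { for (i = 1; i <= NR; i++) printf "%s %.1f%%\n", name[i], 100 * n[i] / total }')
echo "$shares"
# Financial gain 10.3%
# Information theft and espionage 69.2%
# Sabotage and destruction 4.5%
# Financial crime 16.0%
```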
Analysis: (not so) silly and simple questions
Once we have answered some silly & simple questions, it’s time to ask more complex ones, so let’s imagine…
Have CNA threat actors increased their activities in recent years?
In the simple questions, we concluded that sabotage and destruction motivations are not the most common among threat groups. But we are interested in these ones. Let’s see them over time:
$ for i in `grep "Sabotage and destruction" [0-9]*.txt|awk -F: '{print $1}'`; do grep \"meta\",\"date\"\] $i|awk '{print $2}'|sed 's/\"//g';done|awk '{a[$0]++}END{for(k in a){print k,a[k]}}' >years.cna
Plotting the results, we have:
gnuplot> set boxwidth 0.5
gnuplot> set xtics 1
gnuplot> set ytics 1
gnuplot> set yrange [0:5]
gnuplot> plot 'years.cna' with boxes
Since 2012, the number of these threat actors has increased significantly: 9 out of 14 groups in the last eight years, so we can say it’s a growing trend. Out of curiosity, the oldest group with CNA capabilities is dated 2001. Can you guess its name? Yeah… Equation Group.
Which countries have the most CNA capabilities?
Let’s look for the main hostile countries performing destructive or manipulation operations:
$ for i in `grep "Sabotage and destruction" [0-9]*.txt|awk -F: '{print $1}'`; do grep \"meta\",\"country\"\] $i|awk '{print $2}'|sed 's/\"//g';done|awk '{a[$0]++}END{for(k in a){print a[k],k}}'|sort -n
1 KP
1 US
1 US,IL
2 IR
7 RU
$
Russia has seven identified threat groups performing those operations; far, far away from Iran, with only two threat groups… Without any doubt, Russia is the champion in this ranking!!!
And what about cyber crime? What about threat actors focused on pure economic interests?
We can perform a query similar to the previous one to get these results:
1 "BY"
1 "BY"
1 "IR"
1 "IT"
1 "KZ"
1 "PK"
1 "RO"
1 "SA"
1 "UA"
2 "US"
3 "KP"
6 "CN"
27 "RU"
Once again, the gold medal goes to Russia, this time when talking about cybercrime groups.
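The exact query behind that list is not shown, so here is one way it might look, wrapped in a hypothetical helper that takes the motivation string as a parameter. Which motivation the original query matched (“Financial crime”, “Financial gain” or both) is an assumption; the cards and country codes below are made up so the sketch runs on its own.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
# Made-up cards in the JSON.sh layout; AA and BB are placeholder country codes.
mkcard() {   # mkcard N COUNTRY MOTIVATION
    printf '["values",%s,"meta","country"]\t"%s"\n' "$1" "$2"
    printf '["values",%s,"meta","motivation",0]\t"%s"\n' "$1" "$3"
}
mkcard 1 AA 'Financial crime' > 1.txt
mkcard 2 AA 'Financial crime' > 2.txt
mkcard 3 BB 'Financial crime' > 3.txt
mkcard 4 BB 'Information theft and espionage' > 4.txt

# countries_by_motivation STRING: count cards per country among those mentioning STRING.
countries_by_motivation() {
    for f in $(grep -l "$1" [0-9]*.txt); do
        awk -F'\t' '/"meta","country"\]/ { gsub(/"/, "", $2); print $2 }' "$f"
    done | sort | uniq -c | awk '{ print $1, $2 }' | sort -n
}

ranking=$(countries_by_motivation 'Financial crime')
echo "$ranking"
```

Calling it with 'Sabotage and destruction' should reproduce the CNA ranking shown earlier.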
So Russia is the champion… can we focus on its information needs?
Sure. Let’s see which sectors and countries are targeted by Russian actors. First, look at the target sectors:
$ cat russia.sh
#!/bin/sh
for i in [0-9]*.txt; do
    grep -w RU $i >/dev/null
    if [ $? -eq 0 ]; then
        grep cfr-target-category $i
    fi
done |awk -F\" '{a[$8]++}END{ for(k in a){print a[k],"\x22"k"\x22"}}'|sort -n
$ ./russia.sh >temp
$
Now let’s prepare our data to be plotted:
$ awk '{print k++,$2,$3,$4,$1}' temp >sectors.ru #sorry for the quick hack
$
gnuplot> set boxwidth 0.5
gnuplot> set xtics rotate by 45 right
gnuplot> unset key
gnuplot> set style fill solid
gnuplot> set title "Russian target sectors"
gnuplot> plot "sectors.ru" using 1:3:xtic(2) with boxes
As we can see, the main targets of the Russian Federation are the financial, government, defense, energy and media/education sectors. The “media” sector as a target is very curious… or is it?
Now let’s look at the countries; modifying our script, and looking only for countries that have been targeted by at least five groups -simply for graphical reasons-, we get the following graph:
The first Russian target is… itself!! Well, it may not be a surprise if we dig into Russian intelligence (remember our older posts about the Russian Cyber Intelligence Community??). After Russia, we can confirm the Russian geographical areas of interest: mainly ex-USSR republics and NATO. Well, not a surprise if you know anything about Russian intelligence.
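The modified script for countries is not shown; the following is a sketch of how it could look, not the script actually used. The helper name victims_of is hypothetical, it matches the country field of each card (rather than the word RU anywhere in the file), and the cards are made up so the block runs on its own.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
# Made-up cards; AA, BB and the victim names are placeholders.
mkcard() {   # mkcard N COUNTRY VICTIM...
    n=$1
    printf '["values",%s,"meta","country"]\t"%s"\n' "$n" "$2"
    shift 2
    i=0
    for v in "$@"; do
        printf '["values",%s,"meta","cfr-suspected-victims",%s]\t"%s"\n' "$n" "$i" "$v"
        i=$((i + 1))
    done
}
mkcard 1 AA 'North Land' 'South Land' > 1.txt
mkcard 2 AA 'North Land' > 2.txt
mkcard 3 BB 'North Land' > 3.txt

# victims_of CC MIN: victims named by at least MIN cards whose country field contains CC.
victims_of() {
    for f in [0-9]*.txt; do
        grep -q "\"meta\",\"country\"].*[\",]$1[\",]" "$f" || continue
        awk -F'\t' '/"cfr-suspected-victims"/ { gsub(/"/, "", $2); print $2 }' "$f"
    done | sort | uniq -c | sort -n |
        awk -v min="$2" '$1 >= min { c = $1; $1 = ""; sub(/^ +/, ""); print c, $0 }'
}

result=$(victims_of AA 2)
echo "$result"
```

With the real data, `victims_of RU 5` would apply the same “at least five groups” threshold used for the graph.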
Which countries have entered the APT arena in recent years?
First, we generate the datafile extracting the country and year from every threat group card -and labeling them with a sequential number in order to plot-:
#!/bin/sh
for i in [0-9]*.txt; do
    c=`grep \"country\"\] $i|awk '{print $2}'|sed s/\"//g`
    y=`grep \"date\"\] $i|awk '{print $2}'|sed s/\"//g`
    if [ ! -z "$c" ] && [ ! -z "$y" ]; then
        echo $c $y
    fi
done | awk 'BEGIN{k=1}{if (a[$1]=="") {a[$1]=k++} ; {print $2" "a[$1]" "$1 }}'
Now let’s draw our work:
As we can see, during the last ten years KP (North Korea) and, especially, IR (Iran) have been particularly active, increasing their activities, together with the usual actors (China, Russia or the US). Other countries that were active during the first five years of the decade (SY, Syria, or IN, India, for example) now seem less active, or at least their new threat groups are not being discovered. One detail: the old groups from these countries may still be active today… a little detail we’ll comment on shortly.
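To put a number on “entering the arena”, we can also list the earliest year recorded for each country. A sketch on made-up “country year” pairs, the same pairs the script above produces before numbering them:

```shell
# Made-up "country year" pairs; AA, BB and CC are placeholder country codes.
first_seen=$(printf '%s\n' 'AA 2012' 'BB 2017' 'AA 2009' 'CC 2019' 'BB 2018' |
    awk '{
        y = $2 + 0                       # force a numeric comparison
        if (!($1 in first) || y < first[$1]) first[$1] = y
    } END { for (c in first) print first[c], c }' | sort -n)
echo "$first_seen"
```

The last lines of the sorted output are the most recent newcomers, with the usual caveat that the date field marks when a group was first dated, not when its country started operating.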
I work for a Fortune 500 company. Can I have a magic quadrant for APT groups?
Sure. Gartner does its research to position technology players within a specific market and represent them in a Magic Quadrant (https://www.gartner.com/en/research/methodologies/magic-quadrants-research). These quadrants classify each player into four categories (leaders, visionaries, niche players and challengers) by analyzing its “ability to execute” and its “completeness of vision”.
For our quadrant, let’s consider “Ability to execute” as the period each actor has been active, and “Completeness of vision” as the diversity of its targets. Why these criteria? We can consider (well, more or less… remember this is not an IEEE paper but a blog post!) that the ability to execute can be estimated by the years a threat actor has been active (that is, has been executing operations). This is an interesting point: the ThaiCERT data marks only the “foundation” date of each APT group, not the period it has been active. A “last time seen” field would be needed to estimate a real ability to execute.
On the other hand, the completeness of vision is calculated from the number of targets a threat group has, counting both countries and sectors. A simple criterion: the more targets you have, the more complete your vision is… perhaps not exact from an academic point of view, but remember what we said about the IEEE paper 🙂
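As a worked example of both criteria, take the DustSquad card shown earlier: it is dated 2014 and lists three cfr-target-category entries and two cfr-suspected-victims entries. A minimal sketch of the arithmetic, assuming 2020 (the year of the database snapshot) as the current year and printing the result as name:country:completeness:ability:

```shell
year=2020        # assumption: year of the database snapshot used in this post
date=2014        # "date" field of the DustSquad card
sectors=3        # Defense, Government, Media
countries=2      # Afghanistan, Kazakhstan

ability=$((year - date))                 # years since the group was first dated
completeness=$((sectors + countries))    # number of target sectors plus target countries
row="DustSquad:RU:$completeness:$ability"
echo "$row"
# DustSquad:RU:5:6
```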
Following our criteria, we can draw the first version of our Magic Quadrant; first, we can write a simple script to get the data, extracting for each threat actor its name, country (later we’ll see why we are interested in the country), number of sectors and target countries and years active:
$ cat extract.sh
n=`ls [0-9]*.txt|wc -l`
for i in $(seq 1 $n);do
    t=`grep "$i,\"value" $i.txt|awk -F"\t" '{print $2}'|sed 's/\"//g' `
    name=`echo $t|awk -F, '{print $1}'`
    country=`grep \"country\"] $i.txt|awk '{print $2}'|sed 's/\"//g' `
    date=`awk 'index($1,"date")>0 {print $0}' $i.txt |awk '{print $2}'|sed 's/\"//g' `
    y=`date +%Y`
    ability=`expr $y - $date`
    sectors=`grep -w cfr-target-category $i.txt|wc -l`
    countries=`grep -w cfr-suspected-victims $i.txt|wc -l`
    completeness=`expr $sectors + $countries`
    echo $name:$country:$completeness:$ability|awk -F: '{if($3>0 && length($2>0) && $4>=0){print $0}}'
done
$ ./extract.sh >data 2>/dev/null
$
This script generates an output with the following format:
$ head data
Aggah::31:2
Allanite::3:3
APT 3:CN:17:13
APT 4:CN:6:13
APT 5:CN:5:13
APT 6:CN:2:9
APT 12:CN:9:11
APT 16:CN:7:5
APT 17:CN:21:11
APT 19:CN:12:7
$
Let’s format this file:
$ awk -F: '{print $3" "$4" ""\x22"$1"\x22"}' data >quadrant
$
And let’s also make a “nice” magic quadrant:
$ cat quadrant.plot
set title "APT groups"
set xlabel "Completeness of vision"
set ylabel "Ability to execute"
set format y ""
set format x ""
unset key
set parametric
set arrow 1 from 40,0 to 40,25 nohead
set arrow 2 from 0,12.5 to 80,12.5 nohead
plot 'quadrant' w labels point pt 7 offset char 1,1
$
So it’s done, as we can see:
Definitely not a nice Magic Quadrant suitable for our marketing team, but good enough to draw interesting conclusions: Turla is a LEADER, as are APT28 and Equation Group. Now you can say in your conference why Russia is the champion (remember, also when talking about CNA): two of its groups are in the upper right side of the Magic Quadrant.
But among the Russian groups, the Champions League, what would this magic quadrant look like? As you remember, we included the country code for each group in our previous script; this is useful to draw national magic quadrants. For example, the Russian one:
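If you prefer a list to a picture, the same classification can be computed straight from the data file. This sketch reuses the dividing lines drawn in quadrant.plot (40 for completeness of vision, 12.5 for ability to execute) and the usual Gartner layout: leaders in the upper right, challengers in the upper left, visionaries in the lower right and niche players in the lower left. The first three sample rows come from the output above; the last one is made up to show a leader.

```shell
# name:country:completeness:ability rows; "Example" is a made-up entry.
quadrants=$(printf '%s\n' 'Aggah::31:2' 'APT 3:CN:17:13' 'APT 17:CN:21:11' 'Example:XX:45:20' |
    awk -F: -v vx=40 -v vy=12.5 '{
        if ($3 >= vx) q = ($4 >= vy) ? "leader" : "visionary"
        else          q = ($4 >= vy) ? "challenger" : "niche player"
        print $1 ": " q
    }')
echo "$quadrants"
# Aggah: niche player
# APT 3: challenger
# APT 17: niche player
# Example: leader
```

On the real file, replace the printf with `cat data`, and adjust vx and vy whenever the arrows in the plot file are moved.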
$ awk -F: '$2=="RU" {print $3" "$4" ""\x22"$1"\x22"}' data >quadrant.russia
$
Changing the title in our previous .plot file, and loading this new data file, we can get the Russian APT magic quadrant:
Please note that, as Turla is a Russian group and was the clear leader in the previous global magic quadrant, and there are also Russian groups in the lower side, there is no need to change the plot parameters; if we try the same with Chinese groups, a small adjustment has to be made to get this result:
NetTraveler is the leader here; operating since 2004, and with 4 target sectors among 41 target countries, it’s definitely a robust threat group. Resilient, as we can call them now 🙂
Can I have more sentences for my APT talks?
With a little help from AWK and gnuplot you can generate your own statistics, magic quadrants for your favorite country or any other data or correlation you may need. Apart from that, ThaiCERT maintains another JSON file with data related to the tools used by threat actors, so enjoy!!
Conclusions
Now you have some evidence-based tips for your APT talks (don’t forget to use these tips together with Sun Tzu’s “Art of War” quotes); with some more time, you can reach more stupid or interesting conclusions about threat group activities, interests and origins. And by exploiting other datasets (MITRE ATT&CK, here we go!) we can expand those conclusions.
Some key points we can conclude from this little analysis:
- It seems clear that Russia plays in the APT Champions League. It’s the most active country in all kinds of threat activities, from sabotage to espionage or financial gain.
- The threat group leader is also a Russian one: Turla, operating for almost a quarter of a century (in this case we can confirm it’s still active) and with targets from a long list of countries and sectors.
- The threat group most loved by analysts is also a Russian one: APT28. Maybe that is why it is the threat group with the most synonyms.
- The number of threat actors with CNA capabilities has increased in recent years, once again with Russia leading the ranking.
- Apart from the classical players, two actors have been particularly active in recent years: Iran and North Korea.
- It would be interesting to have a parameter for threat groups, something like “last time seen”, in order to calculate the years a group has been active.
- Using different, vendor-dependent names for the same threat actor causes some chaos when analyzing data. In this sense, a good effort is MISP’s UUID for each group (https://github.com/MISP/misp-galaxy/blob/main/clusters/threat-actor.json#L2434), as @adulau noted.
- With some imagination and gnuplot you can have your own APT Magic Quadrant for marketing purposes.
- Disclaimer: this is just a simple blog post, not a scientific paper, so don’t expect unquestionable sentences here!
- And the most important conclusion: AWK is your friend. Remember: