Sorting Through Haystacks to Find CTI Needles
Clouded vision
CTI systems face major issues, ranging from the size of their collection networks to their diversity, which ultimately determine how much confidence they can place in their signals. Are the signals fresh enough and reliable enough to avoid false positives or poisoning? Is there a risk of acting on outdated data? The distinction matters, since a piece of information is just a decision helper, whereas a piece of actionable information can be weaponized directly against an aggressor. If raw data are the hayfields, information is the haystacks, and the actionable signals are the needles.
To illustrate the collection networks’ size & variety point, without naming anyone in particular, let’s imagine a large CDN provider. Your role is to deliver, on a massive scale, content over HTTP(s). This attracts a lot of “attention” and signals, but only on the HTTP layer. Also, any smart attacker will probably avoid probing your IP ranges (which are public and known in your AS). Hence, you only receive the indiscriminate “Gatling guns” scanners or direct attacks over an HTTP layer. This is a very narrow focus.
Now if you are a large EDR/XDR or whatever glorified antivirus, you can also argue that you have a huge detection network spanning millions of devices… of wealthy enterprises. Because let’s face it, not every non-profit, public hospital or local library can afford to pay for those tools. Hence you potentially only see threats targeted at sophisticated actors, and mostly the ones carried by malware on LAN machines.
On the honeypot front, there is no silver bullet either. The “Gatling guns scanners” represent the background radioactivity of the Internet. A sort of static noise which is constantly present in the surroundings of any Internet-connected device. Here, the problem is rather that no decent cyber criminal group will use any meaningful resources to target a honeypot machine. What’s the point of investing some DDoS resources in knocking down a straw dummy? Would you use any meaningful exploit or tool, let alone burn your IP, on a “potential” target? Honeypots collect “intentions”, automated exploitation, something along the lines of “this IP wants to know if you’re (still) vulnerable to log4j”.
This can be interesting to a certain extent, but it is limited to low-hanging fruit. Also, your diversity is limited by your capacity to spread across many different places. If all your probes (honeypots) are sitting on ten or, worse, just 3 or 4 different clouds, you can’t see everything, and you can be “dodged”, meaning criminals can deliberately skip your IP ranges to avoid detection. You also need to organize your deployment system for every platform, and yet you’ll only see the IPs that are not dodging GCP, AWS, or whatever cloud you’re working with. And since those providers are not NGOs, your network size is also limited by… money. If a fully automated honeypot running on XYZ cloud costs you $20 monthly, your pockets must be deep to run thousands of them.
Establishing a counter-offensive
To curb the trajectory of mass cyber criminality, we need to act on a resource that is limited in essence; otherwise, you cannot organize a proper “shortage”. The famous Conti leaks cast an interesting light on the actual pain points of a large cybercrime group. Obviously there are (crypto) money laundering, recruitment, and payroll, the classical ones you’d expect. But interestingly enough, when you read the exchanges on their internal chat system, you can see that handling IP addresses (changing, borrowing, renting, and cleaning them, installing the tools, migrating the ops and C2, etc.) is… costly, both time- and money-wise.
There are nearly infinite variations of hashes: SHA-1 alone offers a space of 2^160 possibilities. So collecting them is one thing, but you can be almost sure any new malware variation will have a different signature. As we speak, the CI/CD procedures of any decent cyber criminal group most likely already include the modification of one byte before sending the payload to a target.
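To make the one-byte point concrete, here is a minimal sketch using Python's standard hashlib. The payload bytes and the helper names (sha1_hex, flip_last_byte) are hypothetical and only serve the illustration; real malware builds obviously mutate in more elaborate ways.

```python
import hashlib


def sha1_hex(payload: bytes) -> str:
    """Return the SHA-1 digest of the payload as 40 hex characters."""
    return hashlib.sha1(payload).hexdigest()


def flip_last_byte(payload: bytes) -> bytes:
    """Return a copy of the payload with the lowest bit of its last byte flipped."""
    return payload[:-1] + bytes([payload[-1] ^ 0x01])


# Hypothetical stand-in for a binary; any non-empty byte string works.
original = b"hypothetical payload bytes"
variant = flip_last_byte(original)

print("original:", sha1_hex(original))
print("variant: ", sha1_hex(variant))
print("same signature?", sha1_hex(original) == sha1_hex(variant))
print("size of the SHA-1 space:", 2 ** 160)
```

The two digests share nothing recognizable, which is why a hash-based blocklist entry stops matching as soon as the attacker rebuilds the payload.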
Aiming at domain names also means fighting against an effectively infinite space. You can book domain1, domain2, domain3, etc. There is technically no limit to the number of variations. There are smart systems out there that protect your brand by checking whether any domain names similar to yours have been booked lately. These pre-crime-style systems are very helpful for dealing with an upcoming phishing attempt. You start to be proactive with this kind of stance & tools.
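As a rough illustration of the lookalike check, the sketch below compares newly registered names against a brand domain with a plain string-similarity ratio from Python's difflib. This is an assumption made for the example, not a description of how any commercial brand-protection system works; the function names, the 0.8 threshold and the sample domains are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity ratio between two domain names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def lookalikes(brand: str, newly_registered: list[str], threshold: float = 0.8) -> list[str]:
    """Return registered names that resemble the brand without being the brand itself."""
    return [
        domain
        for domain in newly_registered
        if domain.lower() != brand.lower() and similarity(brand, domain) >= threshold
    ]


# Hypothetical feed of recently booked domains.
feed = ["examp1e.com", "exarnple.com", "unrelated.org", "example.com"]
suspects = lookalikes("example.com", feed)
print(suspects)
```

A real system would also have to handle homoglyphs, extra TLDs and keyword combinations, which a simple ratio does not capture.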
It’s useful anyway to track & index malevolent binaries based on their hashes or the C2 they try to contact, or even to index IPs trying to auto-exploit known CVEs, but doing so is a rather reactive stance. You don’t strike back by knowing the position or tactics of the enemy; you do so by crippling its offensive capabilities, and this is where IP addresses are very interesting. The system is decades old and will still be there after us. It’s
Now there is a resource that actually is scarce: IPv4. The historic IP space is limited to around 4 billion addresses (2^32). Bringing the fight to this ground is efficient because, if the resource is scarce, you can actually be proactive and burn IP addresses as fast as you become aware that one is used by the enemy. Now, this landscape is ever-evolving. VPN providers, Tor, and residential proxy apps offer cybercriminals a way to borrow an IP address, let alone the fact that they can leverage some from already compromised servers on the dark web.
So if an IP address is used at a given moment in time, it’s possible that it isn’t anymore the next hour, and you then generate a false positive if you keep blocking it. The solution is to create a crowdsourcing tool protecting businesses of all sizes, across all types of places, geographies, clouds, homes, private corps’ DMZs, etc., and on all types of protocols. If the network is big enough, this IP rotation isn’t a problem: when the network stops reporting an IP, you can release it, whereas a new one with a rising number of reports needs to be integrated into the blocklist. The larger the network, the closer to real time it becomes.
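The release-when-silent logic can be sketched as a blocklist that only counts recent reports. This is a minimal illustration under stated assumptions, not CrowdSec's actual consensus mechanism: the class name, the one-hour window and the three-reporter threshold are hypothetical, and the addresses come from documentation ranges.

```python
from collections import defaultdict


class CrowdBlocklist:
    """Keep an IP blocked only while enough distinct reporters have seen it recently."""

    def __init__(self, window_seconds: int = 3600, min_reporters: int = 3):
        self.window_seconds = window_seconds
        self.min_reporters = min_reporters
        # ip -> {reporter_id: timestamp of that reporter's latest report}
        self._reports = defaultdict(dict)

    def report(self, ip: str, reporter_id: str, ts: float) -> None:
        """Record that a member of the network saw this IP misbehave at time ts."""
        self._reports[ip][reporter_id] = ts

    def blocked(self, now: float) -> set:
        """Return the IPs with enough distinct reporters inside the time window."""
        result = set()
        for ip, reporters in self._reports.items():
            fresh = [r for r, ts in reporters.items() if now - ts <= self.window_seconds]
            if len(fresh) >= self.min_reporters:
                result.add(ip)
        return result


blocklist = CrowdBlocklist()
for reporter in ("site-a", "site-b", "site-c"):
    blocklist.report("203.0.113.7", reporter, ts=0)
blocklist.report("198.51.100.9", "site-a", ts=0)  # a single reporter is not enough

print(blocklist.blocked(now=10))    # shortly after the reports
print(blocklist.blocked(now=4000))  # after the network went silent for over an hour
```

The more members report, the shorter the window can be while still gathering enough evidence, which is the sense in which a larger network gets closer to real time.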
You can monitor almost any protocol except UDP-based ones, which must be excluded because the source address of a UDP packet is easy to spoof. If you considered reports on a UDP-based protocol for banning an IP, you could easily be tricked into banning an innocent one. Other than that, every protocol is good to monitor. You can definitely look for CVEs but, even better, for behaviors. By doing so, you can catch business-oriented aggressions that are not necessarily CVE-based. A simple example, beyond the classical L7 DDoS, scans, credential bruteforce or stuffing, is scalping. Scalping is the action of auto-buying a product with a bot on a website and reselling it for a profit, on eBay for example. It’s a business-layer issue, not really a security-related one. The open-source system CrowdSec was designed exactly to enable this strategy.
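In practice, the UDP exclusion boils down to a filter applied before any report can count toward a ban. The sketch below assumes a hypothetical Report record with a transport field; it is not the schema of any real tool.

```python
from dataclasses import dataclass

# Transports whose source address can be forged without completing a handshake.
SPOOFABLE_TRANSPORTS = {"udp"}


@dataclass(frozen=True)
class Report:
    """Hypothetical report sent by a member of the network."""
    source_ip: str
    transport: str  # "tcp" or "udp"
    scenario: str   # e.g. "ssh-bruteforce", "dns-flood"


def usable_for_banning(reports: list) -> list:
    """Keep only the reports whose source IP cannot be trivially spoofed."""
    return [r for r in reports if r.transport.lower() not in SPOOFABLE_TRANSPORTS]


reports = [
    Report("203.0.113.7", "tcp", "ssh-bruteforce"),
    Report("198.51.100.9", "udp", "dns-flood"),  # the source may be a spoofed victim
    Report("203.0.113.7", "TCP", "http-scan"),
]
for report in usable_for_banning(reports):
    print(report.source_ip, report.scenario)
```

Dropped reports can still be useful as telemetry; they simply should not feed a decision that punishes the apparent source.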
Finally, for the last two decades, we were told, “IPv6 is coming, be ready”. Well… let’s say we had time to prepare. But it’s really here now, and 5G deployment will only accelerate its usage. IPv6 changes the stage with a new addressable pool as big as 2^128. This is still limited in many ways, not least because not all IPv6 ranges are in use yet, but also because everyone is getting many IPv6 addresses at once, not just one. Still, we are talking about a vast number of them now.
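The orders of magnitude are easy to check with Python's ipaddress module. The sketch also shows one possible consequence of everyone getting many addresses at once: keying a blocklist on a prefix rather than on a single IPv6 address. The /64 prefix length is an assumption (a common per-subscriber allocation, not a universal one), blocklist_key is a hypothetical helper, and the addresses come from documentation ranges.

```python
import ipaddress

ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses
ipv6_total = ipaddress.ip_network("::/0").num_addresses
one_subscriber = ipaddress.ip_network("2001:db8::/64").num_addresses

print("IPv4 pool:", ipv4_total)
print("IPv6 pool:", ipv6_total)
print("addresses in a single /64:", one_subscriber)


def blocklist_key(ip: str, v6_prefix: int = 64) -> str:
    """Return the IPv4 address itself, or the enclosing prefix for an IPv6 address."""
    addr = ipaddress.ip_address(ip)
    if addr.version == 4:
        return str(addr)
    # strict=False masks the host bits instead of raising an error.
    return str(ipaddress.ip_network(f"{addr}/{v6_prefix}", strict=False))


print(blocklist_key("203.0.113.7"))
print(blocklist_key("2001:db8::1234"))
print(blocklist_key("2001:db8::abcd"))
```

With this keying, two addresses rotated inside the same /64 map to the same entry, at the price of possibly catching neighbors when a prefix is shared.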
Let’s couple AI & Crowdsourcing
When data start to flow massively from a large crowd-sourced network and the resource you are trying to shrink is getting larger, AI sounds like a logical avenue to explore.
The network effect is already a good start on its own. Take credential attacks as an example. If an IP tries several login/password pairs at your place, you’d call it a credential bruteforce. Now at the network scale, if you have the same IP knocking at many different places with login/password pairs, it’s credential stuffing: someone trying to reuse stolen credentials in many places to see if they are valid. Seeing the same action, leveraging the same credentials, from many different angles gives you an extra indication of the purpose of the behavior itself.
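The distinction can be sketched with simple counting, no AI involved. The thresholds, the function name and the (ip, site, login) tuple layout below are hypothetical choices for the example; a real detector would also look at timing and at which credentials are replayed.

```python
from collections import defaultdict


def classify(attempts, min_attempts: int = 5, min_sites: int = 3) -> dict:
    """Label IPs from failed logins given as (ip, site, login) tuples.

    Many distinct sites suggests credential stuffing; many attempts
    concentrated on few sites suggests a credential bruteforce.
    """
    sites = defaultdict(set)
    count = defaultdict(int)
    for ip, site, _login in attempts:
        sites[ip].add(site)
        count[ip] += 1

    labels = {}
    for ip in count:
        if len(sites[ip]) >= min_sites:
            labels[ip] = "credential stuffing"
        elif count[ip] >= min_attempts:
            labels[ip] = "credential bruteforce"
    return labels


attempts = (
    # One IP hammering a single site with many logins.
    [("203.0.113.7", "shop.example", f"user{i}") for i in range(6)]
    # Another IP trying one pair each at several unrelated sites.
    + [("198.51.100.9", site, "alice") for site in ("a.example", "b.example", "c.example")]
)
labels = classify(attempts)
print(labels)
```

Note that the second IP would look harmless to each site taken alone: one failed login is nothing. Only the combined view reveals the pattern.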
Now, to be honest, you don’t need AI to tell credential bruteforce apart from credential reuse or credential stuffing, but there are places where it can excel, specifically when teamed with a large network that provides heaps of data.
Another example could be a massive internet scan, made using 1024 hosts. Each host could scan only one port and that would likely go unnoticed. Except if you see, in many different places, the same IP scanning the same port within a similar timeframe. Again, barely visible at the individual scale, obvious on a large one.
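A minimal sketch of that cross-site correlation: group observations by (source IP, port) and flag the pairs seen by several distinct sensors within a short window. The event layout, the ten-minute window and the three-sensor threshold are assumptions made for the example.

```python
from collections import defaultdict


def coordinated_scanners(events, window: int = 600, min_sensors: int = 3) -> set:
    """Flag (ip, port) pairs seen by at least min_sensors sensors within window seconds.

    events: iterable of (timestamp, sensor_id, source_ip, destination_port).
    """
    by_key = defaultdict(list)
    for ts, sensor, ip, port in events:
        by_key[(ip, port)].append((ts, sensor))

    flagged = set()
    for key, observations in by_key.items():
        observations.sort()
        start = 0
        in_window = {}  # sensor_id -> number of observations currently in the window
        for ts, sensor in observations:
            in_window[sensor] = in_window.get(sensor, 0) + 1
            # Slide the left edge until the window spans at most `window` seconds.
            while ts - observations[start][0] > window:
                old_sensor = observations[start][1]
                in_window[old_sensor] -= 1
                if in_window[old_sensor] == 0:
                    del in_window[old_sensor]
                start += 1
            if len(in_window) >= min_sensors:
                flagged.add(key)
                break
    return flagged


events = [
    (0, "sensor-1", "203.0.113.7", 22),
    (100, "sensor-2", "203.0.113.7", 22),
    (200, "sensor-3", "203.0.113.7", 22),
    (0, "sensor-1", "198.51.100.9", 80),
    (5000, "sensor-2", "198.51.100.9", 80),
    (10000, "sensor-3", "198.51.100.9", 80),
]
flagged = coordinated_scanners(events)
print(flagged)
```

Each sensor sees a single probe on a single port, which is unremarkable locally; the signal only appears once the observations are pooled.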
On the other hand, AI algorithms are good at identifying patterns that wouldn’t be visible if you looked at only one place at a time but are blatant at the scale of a large network.
Representing the data in appropriate structures, using graphs and embeddings, can uncover complex degrees of interaction between IP addresses, ranges, or even AS (Autonomous Systems). This leads to identifying cohorts of machines working in unison toward the same goal. If several IP addresses are sequencing an attack in many steps, like scanning, exploiting, installing a backdoor, and then using the target server to join a DDoS effort, those patterns will repeat in logs. So if the 1st IP of the cohort is visible at a given timestamp, the 2nd 10 minutes later, and so on, and this pattern repeats with the same IPs in many places, you can safely tell everyone to ban the 4 IP addresses at once.
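The cohort idea can be approximated without graphs or embeddings, by counting how many sites observe the same ordered run of IP addresses within a bounded time span. This is a deliberately simpler stand-in for the techniques named above, and it only matches consecutive events, so unrelated traffic in between would hide a cohort. The function name, thresholds and addresses are hypothetical.

```python
from collections import defaultdict


def repeating_cohorts(events, size: int = 4, max_span: int = 3600, min_sites: int = 3) -> set:
    """Return ordered IP tuples seen as consecutive events on at least min_sites sites.

    events: iterable of (site, timestamp, ip).
    """
    per_site = defaultdict(list)
    for site, ts, ip in events:
        per_site[site].append((ts, ip))

    sites_for = defaultdict(set)
    for site, observations in per_site.items():
        observations.sort()
        for i in range(len(observations) - size + 1):
            chunk = observations[i:i + size]
            ips = tuple(ip for _, ip in chunk)
            span = chunk[-1][0] - chunk[0][0]
            # Require distinct IPs acting in sequence within the allowed time span.
            if len(set(ips)) == size and span <= max_span:
                sites_for[ips].add(site)

    return {ips for ips, sites in sites_for.items() if len(sites) >= min_sites}


cohort = ("198.51.100.1", "198.51.100.2", "198.51.100.3", "198.51.100.4")
events = [
    (site, offset + step * 600, ip)  # each IP shows up 10 minutes after the previous one
    for site, offset in (("site-a", 0), ("site-b", 50), ("site-c", 900))
    for step, ip in enumerate(cohort)
]
found = repeating_cohorts(events)
print(found)
```

Once a tuple is confirmed on enough sites, all of its members can be pushed to the blocklist together, including on sites that have only seen the first step so far.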
The synergy between AI and crowd-sourced signals allows each to address the other’s limitations effectively. While crowd-sourced signals provide a wealth of real-time data on cyber threats, they might lack precision and context, possibly leading to false positives. AI algorithms, on the other hand, usually only become relevant after absorbing an enormous amount of data. In return, those models can help refine and analyze these signals, eliminating noise and unveiling hidden patterns.
There is a powerful couple to marry here.