Malware names are a mess: we used FCA to build a ‘family tree’ for threats
Different antivirus tools call the same threat different names. We used Formal Concept Analysis (FCA) to automatically build a unified, logical hierarchy from this chaos, making malware easier to track.
In cybersecurity, how do you fight an enemy you can’t even agree on what to call?
This is a massive problem in malware detection. You submit a malicious file to VirusTotal and get 70 different answers. One antivirus calls it “Trojan.Generic”, another says “Mal/Kryptik-G”, and a third calls it “W32.Troj”. Are they the same? Is one a variant of the other?
This lack of standardized naming makes it incredibly difficult to track threats, share intelligence, and perform digital forensics. In our 2024 paper in Forensic Science International: Digital Investigation, we proposed a solution: use logic to build the “family tree” that all these names belong to.
🧐 The problem: a ‘babel’ of threat names
The core issue is the lack of “homogenisation”. Every cybersecurity vendor has its own proprietary naming scheme. This creates a “Tower of Babel” scenario where:
* Analysts can’t collaborate: How can two experts share data about “Mal/Kryptik-G” if one of them only knows it as “Trojan.Win32”?
* It’s impossible to see the big picture: You can’t see the relationships between threats. Is this new “Trojan.Agent.D” related to the “Kryptik” family?
* Forensics becomes difficult: A clear hierarchy is essential for legal and forensic investigation.
💡 Our solution: let FCA find the hierarchy
We realized the solution wasn’t to force everyone to use the same names. The solution was to find the logical structure hidden within those messy names.
Our tool of choice? Formal Concept Analysis (FCA).
We built a system that:
1. Takes a dataset of malware samples (the “objects”).
2. Uses all the different labels from all the different antivirus engines as the “attributes” (e.g., trojan, generic, kryptik, win32).
3. Runs FCA to build the concept lattice.
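To make steps 1 and 2 concrete, here is a minimal Python sketch of how raw vendor labels could be turned into a formal context. It is an illustration, not the pipeline from the paper: the vendor names, the sample reports, the alias table, and the rule that drops one-character tokens are all assumptions made for this example.

```python
import re

# Hypothetical vendor reports: sample id -> {vendor: raw label}.
REPORTS = {
    "sample_1": {"VendorA": "Trojan.Generic", "VendorB": "Mal/Kryptik-G"},
    "sample_2": {"VendorA": "Trojan.Win32.Agent", "VendorB": "W32.Troj"},
    "sample_3": {"VendorA": "Worm.Win32.Generic"},
}

# Assumed alias table so that different vendor spellings collapse to one token.
ALIASES = {"troj": "trojan", "w32": "win32", "mal": "malware"}


def tokenize(label):
    """Split a raw label on non-alphanumeric characters and normalise tokens."""
    tokens = re.split(r"[^A-Za-z0-9]+", label.lower())
    # Drop empty strings and one-character variant suffixes such as "g".
    return {ALIASES.get(t, t) for t in tokens if len(t) > 1}


def build_context(reports):
    """Return the formal context as {object: set of attributes}."""
    return {
        sample: set().union(*(tokenize(lbl) for lbl in labels.values()))
        for sample, labels in reports.items()
    }


CONTEXT = build_context(REPORTS)
for sample, attributes in CONTEXT.items():
    print(sample, sorted(attributes))
```

Each sample ends up with the union of the tokens that any engine assigned to it. This object-by-attribute table is the input that step 3 works on.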
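The concept lattice is, by construction, a logical hierarchy: its concepts are ordered from general to specific, and each concept pairs a group of malware samples with exactly the labels that all of them share. It automatically groups malware samples based on the labels they share.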
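The grouping can be sketched in a few lines of Python. The toy context below is hypothetical, and the enumeration is deliberately naive (it closes every subset of samples, which is exponential); real FCA tools use far more efficient algorithms, so treat this as an illustration of the idea rather than the method used in the paper.

```python
from itertools import combinations

# Toy formal context (hypothetical): sample -> label tokens it carries.
CONTEXT = {
    "s1": {"trojan", "kryptik"},
    "s2": {"trojan", "kryptik", "win32"},
    "s3": {"trojan", "win32"},
    "s4": {"worm", "win32"},
}
ATTRIBUTES = set().union(*CONTEXT.values())


def extent(attrs):
    """All samples that carry every attribute in attrs."""
    return frozenset(o for o, a in CONTEXT.items() if attrs <= a)


def intent(objs):
    """All attributes shared by every sample in objs."""
    shared = set(ATTRIBUTES)
    for o in objs:
        shared &= CONTEXT[o]
    return frozenset(shared)


def concepts():
    """Naive enumeration: close every subset of samples (toy data only)."""
    found = set()
    objects = list(CONTEXT)
    for r in range(len(objects) + 1):
        for subset in combinations(objects, r):
            shared = intent(subset)
            found.add((extent(shared), shared))
    return found


LATTICE = concepts()
# Print concepts from most general (largest extent) to most specific.
for ext, shared in sorted(LATTICE, key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(ext), sorted(shared))
```

Every pair in `LATTICE` is a formal concept: the samples are exactly those carrying all the listed labels, and the labels are exactly those common to all the listed samples.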
🚀 The results: a clean ‘family tree’ from messy data
The result is a unified taxonomy. Instead of a flat, confusing list of names, we get a “family tree” that reveals the true relationships between threats.
For example, a group of samples all labeled trojan by Vendor A and kryptik by Vendor B will automatically form a “concept” in the lattice. This concept is a child of the more general trojan concept, giving us a clear parent-child relationship.
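The parent-child relationship in this example can be checked directly. The following Python sketch uses a hypothetical four-sample context; `concept_of` and `is_subconcept` are illustrative helper names, not functions from the paper or from any FCA library.

```python
# Hypothetical context: label tokens each sample received across vendors.
CONTEXT = {
    "s1": {"trojan", "kryptik"},  # Vendor A says trojan, Vendor B says kryptik
    "s2": {"trojan", "kryptik"},
    "s3": {"trojan"},
    "s4": {"worm"},
}


def concept_of(attrs):
    """Smallest formal concept whose intent contains attrs, as (extent, intent)."""
    ext = frozenset(o for o, a in CONTEXT.items() if set(attrs) <= a)
    if ext:
        common = frozenset(set.intersection(*(CONTEXT[o] for o in ext)))
    else:
        # No sample matches: by convention the intent is every attribute.
        common = frozenset(set().union(*CONTEXT.values()))
    return ext, common


def is_subconcept(child, parent):
    """child <= parent iff the child's extent is contained in the parent's."""
    return child[0] <= parent[0]


trojan = concept_of({"trojan"})
kryptik = concept_of({"trojan", "kryptik"})

print(sorted(trojan[0]), sorted(trojan[1]))
print(sorted(kryptik[0]), sorted(kryptik[1]))
print(is_subconcept(kryptik, trojan))
```

In this toy data, asking for `concept_of({"kryptik"})` alone returns the same concept as `kryptik`, because every sample labeled kryptik is also labeled trojan. That implication is what places the kryptik concept below the trojan concept in the hierarchy.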
This approach provides a logical, mathematical, and automated way to bring order to the chaos.
🔬 Why does this matter?
This framework acts as a “Rosetta Stone” for malware analysts. It provides a common, objective map that everyone can use to understand the threat landscape, even if their individual tools use different names.
For digital forensics, this is a huge step forward. It provides a provable, hierarchical description of malware threats, which is essential for standardizing evidence and analysis in legal contexts. It makes malware classification a formal science, not just a naming convention.
📖 The full paper
For the complete methodology, the dataset description, and the full analysis of the generated malware hierarchies, you can read the original journal article.
A Formal Concept Analysis approach to hierarchical description of malware threats. Authors: Manuel Ojeda-Hernández, Domingo López-Rodríguez, Ángel Mora. Journal: Forensic Science International: Digital Investigation (vol. 50, 301797)