Detecting malware families via minimal generators of behavioral concepts

Cybersecurity
FCA
Algorithm
Minimal generators
Malware
Publication idea

This paper proposes a novel FCA-based malware detection framework. We use the minimal generators of behavioral concepts as robust, interpretable signatures to classify malware. This method provides high accuracy and clear, human-readable explanations for security analysts.

Authors

Domingo López Rodríguez, Manuel Ojeda-Hernández, Ángel Mora

Published

25 November 2025

Keywords

Malware detection, Behavioral analysis, Minimal generators, Formal Concept Analysis, Interpretability, Explainable AI

The challenge in malware detection is to move beyond simple signature matching towards identifying the core behaviors that define a threat family. This paper proposes a novel detection framework using Formal Concept Analysis (FCA). We construct a formal context where malware samples are objects and their dynamic behaviors (e.g., API calls, registry modifications, network connections) are attributes. We argue that the minimal generators of the concepts in this lattice represent the essential, non-redundant sets of behaviors that are sufficient to trigger a specific malicious classification. These minimal generator sets form robust and interpretable “behavioral signatures.” We develop an algorithm to extract these signatures and use them to classify new, unseen malware samples. Our results on a large dataset of malware show that this method not only achieves high detection accuracy but also provides security analysts with clear, human-readable explanations for why a sample is classified as malicious.

Introduction

Modern malware is polymorphic and evasive, making traditional signature-based detection increasingly ineffective. Behavioral analysis, which focuses on what malware does rather than what it is, offers a more robust alternative. However, extracting a concise and reliable behavioral signature from a noisy execution trace is a significant challenge.

In previous work, we used FCA to build a hierarchical taxonomy of malware threats from the labels assigned by different antivirus engines. This provided a structured view of malware families. The logical next step, which we address in this paper, is to identify the underlying behaviors that define these families.

Our approach is grounded in the FCA concept of minimal generators. A minimal generator of a concept is a minimal set of attributes (behaviors) whose closure yields the full intent of the concept. We posit that these minimal generators are the “causal core” of a malware family’s behavior. Our contributions are:

  1. A formal framework for modeling malware behavior using FCA, where minimal generators serve as behavioral signatures.
  2. An efficient algorithm for extracting and managing these minimal generator signatures from a behavioral context.
  3. An empirical validation showing that our signature-based classifier achieves high accuracy and provides superior interpretability compared to black-box machine learning models.
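For reference, the notion of minimal generator used above can be stated formally as follows. This is the standard FCA definition; the symbols $O$, $M$, $I$, $Y$ and $H$ are notation chosen here for illustration, not taken from the proposal.

```latex
% Formal context and derivation operators
\[
\mathbb{K} = (O, M, I), \qquad
A' = \{\, m \in M \mid (o, m) \in I \ \text{for all } o \in A \,\}, \qquad
Y' = \{\, o \in O \mid (o, m) \in I \ \text{for all } m \in Y \,\}
\]
% Minimal generators of an intent B (a closed attribute set, B = B'')
\[
\mathrm{MinGen}(B) = \{\, G \subseteq B \mid G'' = B \ \text{and}\ H'' \neq B \ \text{for all } H \subsetneq G \,\}
\]
```

Here $O$ is the set of malware samples, $M$ the set of behaviors, and $I$ the relation "sample $o$ exhibits behavior $m$". An intent $B$ may have several minimal generators, each of which is a candidate signature.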

Methodology and expected results

Step 1: Behavioral context construction

We will use a dynamic analysis sandbox (e.g., Cuckoo Sandbox) to execute a large set of malware samples. For each sample, we extract a feature vector of binary attributes representing its behavior (e.g., `creates_file_in_system32`, `connects_to_port_80`, `modifies_run_key`). These vectors form the rows of our formal context $\mathbb{K}$, where objects are malware samples and attributes are behaviors.
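As a minimal sketch of this step, the following Python snippet turns per-sample behavior sets into a binary object-attribute table. It assumes the sandbox reports have already been parsed into sets of behavior names; the function `build_context` and the sample identifiers are hypothetical, and no sandbox API is used.

```python
from typing import Dict, List, Set, Tuple


def build_context(
    reports: Dict[str, Set[str]],
) -> Tuple[List[str], List[str], List[List[bool]]]:
    """Turn per-sample behavior sets into a binary object-attribute table."""
    objects = sorted(reports)  # rows: malware samples
    # columns: every behavior observed in at least one sample
    attributes = sorted(set().union(*reports.values()))
    table = [[a in reports[o] for a in attributes] for o in objects]
    return objects, attributes, table


# Hypothetical, already-parsed sandbox reports (illustrative data only).
reports = {
    "sample_a": {"creates_file_in_system32", "modifies_run_key"},
    "sample_b": {"connects_to_port_80", "modifies_run_key"},
    "sample_c": {"connects_to_port_80"},
}
objects, attributes, table = build_context(reports)
for name, row in zip(objects, table):
    print(name, ["x" if cell else "." for cell in row])
```

Sorting objects and attributes keeps the table deterministic across runs, which makes later comparisons between extracted signature sets easier.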

Step 2: Identifying key concepts and minimal generators

We will first build the concept lattice $\mathfrak{B}(\mathbb{K})$. We are particularly interested in concepts that are good predictors of a specific malware family (as defined by our previous taxonomy). For a target malware family $F$, we can identify a concept $(A,B)$ as being representative of $F$ if a high percentage of objects in its extent $A$ belong to $F$.
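The selection of representative concepts can be sketched as below. This is an illustration under stated assumptions, not the proposal's implementation: intents are enumerated naively as intersections of object rows (adequate only for small contexts), "high percentage" is modeled by a hypothetical threshold `min_purity`, and the family labels and sample data are invented.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Context = Dict[str, Set[str]]  # sample id -> set of observed behaviors


def extent(context: Context, attrs: FrozenSet[str]) -> FrozenSet[str]:
    """All samples that exhibit every behavior in attrs."""
    return frozenset(o for o, beh in context.items() if attrs <= beh)


def all_intents(context: Context) -> Set[FrozenSet[str]]:
    """Every intent is the full attribute set or an intersection of rows."""
    intents = {frozenset().union(*context.values())}
    for beh in context.values():
        intents |= {frozenset(beh) & i for i in intents}
    return intents


def key_concepts(
    context: Context, labels: Dict[str, str], family: str, min_purity: float
) -> List[Tuple[FrozenSet[str], FrozenSet[str], float]]:
    """Concepts (A, B) whose extent A is at least min_purity in the family."""
    selected = []
    for b in all_intents(context):
        a = extent(context, b)
        if not a:
            continue  # an empty extent says nothing about any family
        purity = sum(labels[o] == family for o in a) / len(a)
        if purity >= min_purity:
            selected.append((a, b, purity))
    return selected


# Illustrative data; "FamilyA" and "FamilyB" are placeholder labels.
context = {
    "s1": {"modifies_run_key", "connects_to_port_80"},
    "s2": {"modifies_run_key", "connects_to_port_80", "creates_file_in_system32"},
    "s3": {"creates_file_in_system32"},
}
labels = {"s1": "FamilyA", "s2": "FamilyA", "s3": "FamilyB"}
selected = key_concepts(context, labels, "FamilyA", min_purity=0.9)
for a, b, purity in selected:
    print(sorted(a), sorted(b), purity)
```

The threshold trades coverage against precision: a lower `min_purity` admits more concepts, and hence more signatures, at the cost of signatures that also match samples from other families.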

Main theoretical/algorithmic task

For these key concepts, we will compute their set of all minimal generators, $\text{MinGen}(B)$. This is the core of our method. A key result will be to demonstrate that the set $\bigcup_{(A,B) \in \text{KeyConcepts}} \text{MinGen}(B)$ forms a comprehensive and non-redundant signature base. We will leverage existing or new algorithms for minimal generator computation, possibly extending the CbO-based approach as mentioned in the research memorandum.
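To make the target of this computation concrete, here is a brute-force sketch that enumerates the minimal generators of a single intent level by level. It is not the CbO-based algorithm envisaged above and is exponential in the size of the intent; it only illustrates what the output should be. The helper names and the data are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

Context = Dict[str, Set[str]]  # sample id -> set of observed behaviors


def closure(context: Context, attrs: FrozenSet[str]) -> FrozenSet[str]:
    """attrs'': behaviors shared by all samples that exhibit attrs."""
    rows = [beh for beh in context.values() if attrs <= beh]
    if not rows:  # no sample matches: the closure is the full attribute set
        return frozenset().union(*context.values())
    return frozenset(set.intersection(*map(set, rows)))


def minimal_generators(context: Context, intent: FrozenSet[str]) -> List[FrozenSet[str]]:
    """Subsets G of intent with closure(G) == intent and no smaller such subset."""
    found: List[FrozenSet[str]] = []
    for size in range(len(intent) + 1):
        for cand in combinations(sorted(intent), size):
            g = frozenset(cand)
            # any superset of a known generator is a generator but not minimal
            if any(m < g for m in found):
                continue
            if closure(context, g) == intent:
                found.append(g)
    return found


# Illustrative data only.
context = {
    "s1": {"modifies_run_key", "connects_to_port_80"},
    "s2": {"modifies_run_key", "connects_to_port_80", "creates_file_in_system32"},
    "s3": {"creates_file_in_system32"},
}
intent_b = frozenset({"modifies_run_key", "connects_to_port_80"})
gens = minimal_generators(context, intent_b)
print([sorted(g) for g in gens])
```

In this toy context either behavior alone already implies the other, so the two-element intent has two one-element minimal generators. Processing candidates by increasing size guarantees that every non-minimal generator is pruned by a smaller one found earlier.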

Step 3: Classification

For a new, unseen sample with behavior set $X$, classification is straightforward and fast. We check if $X$ contains any of our computed minimal generator signatures. If $G \subseteq X$ for some $G \in \text{MinGen}(B)$ associated with family $F$, we classify the sample as belonging to family $F$. This is inherently explainable: the sample is malicious because it exhibits the minimal behavior set $G$.
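The subset test described above can be sketched in a few lines. The signature base, family names, and the extra behavior `reads_hosts_file` are invented for illustration. The sketch returns every matching (family, generator) pair, because the text does not specify how to resolve a sample that matches signatures from several families.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# family -> list of minimal generator signatures (hypothetical values)
SignatureBase = Dict[str, List[FrozenSet[str]]]


def classify(
    behaviors: Set[str], signatures: SignatureBase
) -> List[Tuple[str, FrozenSet[str]]]:
    """Return every (family, G) such that signature G is contained in behaviors."""
    return [
        (family, g)
        for family, gens in signatures.items()
        for g in gens
        if g <= behaviors  # G is a subset of X
    ]


signatures = {
    "FamilyA": [frozenset({"modifies_run_key"})],
    "FamilyB": [frozenset({"creates_file_in_system32", "connects_to_port_80"})],
}
x = {"modifies_run_key", "reads_hosts_file"}
matches = classify(x, signatures)
for family, g in matches:
    # the matched generator doubles as the explanation shown to the analyst
    print(f"{family}: matched behaviors {sorted(g)}")
```

An empty result means that no signature fired; how such samples are handled (for example, reported as unknown) is a design decision left open here.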

Work plan

  • Months 1-2: Set up the dynamic analysis pipeline and collect a large dataset of malware samples and their behavioral reports. Construct the formal context.
  • Months 3-5: Implement or adapt an efficient algorithm for minimal generator computation. Extract the minimal generator signatures for key malware families.
  • Months 6-8: Build the classifier and evaluate its performance (accuracy, precision, recall, F1-score) against standard ML classifiers (e.g., Random Forest, SVM) on a held-out test set.
  • Months 9-12: Conduct a qualitative analysis, presenting examples of discovered signatures and showing how they provide clear explanations for malware behavior. Write the manuscript.
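The evaluation metrics named in the work plan can be computed per family as in the following sketch, which treats one family as the positive class. The function name `precision_recall_f1` and the label values are hypothetical; in practice an established library implementation could be used instead.

```python
from typing import List, Tuple


def precision_recall_f1(
    y_true: List[str], y_pred: List[str], positive: str
) -> Tuple[float, float, float]:
    """Precision, recall and F1 for one family treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative labels for a held-out test set of four samples.
y_true = ["FamilyA", "FamilyA", "FamilyB", "FamilyB"]
y_pred = ["FamilyA", "FamilyB", "FamilyA", "FamilyB"]
scores = precision_recall_f1(y_true, y_pred, positive="FamilyA")
print(scores)
```

Reporting these values per family, rather than only overall accuracy, shows whether the signatures of a rare family are as reliable as those of a common one.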

Potential target journals

  1. Computers & Security (Q1): A leading journal in cybersecurity that publishes novel detection methods.
  2. Forensic Science International: Digital Investigation (Q3): An excellent venue given the successful publication of the previous taxonomy paper there. This would be a natural follow-up.
  3. Journal of Computer Virology and Hacking Techniques (Q2): A specialized journal that would be highly receptive to this novel, formal methods-based approach.

Minimum viable article (MVA) strategy

This project splits naturally into a foundational FCA paper and a cybersecurity application paper.

  • Paper 1 (The MVA - algorithmic contribution):
    • Scope: This paper focuses on the FCA contribution: “A CbO-based algorithm for efficiently mining minimal generators”. It would present the new or adapted algorithm for finding minimal generators, prove its correctness, and analyze its complexity. The malware application would be used as a motivating example and for the experimental evaluation, but the paper’s core contribution would be the algorithm itself.
    • Goal: To contribute a new, efficient algorithm to the core FCA toolbox.
    • Target venue: A theoretical computer science or FCA-centric venue, such as the ICFCA conference or the journal Discrete Applied Mathematics.
  • Paper 2 (The cybersecurity application):
    • Scope: This paper focuses entirely on the application. It would briefly cite Paper 1 for the algorithmic details and then dive deep into the cybersecurity problem. It would describe the data collection, context construction, and a thorough evaluation of the detection method against state-of-the-art malware detection systems. The emphasis would be on performance, scalability, and, most importantly, the explainability of the results for security analysts.
    • Goal: To introduce a novel, interpretable malware detection paradigm to the cybersecurity community.
    • Target venue: A top-tier cybersecurity journal like Computers & Security or IEEE Transactions on Information Forensics and Security.