Semantic LLM — Comprehensive Binary Analysis for Malware Detection

Thu, Sep 1, 2022

Overview

A full-stack agentic AI system for zero-day binary malware analysis, built at Huawei R&D’s Anshi Lab. The system operates at the intersection of LLM for Security (binary analysis) and Security for LLM (adversarial robustness), and is deployed across heterogeneous hardware including Huawei NPU, GPU clusters, and IoT edge devices.

Semantic Function Model (SFM)

Developed two architecture variants for function-level binary analysis:

Tokenless Instruction Set Transformer — takes 32-dimensional architecture-specific instruction sets as input, eliminating the need for a separate tokenizer.
Intermediate Representation Tokenizer — lifts binaries to LLVM IR with a POV Normalization Engine for architecture-agnostic analysis.
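
As a rough illustration of the tokenless idea, the sketch below projects fixed-width 32-dimensional instruction vectors straight to the model width with a linear map, in place of a vocabulary lookup. This is only one possible reading of the description above: the contents of the 32 features are not given in the text and are left opaque here, and `MODEL_DIM`, `make_projection`, and `embed_function` are hypothetical names chosen for the example.

```python
import random

INSTR_DIM = 32  # input width stated in the text
MODEL_DIM = 8   # hypothetical hidden width, kept tiny for readability


def make_projection(in_dim, out_dim, rng):
    """Random in_dim x out_dim matrix standing in for a learned linear layer."""
    scale = in_dim ** -0.5
    return [[rng.uniform(-scale, scale) for _ in range(out_dim)]
            for _ in range(in_dim)]


def embed_function(instr_vectors, weight):
    """Project each instruction vector to the model width (no token lookup)."""
    in_dim, out_dim = len(weight), len(weight[0])
    embedded = []
    for vec in instr_vectors:
        if len(vec) != in_dim:
            raise ValueError("expected %d features, got %d" % (in_dim, len(vec)))
        embedded.append([sum(vec[i] * weight[i][j] for i in range(in_dim))
                         for j in range(out_dim)])
    return embedded


rng = random.Random(0)
weight = make_projection(INSTR_DIM, MODEL_DIM, rng)
# Three placeholder instructions; real feature values are an assumption here.
function_instrs = [[rng.random() for _ in range(INSTR_DIM)] for _ in range(3)]
sequence = embed_function(function_instrs, weight)
print(len(sequence), len(sequence[0]))  # 3 8
```

The resulting sequence of vectors is what a transformer encoder would consume; because the input width is fixed, no tokenizer vocabulary has to be built or maintained per architecture.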

Semantic Program Model (SPM)

Replaced self-attention in a transformer with Holographic Reduced Representations (HRR). HRR binding takes the place of the Query-Key interaction, mapping XOR-style logic onto it at O(T log T) complexity, which enables analysis of malicious binaries with 100,000+ functions. The symbolic binding operations also act as a natural filter for adversarial noise, improving the model's robustness to adversarial attacks.
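
The binding primitive behind HRR can be sketched in a few lines. The example below is a minimal, standard-library illustration of HRR binding (circular convolution) and approximate unbinding (binding with the involution of the key); it is not the project's attention layer. The explicit O(n^2) loop is used for readability, and an FFT-based version would bring each binding down to O(n log n). The names `random_hrr`, `bind`, `approx_inverse`, and `unbind` are chosen for this example.

```python
import math
import random


def random_hrr(dim, rng):
    """Vector with i.i.d. N(0, 1/dim) entries, the usual HRR initialization."""
    std = 1.0 / math.sqrt(dim)
    return [rng.gauss(0.0, std) for _ in range(dim)]


def bind(a, b):
    """Circular convolution: c[k] = sum_j a[j] * b[(k - j) mod n]."""
    n = len(a)
    return [sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)]


def approx_inverse(a):
    """Involution a'[i] = a[-i mod n], the approximate inverse under binding."""
    n = len(a)
    return [a[(-i) % n] for i in range(n)]


def unbind(trace, key):
    """Recover a noisy copy of whatever was bound to `key` inside `trace`."""
    return bind(approx_inverse(key), trace)


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


rng = random.Random(0)
dim = 256
key1, val1, key2, val2 = (random_hrr(dim, rng) for _ in range(4))

# Superpose two key-value bindings into a single fixed-width trace.
trace = [x + y for x, y in zip(bind(key1, val1), bind(key2, val2))]

recovered = unbind(trace, key1)
sim_match = cosine(recovered, val1)  # should be clearly positive
sim_other = cosine(recovered, val2)  # should stay near zero
print(sim_match > sim_other)
```

Unbinding with `key1` returns a noisy copy of `val1`, while the contribution of the other pair behaves like low-magnitude noise. That property is what lets many key-value pairs share one fixed-width vector.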

Malware Analyst LLM

Used Mixture-of-Experts (MoE) routing across the SFM and SPM pathways. Extended the container framework with Agent Client Protocol (ACP) and Model Context Protocol (MCP) infrastructure that dynamically coordinates:

6 multi-turn Online Agents, including Program Encoder Signature Generator, KNN Search, CFG Segment Classifier (GAT), LLM4Decompile Code Generation, and Pangu-R1 reasoning for explainability.
10+ Autonomous Tools, including Ghidra Pro Disassembler, LLVM IR Lifter, Static and Dynamic Behavior Logger (Emulator), and Chroma RAG-DB.
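
The MoE routing across the SFM and SPM pathways mentioned above can be pictured with a small gating sketch. This is a dense softmax gate over two experts, assumed here for illustration only; the text does not say whether the actual router is dense or top-k sparse, and the gate scores and expert outputs below are made-up placeholder values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_scores, expert_outputs):
    """Blend expert outputs using softmax gate weights."""
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    mixed = [sum(w * out[i] for w, out in zip(weights, expert_outputs))
             for i in range(dim)]
    return weights, mixed


# Hypothetical pathway outputs for one input (placeholder values).
sfm_out = [1.0, 0.0]  # function-level pathway
spm_out = [0.0, 1.0]  # program-level pathway

# A gate score of 2.0 vs 0.0 favours the first (SFM) pathway.
weights, mixed = route([2.0, 0.0], [sfm_out, spm_out])
print([round(w, 3) for w in weights])  # [0.881, 0.119]
```

A sparse variant would keep only the highest-weighted expert(s) and skip evaluating the rest, trading some smoothness for lower compute.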

Downstream Capabilities

The high-dimensional function and program embeddings from SFM and SPM support Function DNA Matching, vulnerability auditing, and cross-architecture code similarity search at scale.
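
Embedding-based similarity search of this kind usually reduces to nearest-neighbour lookup under cosine similarity. The sketch below uses a brute-force scan over a tiny in-memory index; the function names and the three-dimensional vectors are invented for illustration, and a production system would rely on a vector database or an approximate nearest-neighbour index instead.

```python
import math


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def knn(query, index, k):
    """Return the k index entries most similar to `query`, best first."""
    scored = [(cosine(query, emb), name) for name, emb in index.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Toy embeddings (made up): the same routine compiled for two architectures
# should land close together, while an unrelated routine should not.
index = {
    "x86_memcpy": [0.9, 0.1, 0.0],
    "arm_memcpy": [0.8, 0.2, 0.1],
    "x86_sha256": [0.0, 0.1, 0.9],
}

query = [1.0, 0.1, 0.0]
top = knn(query, index, k=2)
print([name for name, _ in top])  # ['x86_memcpy', 'arm_memcpy']
```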

Heterogeneous Hardware Deployment

Huawei NPU: W8A8 dynamic quantization via HiFloat (HF8) using CANN & ModelSlim.
IoT/Edge: Progressive Teacher–Student Distillation.
GPU Clusters: Mixed-precision training against open-source and in-house malware repositories.
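
For the teacher-student distillation used on IoT/edge targets, the core per-step objective is commonly a KL divergence between temperature-softened teacher and student distributions. The sketch below shows that standard loss only; the temperature, logits, and function names are assumptions for the example. One common reading of a "progressive" schedule, also an assumption here, is to apply this loss repeatedly, with each distilled student serving as the teacher for the next, smaller one.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


# Placeholder logits for a three-class output (values are illustrative).
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(loss_close < loss_far)  # True
```

The T^2 factor keeps gradient magnitudes comparable across temperatures when this term is combined with a hard-label loss.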

Tech Stack: Python, C++, Assembly, PyTorch, Hugging Face (Transformers, PEFT), LoRA/QLoRA, CANN, vLLM, NVIDIA TensorRT-LLM, DSPy, PydanticAI, CrewAI, MCP, Pinecone, CPU/GPU/NPU Profiling