A comprehensive framework for automated static malware analysis using Ghidra's headless mode. This project extracts key features from malware binaries to create datasets for machine learning models and malware research.
Malware sources: vx-underground, VirusSign, MalwareBazaar, etc.
Ransomware samples: https://github.com/Cryakl/Ransomware-Database/
This project combines Python and Java scripts to analyze malware samples using NSA's Ghidra reverse engineering framework. The system:
- Automates batch analysis of multiple malware samples
- Extracts key static features from binaries
- Generates structured datasets (CSV) for machine learning
- Detects suspicious indicators like dangerous APIs, obfuscation techniques, and more
The system operates in two modes:
- GUI Mode: Interactive analysis through Ghidra's graphical interface
- Headless Mode: Automated batch processing from the command line
The workflow consists of:
- The Python script scans the sample directory for malware files
- For each file, it invokes the Ghidra headless analyzer
- The Java script extracts features during analysis
- The results are collected into a CSV dataset
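The per-file invocation in the workflow above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `build_headless_command` and the `tmp_` project naming scheme are hypothetical, and the launcher location assumes a standard Ghidra install (`support/analyzeHeadless`, or `analyzeHeadless.bat` on Windows).

```python
from pathlib import Path
from typing import List


def build_headless_command(ghidra_path: str, project_path: str,
                           sample: str, script_dir: str,
                           script_name: str = "ExtractMalwareFeatures.java") -> List[str]:
    """Build the analyzeHeadless argument list for a single sample."""
    # Assumption: standard install layout; use analyzeHeadless.bat on Windows.
    launcher = Path(ghidra_path) / "support" / "analyzeHeadless"
    # Hypothetical naming scheme: one throwaway project per sample.
    project_name = "tmp_" + Path(sample).stem
    return [
        str(launcher),
        project_path,                # directory that holds the Ghidra project
        project_name,
        "-import", sample,           # binary to import and auto-analyze
        "-scriptPath", script_dir,   # where Ghidra looks for the script
        "-postScript", script_name,  # run after auto-analysis completes
        "-deleteProject",            # discard the project once the run ends
    ]
```

Passing the returned list to `subprocess.run(cmd)` would launch one analysis. Building the command as a list rather than a shell string avoids quoting problems with unusual sample file names.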
The system extracts the following features from binary files:
| Feature | Description |
|---|---|
| FuncCount | Total number of functions |
| AvgInstrPerFunc | Average instructions per function |
| APIEntropy | Entropy score of API calls |
| SuspStringCount | Count of suspicious strings |
| MaxRefCount | Maximum reference count |
| ObfuscationScore | Degree of code obfuscation |
| IsAnomalous | Binary classification of anomalous patterns |
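The table does not say how APIEntropy is computed. One plausible reading, given that the example values fall between 0 and 1, is a Shannon entropy over the distribution of API call names, normalized by its maximum. The sketch below shows that interpretation only; `normalized_entropy` is a hypothetical helper, not the formula used by ExtractMalwareFeatures.java.

```python
import math
from collections import Counter
from typing import Iterable


def normalized_entropy(api_calls: Iterable[str]) -> float:
    """Shannon entropy of API name frequencies, scaled to the range [0, 1]."""
    counts = Counter(api_calls)
    total = sum(counts.values())
    # With zero or one distinct API there is no uncertainty to measure.
    if total == 0 or len(counts) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # log2(number of distinct APIs) is the entropy of a uniform distribution.
    return entropy / math.log2(len(counts))
```

Under this definition a binary that calls many APIs equally often scores near 1, while one dominated by a single API scores near 0.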
Requirements:
- Ghidra 11.3.1 or higher
- Python 3.6+
- Java Development Kit (Ghidra 11.3.x requires JDK 21; a JRE alone is not sufficient)
To install:
- Clone this repository:

      git clone https://github.com/GlgApr/Malware-Analyzer.git
      cd Malware-Analyzer

- Configure paths in malware_analyzer.py:

      GHIDRA_PATH = "/path/to/your/ghidra/"
      PROJECT_PATH = "/path/to/store/ghidra/projects"
      MALWARE_PATH = "/path/to/malware/samples"
      OUTPUT_PATH = "/path/to/output/directory"

- Place ExtractMalwareFeatures.java in your Ghidra scripts directory, or specify its path in the Python script.
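Before a long batch run it can help to confirm that the configured paths point where you expect. The following is a small sketch of such a check; `check_config` is a hypothetical helper that is not part of the repository, and it assumes the standard Ghidra layout in which the headless launcher lives under `support/`.

```python
from pathlib import Path
from typing import List


def check_config(ghidra_path: str, malware_path: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Assumption: standard install layout (analyzeHeadless.bat on Windows).
    launcher = Path(ghidra_path) / "support" / "analyzeHeadless"
    if not launcher.is_file():
        problems.append(f"analyzeHeadless not found at {launcher}")
    if not Path(malware_path).is_dir():
        problems.append(f"sample directory does not exist: {malware_path}")
    return problems
```

Calling `check_config(GHIDRA_PATH, MALWARE_PATH)` and printing any returned messages surfaces a mistyped path before any sample is processed.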
Run the analyzer with:

    python3 malware_analyzer.py

This will:
- Scan the malware directory for supported file formats
- Process each file with Ghidra headless analyzer
- Extract features and save them to a CSV file
- Clean up temporary projects to save disk space
The analyzer supports multiple file formats including:
- Windows executables (.exe, .dll)
- Linux ELF files (.elf)
- Scripts (.js, .ps1, .hta, .bat)
- Archive files (.zip)
- And more
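The scan for supported formats could look roughly like this. It is a sketch under assumptions: `find_samples` and `SUPPORTED_EXTENSIONS` are hypothetical names, the extension set contains only the formats listed above, and the real script may match files differently (for example by magic bytes rather than by suffix).

```python
from pathlib import Path
from typing import Iterable, List

# Only the extensions named in the list above; the real script may accept more.
SUPPORTED_EXTENSIONS = {".exe", ".dll", ".elf", ".js", ".ps1", ".hta", ".bat", ".zip"}


def find_samples(root: str,
                 extensions: Iterable[str] = SUPPORTED_EXTENSIONS) -> List[Path]:
    """Recursively collect files under root whose suffix is supported."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        p for p in Path(root).rglob("*")
        # Compare suffixes case-insensitively so SAMPLE.EXE is not skipped.
        if p.is_file() and p.suffix.lower() in wanted
    )
```

Matching by suffix is cheap but misses renamed samples, which are common in malware collections; treat the result as a first-pass filter.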
The system generates a CSV file (malware_features.csv) containing extracted features that can be used for:
- Machine learning model training
- Static analysis research
- Malware family clustering
- Threat intelligence
Example output format:
FileName,FuncCount,AvgInstrPerFunc,APIEntropy,SuspStringCount,MaxRefCount,ObfuscationScore,IsAnomalous
malware1.exe,245,18.7,0.78,12,87,0.65,True
malware2.dll,132,15.2,0.54,5,42,0.32,False
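To feed the dataset into a model, each column needs to be converted from text to a numeric or boolean type. The sketch below does this with the standard library only, using the two example rows above as input; `load_features` is a hypothetical helper, and it assumes the column names and the True/False encoding shown in the example.

```python
import csv
import io
from typing import Dict, List, Union

Value = Union[str, int, float, bool]


def load_features(handle) -> List[Dict[str, Value]]:
    """Parse the feature CSV from an open text handle into typed rows."""
    rows = []
    for raw in csv.DictReader(handle):
        rows.append({
            "FileName": raw["FileName"],
            "FuncCount": int(raw["FuncCount"]),
            "AvgInstrPerFunc": float(raw["AvgInstrPerFunc"]),
            "APIEntropy": float(raw["APIEntropy"]),
            "SuspStringCount": int(raw["SuspStringCount"]),
            "MaxRefCount": int(raw["MaxRefCount"]),
            "ObfuscationScore": float(raw["ObfuscationScore"]),
            # Assumption: the flag is written as the text "True" or "False".
            "IsAnomalous": raw["IsAnomalous"].strip().lower() == "true",
        })
    return rows


# The two example rows from the output format shown above.
SAMPLE = """FileName,FuncCount,AvgInstrPerFunc,APIEntropy,SuspStringCount,MaxRefCount,ObfuscationScore,IsAnomalous
malware1.exe,245,18.7,0.78,12,87,0.65,True
malware2.dll,132,15.2,0.54,5,42,0.32,False
"""

rows = load_features(io.StringIO(SAMPLE))
```

For a file on disk, open it with `open(path, newline="")` and pass the handle to `load_features` in the same way.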
Advantages:
- Fully automated analysis pipeline
- Comprehensive feature extraction
- Integration with Ghidra's powerful decompilation
- Batch processing of multiple files

Limitations:
- Static analysis only (no runtime behavior)
- Heavily packed malware requires manual unpacking
- Resource-intensive for large binaries (>100MB)

Planned improvements:
- Parallel processing for faster analysis
- Integration with dynamic analysis sandboxes
- Interactive dashboard for visualization
- Enhanced feature extraction
Sample output data is available in the repository under output/malware_features.csv. Additional logs and feature extraction results can be found in the output directory.
- malware_analyzer.py: Main Python script for batch processing
- ExtractMalwareFeatures.java: Ghidra script for feature extraction
- output/: Directory containing analysis results and CSV datasets
Project resources available at:
- GitHub: GlgApr
- Contact: 0xAnarki@proton.me