MW-Classifier

This is a student project: a malware classifier built as machine learning research on malware detection and attribution.

All the work is based on the amazing book Malware Data Science!

Project architecture

The project's global architecture is as follows:

mw-classifier/
│
├── main.py                                    # Main application entry point
├── config.yml                                 # Config parameters
├── cached_features.pkl                        # Malware features cached with pickle for fast debugging
├── features/                                  # Contains all feature extractors for malware samples
│   ├── features_extractor.py                  # Feature extraction driver
│   ├── strings.py                             # String-based features
│   └── static_iat.py                          # Static import address table (IAT) features
├── SAMPLES/                                   # Contains malware samples for testing and debugging
├── graphics/                                  # Contains all chart results to evaluate the model (similarity matrix...)
├── engine/                                    # Heart of the project: orchestrates feature extraction, database or caching, and model execution
│   ├── similarity_engine.py                   # Orchestrator of the project execution
│   ├── minhashcustom.py                       # Custom MinHash implementation
│   └── redis_storage.py                       # Redis storage backend
├── models/                                    # Contains all similarity analysis models
│   ├── model_template.py                      # Template class for models
│   ├── model_runner.py                        # Runs a model
│   └── hnsw_search_nearest_neighbor.py        # HNSW nearest-neighbor search
├── utils/                                     # Shared helpers: logger, Config class, Neo4J utilities
│   ├── config.py                              # Config class, so the config is not passed as a parameter to every object
│   ├── logger.py                              # Logging class
│   ├── neo4j_graph.py                         # Neo4J class to manage graph objects
│   └── tools.py                               # Standalone functions (load YAML file...)
├── tests/                                     # Unit tests for the application (to be implemented)
└── requirements.txt                           # Project dependencies
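
The cached_features.pkl entry in the tree above suggests a compute-once, reload-later pattern. Here is a minimal sketch of that idea, not the project's actual code: load_or_compute, its parameters and the Features alias are hypothetical names chosen for illustration.

```python
import pickle
from pathlib import Path
from typing import Callable, Dict, Set

# Hypothetical shape of the cache: sample name -> set of extracted features.
Features = Dict[str, Set[str]]


def load_or_compute(cache_path: Path, compute: Callable[[], Features]) -> Features:
    """Return the cached features if the pickle exists, else compute and cache them."""
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            # Only unpickle files you created yourself: pickle can run arbitrary code.
            return pickle.load(fh)
    features = compute()
    with cache_path.open("wb") as fh:
        pickle.dump(features, fh)
    return features
```

Deleting the pickle file forces a fresh extraction on the next run, which is handy when the extractors change.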

Installation

Install the build dependencies for the systemd Python library with sudo apt install build-essential libsystemd-dev (otherwise you will get ERROR: Failed building wheel for cysystemd).
Clone the repository: git clone
Rename config_template.yml to config.yml and fill in the fields with your own values. To avoid accidentally detonating a sample, make the SAMPLES folder read-only.
Set up a virtual environment and install the requirements:

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
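
For the read-only SAMPLES folder mentioned in the steps above, running chmod -R a-w SAMPLES does the job on Linux. If you prefer to do it from Python, the following is a rough sketch; make_read_only is a hypothetical helper, not part of the project.

```python
import os
import stat
from pathlib import Path


def make_read_only(root: Path) -> None:
    """Clear every write bit on root and on everything below it."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    # Walk bottom-up so children are handled before their parent directory.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            path = Path(dirpath) / name
            path.chmod(path.stat().st_mode & ~write_bits)
    root.chmod(root.stat().st_mode & ~write_bits)
```

Call it once as make_read_only(Path("SAMPLES")). It only removes write permission; it does not touch the read or execute bits.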

To run the complete project with its databases, start the Docker containers with docker compose up -d, then run the project. This is not needed for the Neo4J container, which is started from the code through the Docker Python API; it is only useful if you use Redis or other services. The compose file is built so that you can choose which services to launch: for example, docker compose up -d redis starts only the Redis server.

Run: docker compose up -d neo4j (the default), then python3 main.py

Datasets

To try the project, you can use this dataset, for instance: https://github.com/cyber-research/APTMalware/tree/master

I personally use the Mandiant APT1 dataset and the VX-Underground dataset. Thank you very much for your work ;)

Project's overview

Tree:

features: scripts to extract features from a sample
malware-similarity: scripts to build a similarity graph between malware samples
utils: utility functions
shared_code_analysis: scripts to train a model on a dataset and then detect similarities with a submitted sample
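
To give an idea of what extracting features from a sample can look like, here is a small sketch that pulls printable strings out of a binary, in the spirit of the Unix strings tool. It is an illustration only: extract_strings and the min_len default are assumptions, not the project's own extractor.

```python
import re
from typing import Set


def extract_strings(data: bytes, min_len: int = 5) -> Set[str]:
    """Return the set of printable ASCII runs of at least min_len bytes."""
    # \x20-\x7e covers space through tilde, i.e. printable ASCII.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return {match.decode("ascii") for match in pattern.findall(data)}


sample = b"\x00\x01hello\x00ab\x00CreateFileA\xff"
print(sorted(extract_strings(sample)))  # ['CreateFileA', 'hello']
```

Returning a set rather than a list makes the result directly usable for set-based similarity measures.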

ML functions overview

Jaccard index
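
The Jaccard index of two feature sets is the size of their intersection divided by the size of their union, so it ranges from 0 (nothing shared) to 1 (identical sets). A minimal sketch follows; the function name and the choice to return 0.0 for two empty sets are conventions picked here, not taken from the project.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets share nothing.
        return 0.0
    return len(a & b) / len(union)


print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```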

Similarity Engine

Based on a dataset (here APT1), it builds a similarity graph between malware samples, which can then be grouped into families.
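
The idea can be sketched in plain Python: compare every pair of samples, keep an edge when the Jaccard index reaches a threshold, and read families off the connected components. This is a simplified stand-in that keeps everything in memory instead of Neo4J; similarity_edges, families and the 0.8 default threshold are assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, float]


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_edges(features: Dict[str, Set[str]], threshold: float = 0.8) -> List[Edge]:
    """Link every pair of samples whose Jaccard index reaches the threshold."""
    edges = []
    for name_a, name_b in combinations(sorted(features), 2):
        score = jaccard(features[name_a], features[name_b])
        if score >= threshold:
            edges.append((name_a, name_b, score))
    return edges


def families(features: Dict[str, Set[str]], edges: List[Edge]) -> List[Set[str]]:
    """Group samples into connected components of the similarity graph."""
    adjacency: Dict[str, Set[str]] = {name: set() for name in features}
    for name_a, name_b, _ in edges:
        adjacency[name_a].add(name_b)
        adjacency[name_b].add(name_a)
    seen: Set[str] = set()
    groups = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk of one component
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups
```

The pairwise loop is quadratic in the number of samples, which is fine for a small dataset but is the part an approximate method would replace on a larger one.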

Start it with python3 main.py; the Neo4J container is started from the Python code.
