A complete and modular system for phishing detection, combining traditional machine learning, deep learning, and handcrafted URL feature engineering.
For a fast hands-on experience, check out the following website where you can test the phishing detection model:
MGCF-Net/
│
├── dataset/            # Balanced dataset (phishing & legitimate URLs)
├── src/                # Main codebase
│   ├── dl.py           # Main script (implements the deep learning architectures)
│   ├── ml.py           # Traditional ML models (SVM, RF, NB)
│   ├── dl_test.py      # Runs a pretrained model on the test set
│   ├── dl_run.py       # Runs the web demo
│   └── ...             # Supporting modules
├── requirements.txt    # Python dependencies
└── README.md           # Project overview
git clone https://github.com/1Hun0ter1/MGCF-Net.git
cd MGCF-Net
Ensure you have Python ≥ 3.8 and install dependencies:
conda env create -f environment.yaml -n MGCF-Net
cd src
CUDA_VISIBLE_DEVICES=0 python dl.py \
-ep 20 \
-bs 1000 \
-arch MGCF_Net \
-wd 1e-3 \
-feature "word-level" \
-lr 'cosine' \
-nw 50000 \
-data "balanced_dataset" \
-enhanced False
| Parameter | Description |
|---|---|
| -ep | Number of training epochs (e.g., 20) |
| -bs | Batch size for both training and testing (e.g., 1000) |
| -arch | Architecture name, e.g., MGCF_Net, DeepCNN_Light_Hybrid, DeepCNN_Light_V2_2, rnn, brnn, cnn_base, etc. |
| -wd | Weight decay (L2 regularization strength), e.g., 1e-3 |
| -feature | Feature extraction method: char-level, word-level, TF-IDF, n-grams |
| -lr | Learning rate scheduler: none, cosine, exponential |
| -nw | Number of words to consider as features (e.g., 50000) |
| -data | Dataset type, e.g., "balanced_dataset" |
| -enhanced | Enable adversarial data enhancement (True/False) |
Note: Results are saved automatically under test_results/custom/... in timestamped folders.
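To make the flag table concrete, here is a minimal sketch of how such a command line could be parsed with `argparse`. It is not taken from dl.py: the flag names come from the table above, while the defaults, types, and the `str2bool` and `build_parser` helpers are assumptions for illustration. The sketch also shows why a value like `-enhanced False` needs an explicit converter, since plain `bool("False")` evaluates to `True`.

```python
import argparse


def str2bool(value: str) -> bool:
    """Convert 'True'/'False' style strings; bool('False') would be True."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: defaults and types are assumptions, not the repo's.
    parser = argparse.ArgumentParser(description="Train a phishing URL detector")
    parser.add_argument("-ep", type=int, default=20, help="training epochs")
    parser.add_argument("-bs", type=int, default=1000, help="batch size")
    parser.add_argument("-arch", type=str, default="MGCF_Net", help="architecture name")
    parser.add_argument("-wd", type=float, default=1e-3, help="weight decay")
    parser.add_argument("-feature", type=str, default="word-level",
                        choices=["char-level", "word-level", "TF-IDF", "n-grams"])
    parser.add_argument("-lr", type=str, default="cosine",
                        choices=["none", "cosine", "exponential"])
    parser.add_argument("-nw", type=int, default=50000, help="vocabulary size")
    parser.add_argument("-data", type=str, default="balanced_dataset")
    parser.add_argument("-enhanced", type=str2bool, default=False,
                        help="enable adversarial data enhancement")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-ep", "20", "-lr", "cosine", "-enhanced", "False"])
print(args.ep, args.lr, args.enhanced)  # prints: 20 cosine False
```

Restricting `-feature` and `-lr` with `choices` makes a mistyped value fail at startup rather than partway through a training run.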
cd src
CUDA_VISIBLE_DEVICES=0 python ml.py -model "SVM"
Supported values for -model:
"SVM"
"RandomForest"
"LogisticRegression"
"KNeighbors"
cd src
python dl_run.py
This launches a web page for checking URLs individually, in batches, or from an uploaded file, with your trained model running in the backend.
cd src
python dl_test.py -m path/to/model_all.keras -r path/to/result_dir/
MGCF-Net is a custom-designed neural architecture for phishing URL detection. It fuses multi-granular textual features with handcrafted and domain-aware signals via a Cross-Attentive Fusion Mechanism.
- Multi-Granular Feature Fusion
  Combines char/word embeddings, manual heuristics, and domain reputation features.
- Local + Global Context Modeling
  Captures semantic signals via:
  CNN: for local n-gram patterns
  BiLSTM: for sequential URL dependencies
- Cross-Attention Fusion Layer
  Aligns semantic representations with domain-specific signals to strengthen detection of disguised or adversarial phishing URLs.
- Adversarial Robustness
  Trains with auto-generated attack URLs (e.g., typo-squatting, fake subdomains) for real-world simulation.
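The adversarial enhancement mentioned in the last item can be pictured with a small sketch that derives attack-style variants from a legitimate URL. This is not the repository's generator: `typo_squat`, `fake_subdomain`, `make_variants`, and the `example-attacker.test` domain are hypothetical names. The code naively treats the second-to-last host label as the brand name and drops any port from the URL.

```python
import random
from typing import List
from urllib.parse import urlsplit, urlunsplit


def typo_squat(host: str, rng: random.Random) -> str:
    """Swap two adjacent, different characters in the (naively chosen) brand label."""
    labels = host.split(".")
    idx = max(len(labels) - 2, 0)  # e.g. 'paypal' in 'www.paypal.com'
    label = labels[idx]
    candidates = [i for i in range(len(label) - 1) if label[i] != label[i + 1]]
    if not candidates:
        return host  # nothing to swap, return the host unchanged
    i = rng.choice(candidates)
    labels[idx] = label[:i] + label[i + 1] + label[i] + label[i + 2:]
    return ".".join(labels)


def fake_subdomain(host: str, attacker_domain: str = "example-attacker.test") -> str:
    """Embed the legitimate host as a subdomain of an attacker-controlled domain."""
    return f"{host}.{attacker_domain}"


def make_variants(url: str, seed: int = 0) -> List[str]:
    """Return one typo-squatted and one fake-subdomain variant of the URL."""
    rng = random.Random(seed)  # seeded so the variants are reproducible
    parts = urlsplit(url)
    host = parts.hostname or ""
    new_hosts = [typo_squat(host, rng), fake_subdomain(host)]
    return [
        urlunsplit((parts.scheme, new_host, parts.path, parts.query, parts.fragment))
        for new_host in new_hosts
    ]


for variant in make_variants("https://www.paypal.com/login"):
    print(variant)
```

Variants like these would be labeled as phishing and mixed into the training set, so the model sees near-miss lookalikes of legitimate domains.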
[url_sequence, manual_domain_features]
- url_sequence: tokenized from the raw URL (char-level / word-level)
- manual_domain_features: 9D feature vector that includes heuristic and PageRank-derived metrics
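As a rough illustration of how the two inputs above could be built, the sketch below encodes a URL as a padded sequence of word-level token ids and computes a few handcrafted features. All names (`word_tokens`, `build_vocab`, `encode_url`, `manual_features`) are hypothetical. The seven features shown are examples only: the 9 features the project actually uses are not listed here, and PageRank-derived metrics are omitted because they need external data.

```python
import re
from collections import Counter
from typing import Dict, List
from urllib.parse import urlsplit

PAD_ID, OOV_ID = 0, 1  # reserved ids for padding and unknown tokens


def word_tokens(url: str) -> List[str]:
    """Word-level tokenization: split on every non-alphanumeric character."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", url.lower()) if tok]


def build_vocab(urls: List[str], num_words: int) -> Dict[str, int]:
    """Keep the num_words most frequent tokens (compare the -nw flag)."""
    counts = Counter(tok for url in urls for tok in word_tokens(url))
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(num_words))}


def encode_url(url: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map tokens to ids, truncate to max_len, and right-pad with PAD_ID."""
    ids = [vocab.get(tok, OOV_ID) for tok in word_tokens(url)][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


def manual_features(url: str) -> List[float]:
    """Illustrative heuristics only; not the project's 9D feature set."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return [
        float(len(url)),                                            # URL length
        float(host.count(".")),                                     # dots in host
        float(sum(ch.isdigit() for ch in url)),                     # digit count
        float("@" in url),                                          # '@' present
        float("-" in host),                                         # hyphen in host
        float(parts.scheme == "https"),                             # uses HTTPS
        float(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),  # IPv4 host
    ]


corpus = ["https://secure-login.example.com/account?id=42", "http://example.com/"]
vocab = build_vocab(corpus, num_words=50)
url_sequence = encode_url("http://example.com/unknown", vocab, max_len=6)
manual_domain_features = manual_features("http://192.168.0.1/login-update")
print(url_sequence, manual_domain_features)
```

Keeping ids 0 and 1 reserved lets a downstream embedding layer mask padding and share a single vector across all out-of-vocabulary tokens.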
You can download the balanced phishing/legitimate URL dataset, collected from Common Crawl and URLhaus, as well as pretrained deep learning model checkpoints from the following link:
Link: https://pan.baidu.com/s/1_4wVWxnYk4OoVasEJDXjnQ?pwd=5evu
Access Code: 5evu
@program{SSS-CW2025Huang,
title={Multi-Granular Context Fusion Network for Phishing URLs Detection},
author={Hao Huang and Chuyu Zhao and Mingshu Tan and Zhuyi Li and Tianshu Wen and Zijie Chen and Yu Meng and Yitong Zhou},
year={2025}
}
If you find this project useful, consider citing or starring the repo.
This project also benefits from insights and ideas found in related open-source efforts in the phishing detection community, including prior work on dataset structuring and evaluation pipelines (e.g., dephides).