Diverse Datasets and a Customizable Benchmarking Framework for Phishing

Authors:

Ayman El Aassal,

Luis Felipe Teixeira De Moraes,

Avisha Das

IWSPA '20: Proceedings of the Sixth International Workshop on Security and Privacy Analytics

Pages 35 - 41

https://doi.org/10.1145/3375708.3380313

Published: 16 March 2020

Abstract

Phishing is a challenging problem that many researchers have addressed in several papers using many different datasets and techniques [8]. Researchers usually test their proposed methods with limited metrics, datasets, and parameters when presenting new features or approaches. Hence, the need arises for a benchmarking framework and datasets to evaluate such systems as comprehensively as possible. In this paper, we discuss: (i) our efforts on the creation and dissemination of diverse and representative datasets for phishing email, website, and URL detection, and (ii) PhishBench, our framework for benchmarking phishing detection systems. PhishBench allows researchers to evaluate and compare features and classification approaches easily and efficiently on the provided data.
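
To illustrate why evaluating with a limited set of metrics can mislead, the following is a minimal sketch in plain Python. It is not PhishBench code and does not reflect its API; the function names (`confusion_counts`, `compute_metrics`) and the toy labels are assumptions made up for this illustration. It scores two hypothetical classifiers on an imbalanced toy set (5 phishing, 95 legitimate) with several metrics at once.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def compute_metrics(y_true, y_pred):
    """Return several metrics; undefined ratios (zero denominator) become 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)

    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Toy, made-up labels: 1 = phishing, 0 = legitimate (5 vs. 95 samples).
y_true = [1] * 5 + [0] * 95

predictions = {
    # Hypothetical classifier that never flags anything.
    "always_legitimate": [0] * 100,
    # Hypothetical detector: catches 4 of 5 phishing samples, 10 false alarms.
    "detector": [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85,
}

results = {name: compute_metrics(y_true, pred) for name, pred in predictions.items()}

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

In this toy setting, accuracy alone ranks the classifier that never flags phishing above the one that catches most of it, while recall, balanced accuracy, and MCC rank them the other way. That is the kind of discrepancy a multi-metric benchmark is meant to expose.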

References

[1]

Ayman El Aassal, Luis Moraes, Shahryar Baki, Avisha Das, and Rakesh Verma. 2018. Anti-Phishing Pilot at ACM IWSPA 2018: Evaluating Performance with New Metrics for Unbalanced Datasets. In Proc. of IWSPA-AP: Anti-Phishing Shared Task Pilot at the 4th ACM IWSPA. 2--10. http://ceur-ws.org/Vol-2124/#anti-phishing-pilot

[2]

Neda Abdelhamid, Aladdin Ayesh, and Fadi Thabtah. 2014. Phishing detection based associative classification data mining. Expert Systems with Applications, Vol. 41, 13 (2014), 5948--5959.

[3]

Andronicus A. Akinyelu and Aderemi O. Adewumi. 2014. Classification of Phishing Email Using Random Forest Machine Learning Technique. Journal of Applied Mathematics, Vol. 2014 (2014), 1--6. https://doi.org/10.1155/2014/425731

[4]

Alexa. 2019. Alexa Top Sites. https://aws.amazon.com/alexa-top-sites/

[5]

Adithya Balaji and Alexander Allen. 2018. Benchmarking Automatic Machine Learning Frameworks. (2018).

[6]

Carlo Batini, Cinzia Cappiello, Chiara Francalanci, and Andrea Maurino. 2009. Methodologies for Data Quality Assessment and Improvement. ACM Comput. Surv., Vol. 41, 3 (July 2009), 16:1--16:52.

[7]

André Bergholz, Jan De Beer, Sebastian Glahn, Marie-Francine Moens, Gerhard Paass, and Siehyun Strobel. 2010. New filtering approaches for phishing email. Journal of Computer Security, Vol. 18 (2010), 7--35.

[8]

A. Das, S. Baki, A. El Aassal, R. Verma, and A. Dunbar. 2019. SoK: A Comprehensive Reexamination of Phishing Research from the Security Perspective. IEEE Communications Surveys & Tutorials (2019), 1--1. https://doi.org/10.1109/COMST.2019.2957750

[9]

D. G. Dobolyi and A. Abbasi. 2016. PhishMonger: A free and open source public archive of real-world phishing websites. In 2016 IEEE Conference on Intelligence and Security Informatics (ISI). IEEE, Tucson, AZ, USA, 31--36. https://doi.org/10.1109/ISI.2016.7745439

[10]

Ayman El Aassal, Shahryar Baki, Avisha Das, and Rakesh Verma. 2020. An In-Depth Benchmarking and Evaluation of Phishing Detection Research for Security Needs. IEEE Access (2020).

[11]

Ian Fette, Norman Sadeh, and Anthony Tomasic. 2007. Learning to Detect Phishing Emails. In Proceedings of the 16th International Conference on World Wide Web (WWW '07). Association for Computing Machinery, New York, NY, USA, 649--656. https://doi.org/10.1145/1242572.1242660

[12]

Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. 2015. Efficient and Robust Automated Machine Learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'15). MIT Press, Cambridge, MA, USA, 2755--2763. http://dl.acm.org/citation.cfm?id=2969442.2969547

[13]

P. J. A. Gijsbers, Erin LeDell, Janek Thomas, Sébastien Poirier, Bernd Bischl, and Joaquin Vanschoren. 2019. An Open Source AutoML Benchmark. (2019).

[14]

I. R. A. Hamid and J. Abawajy. 2011. Phishing Email Feature Selection Approach. In 2011 IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications. IEEE, Changsha, China, 916--921. https://doi.org/10.1109/TrustCom.2011.126

[15]

I. R. A. Hamid and J. H. Abawajy. 2013. Profiling Phishing Email Based on Clustering Approach. In 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications. IEEE, Melbourne, VIC, Australia, 628--635. https://doi.org/10.1109/TrustCom.2013.76

[16]

YuFei Han and Yun Shen. 2016. Accurate Spear Phishing Campaign Attribution and Early Detection. In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC '16). Association for Computing Machinery, New York, NY, USA, 2079--2086. https://doi.org/10.1145/2851613.2851801

[17]

Cheng Huang, Shuang Hao, Luca Invernizzi, Jiayong Liu, Yong Fang, Christopher Kruegel, and Giovanni Vigna. 2017. Gossip: Automatically Identifying Malicious Domains from Mailing List Discussions. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (ASIA CCS '17). Association for Computing Machinery, New York, NY, USA, 494--505. https://doi.org/10.1145/3052973.3053017

[18]

M. Khonji, Y. Iraqi, and A. Jones. 2011. Lexical URL analysis for discriminating phishing and legitimate e-mail messages. In 2011 International Conference for Internet Technology and Secured Transactions. IEEE, Abu Dhabi, United Arab Emirates, 422--427.

[19]

Hung Le, Quang Pham, Doyen Sahoo, and Steven CH Hoi. 2018. URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection. (2018).

[20]

Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. 2017. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of Machine Learning Research, Vol. 18, 17 (2017), 1--5. http://jmlr.org/papers/v18/16-365

[21]

Wei Liu, Sanjay Chawla, David A Cieslak, and Nitesh V Chawla. 2010. A robust decision tree algorithm for imbalanced data sets. 766--777.

[22]

Jose Nazario. 2004. The online phishing corpus. https://monkey.org/~jose/phishing/

[23]

Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore. 2016. Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. In Proceedings of the Genetic and Evolutionary Computation Conference 2016 (GECCO '16). ACM, New York, NY, USA, 485--492. https://doi.org/10.1145/2908812.2908918

[24]

OpenDNS-PhishTank. 2012. The PhishTank Database. http://www.phishtank.com/developer_info.php

[25]

OpenPhish. 2019. OpenPhish. https://openphish.com/index.html

[26]

Mustafa A Mohammad Rami, McCluskey Lee, and Thabtah Fadi. 2015. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/phishing+websites

[28]

Gerard Salton and Michael J. McGill. 1986. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA.

[29]

Choon Lin Tan. 2018. Phishing Dataset for Machine Learning: Feature Evaluation. http://dx.doi.org/10.17632/h3cgnj8hft.1#file-286768bb-83f2-4e59-9210-6fed84e3c7fd

[30]

Rakesh Verma and Avisha Das (Eds.). 2018. Proceedings of the 1st Anti-phishing Shared Pilot at 4th ACM IWSPA (IWSPA-AP). CEUR. http://ceur-ws.org/Vol-2124/

[31]

R. Verma and N. Rai. 2015. Phish-IDetector: Message-ID based automatic phishing detection. In 2015 12th International Joint Conference on e-Business and Telecommunications (ICETE), Vol. 04. IEEE, Colmar, France, 427--434.

[32]

Rakesh Verma, Narasimha Shashidhar, and Nabil Hossain. 2012. Detecting Phishing Emails the Natural Language Way. In Computer Security -- ESORICS 2012, Sara Foresti, Moti Yung, and Fabio Martinelli (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 824--841.

[33]

Rakesh M. Verma and Nabil Hossain. 2014. Semantic Feature Selection for Text with Application to Phishing Email Detection. In Information Security and Cryptology -- ICISC 2013. Springer International Publishing, Seoul, Korea, 455--468.

[34]

Rakesh M. Verma and David Marchette. 2019. Cybersecurity Analytics. Chapman and Hall/CRC, Boca Raton/London.

[35]

Rakesh M. Verma, Victor Zeng, and Houtan Faridi. 2019. Data Quality for Security Challenges: Case Studies of Phishing, Malware and Intrusion Detection Datasets. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (CCS '19). Association for Computing Machinery, New York, NY, USA, 2605--2607. https://doi.org/10.1145/3319535.3363267

[36]

Yue Wu, Steven C.H. Hoi, Chenghao Liu, Jing Lu, Doyen Sahoo, and Nenghai Yu. 2017. SOL: A library for scalable online learning algorithms. Neurocomputing, Vol. 260 (2017), 9--12. https://doi.org/10.1016/j.neucom.2017.03.077

[37]

J. Yearwood, M. Mammadov, and A. Banerjee. 2010. Profiling Phishing Emails Based on Hyperlink Information. In 2010 International Conference on Advances in Social Networks Analysis and Mining. IEEE, Odense, Denmark, 120--127. https://doi.org/10.1109/ASONAM.2010.56

Index Terms

Diverse Datasets and a Customizable Benchmarking Framework for Phishing
1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
2. Security and privacy
  1. Intrusion/anomaly detection and malware mitigation
    1. Social engineering attacks
      1. Phishing

Information & Contributors

Information

Published In

IWSPA '20: Proceedings of the Sixth International Workshop on Security and Privacy Analytics

March 2020

84 pages

ISBN:9781450371155

DOI:10.1145/3375708

General Chair: Rakesh Verma (University of Houston, USA)

Program Chairs: Latifur Khan (UT Dallas, USA), Chilukuri K. Mohan (Syracuse University, USA)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGSAC: ACM Special Interest Group on Security, Audit, and Control

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 March 2020

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

CODASPY '20

Sponsor: SIGSAC

CODASPY '20: Tenth ACM Conference on Data and Application Security and Privacy

March 18, 2020

New Orleans, LA, USA

Acceptance Rates

Overall Acceptance Rate 18 of 58 submissions, 31%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 24
Total Downloads: 1,312
Downloads (Last 12 months): 471
Downloads (Last 6 weeks): 54

Reflects downloads up to 31 Jan 2025

Citations

Cited By

Mia M, Derakhshan D, Pritom M (2024) Can Features for Phishing URL Detection Be Trusted Across Diverse Datasets? A Case Study with Explainable AI. Proceedings of the 11th International Conference on Networking, Systems, and Security, 137-145. Online publication date: 19-Dec-2024
https://dl.acm.org/doi/10.1145/3704522.3704532
Boumber D, Tuck B, Verma R, Qachfar F, Hu H, Sung A, Verma R (2024) LLMs for Explainable Few-shot Deception Detection. Proceedings of the 10th ACM International Workshop on Security and Privacy Analytics, 37-47. Online publication date: 21-Jun-2024
https://dl.acm.org/doi/10.1145/3643651.3659898
Loiseau G, Lefils V, Meyer M, Riquet D, Vilela J, Schulmann H, Li N (2024) WikiPhish: A Diverse Wikipedia-Based Dataset for Phishing Website Detection: Data/Toolset Paper. Proceedings of the Fourteenth ACM Conference on Data and Application Security and Privacy, 361-366. Online publication date: 19-Jun-2024
https://dl.acm.org/doi/10.1145/3626232.3653283
Alomar B, Trabelsi Z, Qayyum T, Ambali Parambil M (2024) AI and Network Security Curricula: Minding the Gap. 2024 IEEE Global Engineering Education Conference (EDUCON), 1-7. Online publication date: 8-May-2024
https://doi.org/10.1109/EDUCON60312.2024.10578588
Khalid A, Hanif M, Hameed A, Ashraf Z, Alnfiai M, Alnefaie S (2024) LogiTriBlend: A Novel Hybrid Stacking Approach for Enhanced Phishing Email Detection Using ML Models and Vectorization Approach. IEEE Access, Vol. 12, 193807-193821. Online publication date: 2024
https://doi.org/10.1109/ACCESS.2024.3518923
MohamedAli R, Abduhameed R (2024) Phishing Email Detection: Survey. Recent Trends and Advances in Artificial Intelligence, 551-570. Online publication date: 22-Nov-2024
https://doi.org/10.1007/978-3-031-70924-1_42
Dunđer I, Seljan S, Odak M (2023) Data Acquisition and Corpus Creation for Phishing Detection. 2023 46th MIPRO ICT and Electronics Convention (MIPRO), 533-538. Online publication date: 22-May-2023
https://doi.org/10.23919/MIPRO57284.2023.10159904
Abdillah R, Shukur Z, Mohd M, Murah T, Oh I, Yim K (2023) Performance Evaluation of Phishing Classification Techniques on Various Data Sources and Schemes. IEEE Access, Vol. 11, 38721-38738. Online publication date: 2023
https://doi.org/10.1109/ACCESS.2022.3225971
Zeng V, Liu X, Verma R, Joshi A, Fernandez M, Verma R (2022) Does Deception Leave a Content Independent Stylistic Trace? Proceedings of the Twelfth ACM Conference on Data and Application Security and Privacy, 349-351. Online publication date: 14-Apr-2022
https://dl.acm.org/doi/10.1145/3508398.3519358
Ariyadasa S, Fernando S, Fernando S (2022) Combining Long-Term Recurrent Convolutional and Graph Convolutional Networks to Detect Phishing Sites Using URL and HTML. IEEE Access, Vol. 10, 82355-82375. Online publication date: 2022
https://doi.org/10.1109/ACCESS.2022.3196018
