DOI: 10.1145/3375708.3380313
research-article
Public Access

Diverse Datasets and a Customizable Benchmarking Framework for Phishing

Published: 16 March 2020

Abstract

Phishing is a challenging problem that many researchers have addressed using a wide variety of datasets and techniques [8]. When presenting new features or approaches, researchers usually test their proposed methods with a limited set of metrics, datasets, and parameters. Hence, the need arises for a benchmarking framework and datasets to evaluate such systems as comprehensively as possible. In this paper, we discuss: (i) our efforts on the creation and dissemination of diverse and representative datasets for phishing email, website, and URL detection, and (ii) PhishBench, our framework for benchmarking phishing detection systems. PhishBench allows researchers to evaluate and compare features and classification approaches easily and efficiently on the provided data.
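
To make the idea of benchmarking concrete, the following is a minimal sketch of scoring several detectors on the same labeled data with several metrics. It is not PhishBench's API: the toy URLs, the three rule-based detectors, and every function name here are hypothetical and chosen only for illustration. The data is deliberately imbalanced (2 phishing, 6 legitimate) to show why accuracy alone can mislead.

```python
from math import sqrt

# Hypothetical toy data: (url, label), where 1 = phishing and 0 = legitimate.
DATA = [
    ("http://paypa1-login.example.com/verify", 1),
    ("http://192.0.2.7/secure/update", 1),
    ("https://example.org/", 0),
    ("https://example.com/docs", 0),
    ("https://example.net/login", 0),
    ("https://example.edu/courses", 0),
    ("https://example.org/about", 0),
    ("https://example.com/verify-email", 0),
]


# Three hypothetical detectors; each maps a URL to a 0/1 prediction.
def keyword_rule(url):
    return int("login" in url or "verify" in url)


def scheme_rule(url):
    return int(not url.startswith("https://"))


def always_legit(url):
    return 0


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def metrics(y_true, y_pred):
    """Compute several metrics from the binary confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }


def benchmark(detectors, data):
    """Score every detector on the same data with the same metrics."""
    urls = [url for url, _ in data]
    labels = [label for _, label in data]
    return {
        name: metrics(labels, [detect(url) for url in urls])
        for name, detect in detectors.items()
    }


DETECTORS = {
    "keyword_rule": keyword_rule,
    "scheme_rule": scheme_rule,
    "always_legit": always_legit,
}

results = benchmark(DETECTORS, DATA)

if __name__ == "__main__":
    for name, scores in results.items():
        row = "  ".join(f"{key}={value:.3f}" for key, value in scores.items())
        print(f"{name:<13} {row}")
```

On this toy data the detector that never flags anything reaches 75% accuracy while detecting no phishing URLs at all. Its recall and Matthews correlation coefficient are both 0 and its balanced accuracy is 0.5, which illustrates why reporting several metrics matters on imbalanced datasets.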

References

[1]
Ayman El Aassal, Luis Moraes, Shahryar Baki, Avisha Das, and Rakesh Verma. 2018. Anti-Phishing Pilot at ACM IWSPA 2018: Evaluating Performance with New Metrics for Unbalanced Datasets. In Proc. of IWSPA-AP: Anti-Phishing Shared Task Pilot at the 4th ACM IWSPA. 2--10. http://ceur-ws.org/Vol-2124/#anti-phishing-pilot
[2]
Neda Abdelhamid, Aladdin Ayesh, and Fadi Thabtah. 2014. Phishing detection based associative classification data mining. Expert Systems with Applications, Vol. 41, 13 (2014), 5948--5959.
[3]
Andronicus A. Akinyelu and Aderemi O. Adewumi. 2014. Classification of Phishing Email Using Random Forest Machine Learning Technique. Journal of Applied Mathematics, Vol. 2014 (2014), 1--6. https://doi.org/10.1155/2014/425731
[4]
Alexa. 2019. Alexa Top Sites. https://aws.amazon.com/alexa-top-sites/
[5]
Adithya Balaji and Alexander Allen. 2018. Benchmarking Automatic Machine Learning Frameworks. (2018).
[6]
Carlo Batini, Cinzia Cappiello, Chiara Francalanci, and Andrea Maurino. 2009. Methodologies for Data Quality Assessment and Improvement. ACM Comput. Surv., Vol. 41, 3 (July 2009), 16:1--16:52.
[7]
André Bergholz, Jan De Beer, Sebastian Glahn, Marie-Francine Moens, Gerhard Paass, and Siehyun Strobel. 2010. New filtering approaches for phishing email. Journal of Computer Security, Vol. 18 (2010), 7--35.
[8]
A. Das, S. Baki, A. El Aassal, R. Verma, and A. Dunbar. 2019. SoK: A Comprehensive Reexamination of Phishing Research from the Security Perspective. IEEE Communications Surveys & Tutorials (2019), 1--1. https://doi.org/10.1109/COMST.2019.2957750
[9]
D. G. Dobolyi and A. Abbasi. 2016. PhishMonger: A free and open source public archive of real-world phishing websites. In 2016 IEEE Conference on Intelligence and Security Informatics (ISI). IEEE, Tucson, AZ, USA, 31--36. https://doi.org/10.1109/ISI.2016.7745439
[10]
Ayman El Aassal, Shahryar Baki, Avisha Das, and Rakesh Verma. 2020. An In-Depth Benchmarking and Evaluation of Phishing Detection Research for Security Needs. IEEE Access (2020).
[11]
Ian Fette, Norman Sadeh, and Anthony Tomasic. 2007. Learning to Detect Phishing Emails. In Proceedings of the 16th International Conference on World Wide Web (WWW '07). Association for Computing Machinery, New York, NY, USA, 649--656. https://doi.org/10.1145/1242572.1242660
[12]
Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. 2015. Efficient and Robust Automated Machine Learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'15). MIT Press, Cambridge, MA, USA, 2755--2763. http://dl.acm.org/citation.cfm?id=2969442.2969547
[13]
P. J. A. Gijsbers, Erin LeDell, Janek Thomas, Sébastien Poirier, Bernd Bischl, and Joaquin Vanschoren. 2019. An Open Source AutoML Benchmark. (2019).
[14]
I. R. A. Hamid and J. Abawajy. 2011. Phishing Email Feature Selection Approach. In 2011 IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications. IEEE, Changsha, China, 916--921. https://doi.org/10.1109/TrustCom.2011.126
[15]
I. R. A. Hamid and J. H. Abawajy. 2013. Profiling Phishing Email Based on Clustering Approach. In 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications. IEEE, Melbourne, VIC, Australia, 628--635. https://doi.org/10.1109/TrustCom.2013.76
[16]
YuFei Han and Yun Shen. 2016. Accurate Spear Phishing Campaign Attribution and Early Detection. In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC '16). Association for Computing Machinery, New York, NY, USA, 2079--2086. https://doi.org/10.1145/2851613.2851801
[17]
Cheng Huang, Shuang Hao, Luca Invernizzi, Jiayong Liu, Yong Fang, Christopher Kruegel, and Giovanni Vigna. 2017. Gossip: Automatically Identifying Malicious Domains from Mailing List Discussions. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (ASIA CCS '17). Association for Computing Machinery, New York, NY, USA, 494--505. https://doi.org/10.1145/3052973.3053017
[18]
M. Khonji, Y. Iraqi, and A. Jones. 2011. Lexical URL analysis for discriminating phishing and legitimate e-mail messages. In 2011 International Conference for Internet Technology and Secured Transactions. IEEE, Abu Dhabi, United Arab Emirates, 422--427.
[19]
Hung Le, Quang Pham, Doyen Sahoo, and Steven CH Hoi. 2018. URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection. (2018).
[20]
Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. 2017. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of Machine Learning Research, Vol. 18, 17 (2017), 1--5. http://jmlr.org/papers/v18/16-365
[21]
Wei Liu, Sanjay Chawla, David A. Cieslak, and Nitesh V. Chawla. 2010. A robust decision tree algorithm for imbalanced data sets. 766--777.
[22]
Jose Nazario. 2004. The online phishing corpus. https://monkey.org/~jose/phishing/
[23]
Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore. 2016. Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. In Proceedings of the Genetic and Evolutionary Computation Conference 2016 (GECCO '16). ACM, New York, NY, USA, 485--492. https://doi.org/10.1145/2908812.2908918
[24]
OpenDNS-PhishTank. 2012. The PhishTank Database. http://www.phishtank.com/developer_info.php
[25]
OpenPhish. 2019. OpenPhish. https://openphish.com/index.html
[26]
Rami Mustafa A. Mohammad, Lee McCluskey, and Fadi Thabtah. 2015. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/phishing
[27]
[28]
Gerard Salton and Michael J. McGill. 1986. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA.
[29]
Choon Lin Tan. 2018. Phishing Dataset for Machine Learning: Feature Evaluation. http://dx.doi.org/10.17632/h3cgnj8hft.1#file-286768bb-83f2-4e59-9210-6fed84e3c7fd
[30]
Rakesh Verma and Avisha Das (Eds.). 2018. Proceedings of the 1st Anti-phishing Shared Pilot at 4th ACM IWSPA (IWSPA-AP). CEUR. http://ceur-ws.org/Vol-2124/.
[31]
R. Verma and N. Rai. 2015. Phish-IDetector: Message-ID based automatic phishing detection. In 2015 12th International Joint Conference on e-Business and Telecommunications (ICETE), Vol. 04. IEEE, Colmar, France, 427--434.
[32]
Rakesh Verma, Narasimha Shashidhar, and Nabil Hossain. 2012. Detecting Phishing Emails the Natural Language Way. In Computer Security -- ESORICS 2012, Sara Foresti, Moti Yung, and Fabio Martinelli (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 824--841.
[33]
Rakesh M. Verma and Nabil Hossain. 2014. Semantic Feature Selection for Text with Application to Phishing Email Detection. In Information Security and Cryptology -- ICISC 2013. Springer International Publishing, Seoul, Korea, 455--468.
[34]
Rakesh M. Verma and David Marchette. 2019. Cybersecurity Analytics. Chapman and Hall/CRC, Boca Raton/London.
[35]
Rakesh M. Verma, Victor Zeng, and Houtan Faridi. 2019. Data Quality for Security Challenges: Case Studies of Phishing, Malware and Intrusion Detection Datasets. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (CCS '19). Association for Computing Machinery, New York, NY, USA, 2605--2607. https://doi.org/10.1145/3319535.3363267
[36]
Yue Wu, Steven C.H. Hoi, Chenghao Liu, Jing Lu, Doyen Sahoo, and Nenghai Yu. 2017. SOL: A library for scalable online learning algorithms. Neurocomputing, Vol. 260 (2017), 9--12. https://doi.org/10.1016/j.neucom.2017.03.077
[37]
J. Yearwood, M. Mammadov, and A. Banerjee. 2010. Profiling Phishing Emails Based on Hyperlink Information. In 2010 International Conference on Advances in Social Networks Analysis and Mining. IEEE, Odense, Denmark, 120--127. https://doi.org/10.1109/ASONAM.2010.56

Information & Contributors

Information

Published In

IWSPA '20: Proceedings of the Sixth International Workshop on Security and Privacy Analytics
March 2020
84 pages
ISBN:9781450371155
DOI:10.1145/3375708
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. automatic framework
  2. deception
  3. machine learning
  4. phishing
  5. social engineering

Qualifiers

  • Research-article

Conference

CODASPY '20

Acceptance Rates

Overall Acceptance Rate 18 of 58 submissions, 31%

Cited By

  • (2024) Can Features for Phishing URL Detection Be Trusted Across Diverse Datasets? A Case Study with Explainable AI. Proceedings of the 11th International Conference on Networking, Systems, and Security, 137-145. DOI: 10.1145/3704522.3704532. Online publication date: 19-Dec-2024
  • (2024) LLMs for Explainable Few-shot Deception Detection. Proceedings of the 10th ACM International Workshop on Security and Privacy Analytics, 37-47. DOI: 10.1145/3643651.3659898. Online publication date: 21-Jun-2024
  • (2024) WikiPhish: A Diverse Wikipedia-Based Dataset for Phishing Website Detection: Data/Toolset Paper. Proceedings of the Fourteenth ACM Conference on Data and Application Security and Privacy, 361-366. DOI: 10.1145/3626232.3653283. Online publication date: 19-Jun-2024
  • (2024) AI and Network Security Curricula: Minding the Gap. 2024 IEEE Global Engineering Education Conference (EDUCON), 1-7. DOI: 10.1109/EDUCON60312.2024.10578588. Online publication date: 8-May-2024
  • (2024) LogiTriBlend: A Novel Hybrid Stacking Approach for Enhanced Phishing Email Detection Using ML Models and Vectorization Approach. IEEE Access, Vol. 12, 193807-193821. DOI: 10.1109/ACCESS.2024.3518923. Online publication date: 2024
  • (2024) Phishing Email Detection: Survey. Recent Trends and Advances in Artificial Intelligence, 551-570. DOI: 10.1007/978-3-031-70924-1_42. Online publication date: 22-Nov-2024
  • (2023) Data Acquisition and Corpus Creation for Phishing Detection. 2023 46th MIPRO ICT and Electronics Convention (MIPRO), 533-538. DOI: 10.23919/MIPRO57284.2023.10159904. Online publication date: 22-May-2023
  • (2023) Performance Evaluation of Phishing Classification Techniques on Various Data Sources and Schemes. IEEE Access, Vol. 11, 38721-38738. DOI: 10.1109/ACCESS.2022.3225971. Online publication date: 2023
  • (2022) Does Deception Leave a Content Independent Stylistic Trace? Proceedings of the Twelfth ACM Conference on Data and Application Security and Privacy, 349-351. DOI: 10.1145/3508398.3519358. Online publication date: 14-Apr-2022
  • (2022) Combining Long-Term Recurrent Convolutional and Graph Convolutional Networks to Detect Phishing Sites Using URL and HTML. IEEE Access, Vol. 10, 82355-82375. DOI: 10.1109/ACCESS.2022.3196018. Online publication date: 2022
