
research-article

CANTINA+: A Feature-Rich Machine Learning Framework for Detecting Phishing Web Sites

Authors:

Carolyn P. Rose,

Lorrie Cranor

ACM Transactions on Information and System Security (TISSEC), Volume 14, Issue 2

Article No.: 21, Pages 1 - 28

https://doi.org/10.1145/2019599.2019606

Published: 01 September 2011

Abstract

Phishing is a plague in cyberspace. Typically, phish detection methods either use human-verified URL blacklists or exploit Web page features via machine learning techniques. However, the former is frail in terms of new phish, and the latter suffers from the scarcity of effective features and the high false positive rate (FP). To alleviate those problems, we propose a layered anti-phishing solution that aims at (1) exploiting the expressiveness of a rich set of features with machine learning to achieve a high true positive rate (TP) on novel phish, and (2) limiting the FP to a low level via filtering algorithms.

Specifically, we propose CANTINA+, the most comprehensive feature-based approach in the literature, which includes eight novel features and exploits the HTML Document Object Model (DOM), search engines, and third-party services with machine learning techniques to detect phish. Moreover, we design two filters to help reduce FP and achieve runtime speedup. The first is a near-duplicate phish detector that uses hashing to catch highly similar phish. The second is a login form filter, which directly classifies Web pages with no identified login form as legitimate.
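The two filters can be illustrated with a minimal sketch. The abstract says only that the first filter "uses hashing" and that the second looks for a login form, so everything concrete here is an assumption made for illustration: SHA-1 over whitespace-stripped HTML as the fingerprint, "a `<form>` containing a password input" as the login-form test, the order in which the filters run, and the hypothetical names `page_fingerprint`, `has_login_form`, and `prefilter`. It is not the authors' implementation.

```python
import hashlib
import re
from html.parser import HTMLParser


def page_fingerprint(html: str) -> str:
    """SHA-1 of the HTML with all whitespace removed (assumed normalization)."""
    normalized = re.sub(r"\s+", "", html)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class _LoginFormFinder(HTMLParser):
    """Marks a page as having a login form if a <form> holds a password input."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            # Attribute values may be None (e.g. a bare `type`), hence the `or ""`.
            if (dict(attrs).get("type") or "").lower() == "password":
                self.found = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def has_login_form(html: str) -> bool:
    finder = _LoginFormFinder()
    finder.feed(html)
    return finder.found


def prefilter(html: str, known_phish_hashes: set) -> str:
    """Apply both filters before any feature-based classifier is consulted."""
    if page_fingerprint(html) in known_phish_hashes:
        return "phish"             # near-duplicate of a known phish
    if not has_login_form(html):
        return "legitimate"        # no login form identified
    return "needs_classifier"      # defer to the machine learning stage
```

Only pages that return `"needs_classifier"` would reach the feature-based classifier, which is how filters of this kind can both lower FP and save runtime.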

We extensively evaluated CANTINA+ with two methods on a diverse spectrum of corpora with 8118 phish and 4883 legitimate Web pages. In the randomized evaluation, CANTINA+ achieved over 92% TP on unique testing phish, over 99% TP on near-duplicate testing phish, and about 0.4% FP with 10% training phish. In the time-based evaluation, CANTINA+ also achieved over 92% TP on unique testing phish, over 99% TP on near-duplicate testing phish, and about 1.4% FP under 20% training phish with a two-week sliding window. Capable of achieving 0.4% FP and over 92% TP, CANTINA+ has been demonstrated to be a competitive anti-phishing solution.
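For readers unfamiliar with the two metrics, the sketch below shows how TP and FP rates are computed from confusion-matrix counts. The counts in the usage lines are made-up illustrative numbers chosen to land on 92% and 0.4%; they are not taken from the paper's experiments, and the function names are hypothetical.

```python
def true_positive_rate(tp: int, fn: int) -> float:
    """Fraction of actual phish that were flagged as phish."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of legitimate pages that were wrongly flagged as phish."""
    return fp / (fp + tn)


# Illustrative counts only (not from the paper):
# 1000 phish pages, 920 detected; 1000 legitimate pages, 4 wrongly flagged.
tpr = true_positive_rate(tp=920, fn=80)
fpr = false_positive_rate(fp=4, tn=996)
print(f"TP rate: {tpr:.1%}, FP rate: {fpr:.1%}")
```

Note that the two rates have different denominators: TP is measured over phish pages only, and FP over legitimate pages only.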



Index Terms

CANTINA+: A Feature-Rich Machine Learning Framework for Detecting Phishing Web Sites
1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
2. Information systems
  1. Information retrieval
    1. Retrieval models and ranking


Information & Contributors

Information

Published In


ACM Transactions on Information and System Security Volume 14, Issue 2

September 2011

199 pages

ISSN:1094-9224

EISSN:1557-7406

DOI:10.1145/2019599


Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 September 2011

Accepted: 01 May 2011

Revised: 01 December 2010

Received: 01 May 2010

Published in TISSEC Volume 14, Issue 2


Qualifiers

Research-article
Research
Refereed
