CANTINA+: A Feature-Rich Machine Learning Framework for Detecting Phishing Web Sites

Published: 01 September 2011

Abstract

Phishing is a plague in cyberspace. Typically, phish detection methods either use human-verified URL blacklists or exploit Web page features via machine learning techniques. However, the former is weak against new phish, and the latter suffers from a scarcity of effective features and a high false positive rate (FP). To alleviate those problems, we propose a layered anti-phishing solution that aims at (1) exploiting the expressiveness of a rich set of features with machine learning to achieve a high true positive rate (TP) on novel phish, and (2) limiting the FP to a low level via filtering algorithms.
Specifically, we propose CANTINA+, the most comprehensive feature-based approach in the literature, including eight novel features. It exploits the HTML Document Object Model (DOM), search engines, and third-party services with machine learning techniques to detect phish. Moreover, we design two filters to help reduce FP and achieve runtime speedup. The first is a near-duplicate phish detector that uses hashing to catch highly similar phish. The second is a login form filter, which directly classifies Web pages with no identified login form as legitimate.
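The layered design described above (hash-based near-duplicate detection, then a login form filter, then the machine learning classifier) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract only says "hashing", so SHA-1 over whitespace-stripped HTML is an assumed choice; treating "a form containing a password input" as a login form is an assumed heuristic; and the names `page_fingerprint`, `has_login_form`, `classify`, and the `ml_classifier` callback are hypothetical.

```python
import hashlib
import re
from html.parser import HTMLParser


def page_fingerprint(html: str) -> str:
    """Hash the page after stripping all whitespace (assumed normalization)."""
    normalized = re.sub(r"\s+", "", html)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class _PasswordFieldFinder(HTMLParser):
    """Records whether a password input appears inside a <form> element."""

    def __init__(self) -> None:
        super().__init__()
        self.form_depth = 0
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        elif tag == "input" and self.form_depth > 0:
            # Attribute values can be None (e.g. a bare `type` attribute).
            if (dict(attrs).get("type") or "").lower() == "password":
                self.found = True

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth > 0:
            self.form_depth -= 1


def has_login_form(html: str) -> bool:
    """Assumed heuristic: a login form is a form with a password field."""
    finder = _PasswordFieldFinder()
    finder.feed(html)
    return finder.found


def classify(html, known_phish_hashes, ml_classifier):
    """Apply the two filters, then fall back to the (hypothetical) classifier."""
    # Filter 1: a page whose fingerprint matches a known phish is a near-duplicate.
    if page_fingerprint(html) in known_phish_hashes:
        return "phish"
    # Filter 2: pages without an identified login form are treated as legitimate.
    if not has_login_form(html):
        return "legitimate"
    # Only the remaining pages reach the feature-based classifier.
    return ml_classifier(html)
```

Both filters run before the classifier, which is how they can cut runtime as well as false positives: a page settled by a filter never pays for feature extraction. The order of the two filters here is an assumption.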
We extensively evaluated CANTINA+ with two methods on a diverse spectrum of corpora with 8118 phish and 4883 legitimate Web pages. In the randomized evaluation, CANTINA+ achieved over 92% TP on unique testing phish, over 99% TP on near-duplicate testing phish, and about 0.4% FP with 10% training phish. In the time-based evaluation, CANTINA+ also achieved over 92% TP on unique testing phish, over 99% TP on near-duplicate testing phish, and about 1.4% FP under 20% training phish with a two-week sliding window. With about 0.4% FP and over 92% TP, CANTINA+ has been demonstrated to be a competitive anti-phishing solution.
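For readers unfamiliar with the two metrics reported above, the following sketch shows how a true positive rate and a false positive rate are conventionally computed from labelled predictions. The function name `tp_fp_rates` and the toy labels are illustrative assumptions; they are not data from the paper.

```python
def tp_fp_rates(y_true, y_pred):
    """Return (TP rate, FP rate), where label 1 = phish and 0 = legitimate."""
    pairs = list(zip(y_true, y_pred))
    n_phish = sum(1 for t, _ in pairs if t == 1)
    n_legit = len(pairs) - n_phish
    if n_phish == 0 or n_legit == 0:
        raise ValueError("need at least one phish and one legitimate page")
    # TP rate: share of actual phish that were flagged as phish.
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    # FP rate: share of legitimate pages that were wrongly flagged as phish.
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / n_phish, fp / n_legit


# Toy example: 4 phish (3 caught) and 5 legitimate pages (1 wrongly flagged).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1]
tp_rate, fp_rate = tp_fp_rates(y_true, y_pred)
print(f"TP rate = {tp_rate:.2f}, FP rate = {fp_rate:.2f}")
```

The two rates use different denominators: TP is measured over phish only and FP over legitimate pages only, so neither depends on the ratio of phish to legitimate pages in the test set.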

      Published In

      ACM Transactions on Information and System Security, Volume 14, Issue 2
      September 2011
      199 pages
      ISSN:1094-9224
      EISSN:1557-7406
      DOI:10.1145/2019599
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 01 September 2011
      Accepted: 01 May 2011
      Revised: 01 December 2010
      Received: 01 May 2010
      Published in TISSEC Volume 14, Issue 2

      Author Tags

      1. Anti-phishing
      2. information retrieval
      3. machine learning

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Article Metrics

      • Downloads (Last 12 months): 137
      • Downloads (Last 6 weeks): 21
      Reflects downloads up to 11 Dec 2024

      Cited By

      • (2025) Beyond the west: Revealing and bridging the gap between Western and Chinese phishing website detection. Computers & Security 148 (104115). DOI: 10.1016/j.cose.2024.104115. Online publication date: Jan-2025
      • (2024) AntiPhishStack: LSTM-Based Stacked Generalization Model for Optimized Phishing URL Detection. Symmetry 16:2 (248). DOI: 10.3390/sym16020248. Online publication date: 17-Feb-2024
      • (2024) PhiSN: Phishing URL Detection Using Segmentation and NLP Features. Journal of Information Processing 32 (973-989). DOI: 10.2197/ipsjjip.32.973. Online publication date: 2024
      • (2024) ML-Based Methods for Detecting Phishing Websites: A Comprehensive Survey and Analysis. SSRN Electronic Journal. DOI: 10.2139/ssrn.4829465. Online publication date: 2024
      • (2024) Understanding Characteristics of Phishing Reports from Experts and Non-Experts on Twitter. IEICE Transactions on Information and Systems E107.D:7 (807-824). DOI: 10.1587/transinf.2023EDP7221. Online publication date: 1-Jul-2024
      • (2024) CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain. ACM Transactions on Privacy and Security 27:2 (1-20). DOI: 10.1145/3652594. Online publication date: 15-Mar-2024
      • (2024) A Systematic Review of Social Engineering Attacks & Techniques: The Past, Present, and Future. 2024 International Conference on Science, Engineering and Business for Driving Sustainable Development Goals (SEB4SDG) (1-12). DOI: 10.1109/SEB4SDG60871.2024.10629836. Online publication date: 2-Apr-2024
      • (2024) On Phishing URL Detection Using Feature Extension. IEEE Internet of Things Journal 11:24 (39527-39536). DOI: 10.1109/JIOT.2024.3446894. Online publication date: 15-Dec-2024
      • (2024) A Method for Detecting Phishing Websites Based on Tiny-Bert Stacking. IEEE Internet of Things Journal 11:2 (2236-2243). DOI: 10.1109/JIOT.2023.3292171. Online publication date: 15-Jan-2024
      • (2024) Machine Learning-Based Phishing Website Detection A Comprehensive Approach for Cyber security. 2024 5th International Conference on Recent Trends in Computer Science and Technology (ICRTCST) (344-349). DOI: 10.1109/ICRTCST61793.2024.10578472. Online publication date: 9-Apr-2024
