JP2024537659A

JP2024537659A - Machine learning based system and method using URL feature hashing, HTML encoding, and content page embedded images to detect phishing websites

Info

Publication number: JP2024537659A
Application number: JP2024516418A
Authority: JP
Inventors: イーファリャオ，; アリアザラルーズ，; ナジェメミラミルカニ，; ジースー，
Original assignee: Netskope Inc
Current assignee: Netskope Inc
Priority date: 2021-09-14
Filing date: 2022-09-13
Publication date: 2024-10-16
Also published as: DE112022004398T5; WO2023043750A1

Abstract

A phishing classifier is disclosed for classifying URLs and content pages as phishing or not, comprising a URL feature hasher that parses and hashes URLs into feature hashes, and a headless browser that visits and internally renders the pages of the URLs, extracts HTML tokens, and captures an image of the rendering. Also disclosed is a phishing classifier for classifying URLs and content pages accessed via the URLs as phishing or not, comprising a URL feature hasher that parses and hashes URLs into feature hashes, and a headless browser that visits and internally renders the pages of the URLs, extracts words from the rendering, and captures an image of the pages. Classifying URLs and content pages accessed via the URLs as phishing or not is further disclosed. In addition to one or more of the disclosures, there is a phishing classification layer, a URL embedder, and an HTML encoder.
[Selection diagram] None

Description

CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to and the benefit of:

U.S. Application No. 17/475,236, entitled "A Machine Learning-Based System for Detecting Phishing Websites Using the URLs, Word Encodings and Images of Content Pages," filed September 14, 2021 (Attorney Docket No. NSKO1052-1), now U.S. Patent No. 11,444,978, issued September 13, 2022;

U.S. Application No. 17/475,233, entitled "Detecting Phishing Websites via a Machine Learning-Based System Using URL Feature Hashes, HTML Encodings and Embedded Images of Content Pages," filed September 14, 2021 (Attorney Docket No. NSKO1060-1), now U.S. Patent No. 11,336,689, issued May 17, 2022; and

U.S. Application No. 17/475,230, entitled "Machine Learning-Based Systems and Methods of Using URLs and HTML Encodings for Detecting Phishing Websites," filed September 14, 2021 (Attorney Docket No. NSKO1061-1), now U.S. Patent No. 11,438,377, issued September 6, 2022.

RELATED CASES
This application is also related to the following application, which is incorporated by reference for all purposes as if fully set forth herein:

U.S. Application No. 17/390,803, entitled "Preventing Cloud-Based Phishing Attacks Using Shared Documents with Malicious Links," filed July 30, 2021 (Attorney Docket No. 1037-2), which is a continuation of U.S. Application No. 17/154,978, entitled "Preventing Phishing Attacks Via Document Sharing," filed January 21, 2021 (Attorney Docket No. NSKO1037-1), now U.S. Patent No. 11,082,445, issued August 3, 2021.

INCORPORATION
The following materials are incorporated by reference in this application:
"KDE Hyper Parameter Determination," Yi Zhang et al., Netskope, Inc.;
U.S. Non-provisional Application No. 15/256,483, entitled "Machine Learning Based Anomaly Detection," filed September 2, 2016 (Attorney Docket No. NSKO1004-2) (now U.S. Patent No. 10,270,788, issued April 23, 2019);
U.S. Non-provisional Application No. 16/389,861, entitled "Machine Learning Based Anomaly Detection," filed April 19, 2019 (Attorney Docket No. NSKO1004-3) (now U.S. Patent No. 11,025,653, issued June 1, 2021);
U.S. Non-provisional Application No. 14/198,508, entitled "Security For Network Delivered Services," filed March 5, 2014 (Attorney Docket No. NSKO1000-3) (now U.S. Patent No. 9,270,765, issued February 23, 2016);
U.S. Non-provisional Application No. 15/368,240, entitled "Systems and Methods of Enforcing Multi-Part Policies on Data-Deficient Transactions of Cloud Computing Services," filed December 2, 2016 (Attorney Docket No. NSKO1003-2) (now U.S. Patent No. 10,826,940, issued November 3, 2020), and U.S. Provisional Application No. 62/307,305, entitled "Systems and Methods of Enforcing Multi-Part Policies on Data-Deficient Transactions of Cloud Computing Services," filed March 11, 2016 (Attorney Docket No. NSKO1003-1);
"Cloud Security for Dummies, Netskope Special Edition" by Cheng, Ithal, Narayanaswamy, and Malmskog, John Wiley & Sons, Inc., 2015;
"Netskope Introspection" by Netskope, Inc.;
"Data Loss Prevention and Monitoring in the Cloud" by Netskope, Inc.;
"The 5 Steps to Cloud Confidence" by Netskope, Inc.;
"Netskope Active Cloud DLP" by Netskope, Inc.;
"Repave the Cloud-Data Breach Collision Course" by Netskope, Inc.; and
"Netskope Cloud Confidence Index™" by Netskope, Inc.

The disclosed technology relates generally to cloud-based security, and more specifically to systems and methods for detecting phishing websites using the URLs, word encodings, and images of content pages. Also disclosed are methods and systems that use URL feature hashes, HTML encodings, and embedded images of content pages. The disclosed technology further relates to detecting phishing in real time via URL links and downloaded HTML, through machine learning and statistical analysis.
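As an illustration of what a "URL feature hash" can look like, the following is a minimal sketch of the generic hashing trick applied to a parsed URL. It is not taken from the disclosure: the tokenization scheme, the vector size `dim`, and the function names `url_tokens` and `feature_hash` are assumptions chosen for the example.

```python
import hashlib
from urllib.parse import urlsplit


def url_tokens(url: str) -> list[str]:
    """Parse a URL into coarse string features (hypothetical tokenization)."""
    parts = urlsplit(url)
    tokens = [f"scheme={parts.scheme}"]
    # One token per host label, path segment, and query parameter name.
    tokens += [f"host={label}" for label in (parts.hostname or "").split(".") if label]
    tokens += [f"path={seg}" for seg in parts.path.split("/") if seg]
    tokens += [f"query={pair.split('=')[0]}" for pair in parts.query.split("&") if pair]
    return tokens


def feature_hash(url: str, dim: int = 32) -> list[int]:
    """Hashing trick: count tokens in a fixed number of hash buckets.

    `dim` is an assumed size; a real system would pick it empirically.
    """
    vec = [0] * dim
    for tok in url_tokens(url):
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1
    return vec


if __name__ == "__main__":
    print(feature_hash("http://login.example.com/account/verify?id=1&next=home"))
```

The fixed-length output is what makes such a vector convenient as classifier input: URLs of any length and structure map to the same number of features, at the cost of occasional bucket collisions.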

The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section, or associated with the subject matter provided as background, should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.

Phishing, sometimes called spearhead phishing, is on the rise. National news has been punctuated by the misuse of documents obtained using passwords stolen through phishing. Typically, an email contains a legitimate-looking link that leads to a legitimate-looking page, where the user enters a password that is then compromised by the phishing attack. Clever phishing sites, like credit card skimmers or shims on gas pumps or ATMs, may forward the entered password to the real website and then step out of the way, so the user does not detect the password theft when it occurs. Work from home in recent years has led to a significant increase in phishing attacks.

The term phishing refers to a number of methods for fraudulently obtaining sensitive information over the web from unsuspecting users. Phishing arises, in part, from the use of increasingly sophisticated lures to fish for a company's confidential information. These methods are generally referred to as phishing attacks. Website users fall victim to phishing attacks when a rendered web page mimics the appearance of a legitimate login page. Victims of phishing attacks are lured to fraudulent websites, which results in the disclosure of sensitive information such as bank accounts, login passwords, and social security identities.

Recent data breach investigation reports show a growing trend of large-scale attacks based on social engineering. This may be attributed in part to the increasing difficulty of exploits, and in part to the use of advances in machine learning (ML) algorithms to prevent and detect such exploits. As a result, phishing attacks are becoming more frequent and more sophisticated. New defensive solutions are needed.

An opportunity arises to use ML/DL to classify a URL, and the content page accessed via the URL, as phishing or not phishing. An opportunity also emerges to classify a URL, and the HTML accessed and downloaded via the URL link, as phishing or not phishing in real time.

In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with emphasis instead generally being placed upon illustrating the principles of the disclosed technology. In the following description, various implementations of the disclosed technology are described with reference to the following drawings:

An architecture-level schematic of a system that classifies URLs, and content pages accessed via the URLs, as phishing or not phishing, in accordance with an implementation of the disclosed technology.
A high-level block diagram of the disclosed phishing detection engine, which uses ML/DL with encodings of URL feature hashes, encodings of natural language (NL) words, and encodings of captured website images to detect phishing sites.
An example ResNet residual CNN block diagram for image classification, for reference.
A high-level block diagram of the disclosed phishing detection engine, which uses ML/DL with URL feature hashes, encodings of the HTML extracted from content pages, and embeddings of the images captured from the content pages of example URLs to detect phishing sites, each example URL being accompanied by a ground truth classification as phishing or not phishing.
A block diagram of a reference residual neural network (ResNet), pre-trained for image classification before use in the phishing detection engine.
A high-level block diagram of the disclosed phishing detection engine 602, which uses ML/DL with a URL embedder and an HTML encoder.
Precision-recall graphs for multiple disclosed phishing detection systems.
A receiver operating characteristic (ROC) curve of the disclosed systems for phishing website detection described herein.
A receiver operating characteristic (ROC) curve for phishing website detection by the disclosed phishing detection engine that uses ML/DL with a URL embedder and an HTML encoder.
A block diagram of the functionality of a one-dimensional convolutional neural network (Conv1D) URL embedder that generates URL embeddings, using C++ code expressed in Open Neural Network Exchange (ONNX) format.
A block diagram of the functionality of the disclosed HTML encoder, which produces the HTML encodings that are input to the phishing classifier layer.
An overview block diagram of the disclosed HTML encoder, which produces the HTML encodings that are input to the phishing classifier layer.
A first section of a computational dataflow graph of the functionality of the HTML encoder, which, using C++ code expressed in ONNX format, produces the HTML encodings that are input to phishing classifier layer 675. The section is shown in two columns separated by a dotted line, with connectors at the bottom of the left column flowing into the top of the right column, and illustrates the input encoding and positional embedding.
A second section of the same dataflow graph, illustrating a single iteration of multi-head attention as an example of a dataflow graph with compute nodes that asynchronously transmit data along data connections.
A third section of the same dataflow graph, illustrated in three columns separated by dotted lines, showing the add, normalize, and feedforward functionality using ONNX operations.
A computational dataflow graph of the functionality of the disclosed phishing classifier layer, which, using code expressed in ONNX format, generates one or more likelihood scores that signal how likely it is that a particular website is a phishing website.
A simplified block diagram of a computer system that can be used to classify a URL, and a content page accessed via the URL, as phishing or not phishing, in accordance with an implementation of the disclosed technology.

The following detailed description is made with reference to the figures. Sample implementations are described to illustrate the disclosed technology, not to limit its scope, which is defined by the claims. The discussion is presented to enable any person skilled in the art to make and use the disclosed technology, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the disclosed technology. Thus, the disclosed technology is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

The problem addressed by the disclosed technology is the detection of phishing websites. Security forces attempt to catalog phishing campaigns as they arise. Security vendors depend on lists of phishing websites to power their security engines. Both proprietary and open sources that catalog phishing links are available. Two open source, community examples of phishing Uniform Resource Locator (URL) lists are PhishTank and OpenPhish. Security forces use the lists to analyze malicious links and to generate signatures from malicious URLs. The signatures are used to detect malicious links, typically by matching part or all of a URL or a compact hash of it. Generalization from signatures has been the main approach to stopping zero-day phishing attacks that hackers can use to attack systems. Zero-day refers to a recently discovered security vulnerability that the vendor or developer has only just learned of and has had zero days to fix.
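The signature matching described above can be sketched as follows. This is a minimal illustration, not any vendor's actual scheme: the normalization rule, the truncated SHA-256 "compact hash", and the names `normalize`, `signature`, and `SignatureList` are all assumptions made for the example.

```python
import hashlib
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to lowercase host plus path (assumed normalization)."""
    parts = urlsplit(url.strip())
    return (parts.hostname or "").lower() + (parts.path or "/")


def signature(url: str) -> str:
    """Compact hash of the normalized URL (truncated SHA-256, an assumption)."""
    return hashlib.sha256(normalize(url).encode("utf-8")).hexdigest()[:16]


class SignatureList:
    """Matches a URL against known-bad URLs, in whole or in part."""

    def __init__(self, known_bad_urls):
        urls = list(known_bad_urls)
        # Whole-URL signatures, stored as compact hashes.
        self._full = {signature(u) for u in urls}
        # Partial signatures: here, just the host of each known-bad URL.
        self._hosts = {(urlsplit(u).hostname or "").lower() for u in urls}
        self._hosts.discard("")

    def matches(self, url: str) -> bool:
        host = (urlsplit(url).hostname or "").lower()
        return signature(url) in self._full or host in self._hosts


if __name__ == "__main__":
    sigs = SignatureList(["http://bad.example.test/login"])
    print(sigs.matches("http://bad.example.test/login?session=1"))
    print(sigs.matches("http://bad-example.test/login"))
```

The second lookup in the usage example shows the limitation that motivates a learned classifier: a newly registered look-alike host matches no stored signature, even though it differs from a listed host by one character.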

To keep the phishers from being caught, a phishing campaign may end before the website behind the phishing link can be analyzed. A phisher can take a website down as soon as security forces place it on a list. Analysis of collected URLs is more reliably persistent than following malicious URLs to active phishing sites, because the sites disappear as suddenly as they appear. In part because of disappearing sites, the state of the art has been to analyze URLs.

The disclosed technology applies machine learning/deep learning (ML/DL) to phishing detection with a very low false positive rate and good recall. Three transfer learning techniques are presented, based on text/image analysis and on HTML analysis.

In the first technique, we use transfer learning by taking advantage of recent deep learning architectures for multilingual natural language understanding and computer vision, in order to embed the textual and visual content of web pages. This first generation of applying ML/DL to phishing detection uses concatenated embeddings of web page text and web page images. We draw on transfer learning from general training of text and image embeddings to train a detection classifier that uses the encoder functions of models such as Bidirectional Encoder Representations from Transformers (BERT) and the decoder function of a residual neural network (ResNet). Because such models are trained on massive amounts of data, their final layers serve as reliable encodings of the visual and textual content of web pages. Care is taken to reduce false positives, because benign, non-phishing links are far more abundant than phishing sites, and blocking non-phishing links is a nuisance.

The second technique for applying ML/DL to phishing detection creates a new encoder-decoder pair that, counterintuitively, decodes an embedding of HTML to replicate the rendering by a browser to a display. The embedding is, of course, lossy, and the decoding is much less accurate than what a browser achieves. The encoder-decoder approach to embedding HTML code facilitates transfer learning: once the encoder has been trained to embed the HTML, a classifier replaces the decoder. Transfer learning based on the embeddings is practical with a relatively small training corpus. To date, as few as 20k or 40k examples of phishing pages have proven sufficient to train a classifier of two fully connected layers that processes the embeddings. The second-generation embeddings of HTML can be enhanced by concatenating other embeddings, such as a ResNet image embedding, a URL feature embedding, or both.
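The idea of a small classifier head that replaces the decoder and consumes concatenated embeddings can be sketched as a forward pass through two fully connected layers. This is an illustrative sketch only: the layer sizes, the ReLU and sigmoid activations, the random placeholder weights, and the class name `PhishingHead` are assumptions, and training is omitted entirely.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one weighted sum per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_layer(rng, n_in, n_out):
    """Random placeholder weights; a real head would be trained on labeled pages."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class PhishingHead:
    """Two fully connected layers over concatenated embeddings (hypothetical sizes)."""

    def __init__(self, n_in, n_hidden=8, seed=0):
        rng = random.Random(seed)
        self.n_in = n_in
        self.w1, self.b1 = init_layer(rng, n_in, n_hidden)
        self.w2, self.b2 = init_layer(rng, n_hidden, 1)

    def score(self, html_emb, image_emb=(), url_emb=()):
        # Concatenate the HTML embedding with any optional extra embeddings.
        x = list(html_emb) + list(image_emb) + list(url_emb)
        if len(x) != self.n_in:
            raise ValueError(f"expected {self.n_in} features, got {len(x)}")
        hidden = relu(dense(x, self.w1, self.b1))
        return sigmoid(dense(hidden, self.w2, self.b2)[0])


if __name__ == "__main__":
    head = PhishingHead(n_in=6)
    print(head.score([0.1, 0.2, 0.3], image_emb=[0.4, 0.5], url_emb=[0.6]))
```

Because only the head's weights would need training, the number of trainable parameters stays small, which is consistent with the text's observation that a modest corpus of labeled pages can suffice.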

However, the volume of new URLs can hinder real-time detection that relies on web page content, because of the high computational complexity of deep learning architectures and the time needed to render and parse the content of web pages.

The third generation of applying ML/DL to phishing detection uses a URL embedder, an HTML encoder, and a phishing classifier layer to classify a URL, and the content page accessed via the URL, as phishing or not phishing, and can react in real time when a malicious web page is detected. This third technique filters suspicious URLs effectively, using a faster trained model without the need to visit the website. Suspicious URLs can also be routed afterward to the first or second technique for final detection.
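The routing described above, in which a fast model filters URLs and only uncertain cases go on to a slower content-based model, can be sketched as a two-stage cascade. The thresholds `low` and `high`, the 0.5 cutoff for the second stage, and the `Verdict` structure are assumptions for illustration; the models are passed in as plain callables so that the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    url: str
    label: str  # "phishing" or "benign"
    stage: str  # which model made the decision: "fast" or "slow"


def classify(
    url: str,
    html: str,
    fast_model: Callable[[str, str], float],
    slow_model: Callable[[str], float],
    low: float = 0.1,   # assumed threshold below which the fast score is trusted as benign
    high: float = 0.9,  # assumed threshold at or above which it is trusted as phishing
) -> Verdict:
    """Decide in real time when the fast score is confident; otherwise defer."""
    fast_score = fast_model(url, html)
    if fast_score < low:
        return Verdict(url, "benign", "fast")
    if fast_score >= high:
        return Verdict(url, "phishing", "fast")
    # Suspicious but uncertain: route to the slower model, which may render the page.
    slow_score = slow_model(url)
    return Verdict(url, "phishing" if slow_score >= 0.5 else "benign", "slow")


if __name__ == "__main__":
    # Stub models standing in for the trained classifiers.
    verdict = classify(
        "http://login.example.test/",
        "<html></html>",
        fast_model=lambda url, html: 0.5,
        slow_model=lambda url: 0.8,
    )
    print(verdict)
```

Choosing `low` and `high` trades latency against accuracy: a wider uncertain band sends more traffic to the slower stage, while a narrower band leaves more decisions to the fast model alone.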

An example system for detecting phishing via URL links and downloaded HTML, in offline mode and in real time, is described next.

Architecture
FIG. 1 shows an architecture-level schematic of a system 100 for detecting phishing via URL links and downloaded HTML. System 100 also includes functionality for detecting phishing via redirected or obscured URL links and downloaded HTML in real time. Because FIG. 1 is an architectural diagram, certain details are intentionally omitted to improve the clarity of the description. The discussion of FIG. 1 is organized as follows. First, the elements of the figure are described, followed by their interconnections. Then, the use of the elements in the system is described in greater detail.

FIG. 1 shows system 100, which includes user endpoints 166. User endpoints 166 may include devices such as a computer 174, a smartphone 176, and a computer tablet 178, which provide access to and interaction with data stored on cloud-based store 136 and cloud-based services 138. In another organization network, organization users may utilize additional devices. An inline proxy 144 is interposed between user endpoints 166 and cloud-based services 138 through network 155, and in particular through a network security system 112 that includes a network administrator 122, network policies 132, an evaluation engine 152, and a data store 164. Inline proxy 144 is accessible through network 155 as part of network security system 112. Inline proxy 144 provides monitoring and control of traffic between user endpoints 166, cloud-based store 136, and other cloud-based services 138.
Inline proxy 144 has an active scanner 154 that collects HTML and snapshots of web pages and stores the data set in data store 164. When features can be extracted from the traffic in real time and snapshots are not collected from live traffic, as in the third-generation system for applying ML/DL to phishing detection, active scanner 154 is not needed for crawling the web page content at the URLs. The three ML/DL systems for detecting phishing websites are described in detail below. Inline proxy 144 monitors network traffic between user endpoints 166 and cloud-based services 138, in particular to enforce network security policies, including data loss prevention (DLP) policies and protocols. Evaluation engine 152 checks the database records of URLs deemed malicious by the disclosed detection of phishing websites, and these phishing URLs are automatically and permanently blocked.

To detect phishing in real time via URL links and downloaded HTML, inline proxy 144, positioned between user endpoints 166 and the cloud-based storage platform, inspects incoming traffic and forwards it to a phishing detection engine 202, 404, or 602, described below. Inline proxy 144 can be configured to sandbox the content that corresponds to a link and to inspect or explore the link, to make sure that the page the URL points to is safe before allowing the user to access the page through the proxy. Links identified as malicious can then be quarantined and inspected for threats using known techniques, including secure sandboxing.

Continuing with the description of FIG. 1, cloud-based services 138 include cloud-based hosting services, web email services, video, messaging, and voice call services, streaming services, file transfer services, and cloud-based storage services. Network security system 112 connects to user endpoints 166 and cloud-based services 138 via public network 155. Data store 164 stores lists of malicious links and signatures from malicious URLs.
The signatures are used to detect malicious links, typically by matching part or all of a URL or a compact hash of it. Data store 164 stores information from one or more tenants in tables of a common database image to form an on-demand database service (ODDS), which can be implemented in many ways, such as a multi-tenant database system (MTDS). A database image can include one or more database objects. In other implementations, the database can be a relational database management system (RDBMS), an object-oriented database management system (OODBMS), a distributed file system (DFS), a no-schema database, or any other data storage system or computing device. In some implementations, the gathered metadata is processed and/or normalized. In some instances, the metadata includes structured data, and functionality targets specific data constructs provided by cloud-based services 138. Unstructured data, such as free text, can also be provided by, and targeted back to, cloud-based services 138. Both structured and unstructured data can be stored in a semi-structured data format such as JSON (JavaScript Object Notation), BSON (Binary JSON), XML, Protobuf, Avro, or Thrift objects, which consists of string fields (or columns) and corresponding values of potentially different types, such as numbers, strings, arrays, and objects. In other implementations, JSON objects can be nested and the fields can be multi-valued, for example arrays or nested arrays. These JSON objects are stored in a schema-less or NoSQL key-value metadata store 178, such as Apache Cassandra™, Google's Bigtable™, HBase™, Voldemort™, CouchDB™, MongoDB™, Redis™, Riak™, or Neo4j™, which stores the parsed JSON objects using keyspaces that are equivalent to a database in SQL. Each keyspace is divided into column families, which are similar to tables and comprise rows and sets of columns.
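The storage layout described above, in which parsed JSON objects sit in a keyspace that is divided into column families of rows and columns, can be pictured with a small in-memory sketch. It does not use or imitate the API of any of the named databases: the `Keyspace` class, the column family name `url_metadata`, and the record fields are hypothetical.

```python
import json
from collections import defaultdict


class Keyspace:
    """In-memory stand-in for a keyspace: column family -> row key -> columns."""

    def __init__(self):
        self._families = defaultdict(dict)

    def put(self, family: str, row_key: str, document: str) -> None:
        """Parse a JSON object and store its top-level fields as columns."""
        columns = json.loads(document)
        if not isinstance(columns, dict):
            raise ValueError("expected a JSON object at the top level")
        self._families[family][row_key] = columns

    def get(self, family: str, row_key: str, column: str):
        return self._families[family][row_key][column]


if __name__ == "__main__":
    store = Keyspace()
    # Values have different types: string, number, array, and nested object.
    record = json.dumps({
        "url": "http://bad.example.test/login",
        "score": 0.97,
        "tags": ["phishing", "credential-form"],
        "source": {"list": "example-feed", "first_seen": "2021-09-14"},
    })
    store.put("url_metadata", "row-0001", record)
    print(store.get("url_metadata", "row-0001", "tags"))
    print(store.get("url_metadata", "row-0001", "source")["list"])
```

Because no schema is declared up front, a later record in the same column family could carry different or additional fields, which is the flexibility the semi-structured formats are chosen for.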

Continuing further with the description of FIG. 1, the system 100 can include any number of cloud-based services 138: point-to-point streaming services, hosted services, cloud applications, cloud stores, cloud collaboration and messaging platforms, and cloud customer relationship management (CRM) platforms. The services can include peer-to-peer file sharing (P2P) via protocols for portal traffic such as BitTorrent (BT), User Datagram Protocol (UDP) streaming, and File Transfer Protocol (FTP), as well as voice, video, and messaging multimedia communication sessions such as instant messaging over Internet Protocol (IP) and mobile phone calling over LTE (VoLTE) via the Session Initiation Protocol (SIP) and Skype. The services can handle Internet traffic, cloud application data, and generic routing encapsulation (GRE) data. A network service or application can be web-based (e.g., accessed via a uniform resource locator (URL)) or native, such as a sync client. Examples include software-as-a-service (SaaS), platform-as-a-service (PaaS), and infrastructure-as-a-service (IaaS) offerings, as well as internal enterprise applications exposed via URLs. Examples of common cloud-based services today include Salesforce.com™, Box™, Dropbox™, Google Apps™, Amazon AWS™, Microsoft Office 365™, Workday™, Oracle on Demand™, Taleo™, Yammer™, Jive™, and Concur™.

In interconnecting the elements of the system 100, the network 155 communicatively couples computers, tablets and mobile devices, cloud-based hosting services, web email services, video, messaging, and voice calling services, streaming services, file transfer services, cloud-based storage services 136, and the network security system 112. The communication paths can be point-to-point over public and/or private networks. Communication can occur over a variety of networks, e.g., private networks, VPNs, MPLS circuits, or the Internet, and can use appropriate application program interfaces (APIs) and data interchange formats, e.g., REST, JSON, XML, SOAP, and/or JMS. All communications can be encrypted. This communication is generally over a network such as a local area network (LAN), a wide area network (WAN), a telephone network (public switched telephone network (PSTN)), a Session Initiation Protocol (SIP) network, a wireless network, a point-to-point network, a star network, a token ring network, a hub network, or the Internet, including the mobile Internet, via protocols such as EDGE, 3G, 4G LTE, Wi-Fi, and WiMAX. In addition, a variety of authorization and authentication techniques, such as username/password, OAuth, Kerberos, SecureID, and digital certificates, can be used to secure the communications.

Continuing further with the description of the system architecture of FIG. 1, the network security system 112 includes the data store 164, which can include one or more computers and computer systems coupled in communication with one another. They can also be one or more virtual computing and/or storage resources. For example, the network security system 112 can be one or more Amazon EC2 instances, and the data store 164 can be Amazon S3™ storage. Rather than implementing the network security system 112 on direct physical computers or traditional virtual machines, other computing-as-a-service platforms can be used, such as Rackspace, or Heroku or Force.com from Salesforce. In addition, one or more engines can be used and one or more points of presence (POPs) can be established to implement the security functions. The engines or system components of FIG. 1 are implemented by software running on various types of computing devices. Example devices are a workstation, a server, a computing cluster, a blade server, and a server farm, or any other data processing system or computing device. An engine can be communicatively coupled to the databases via a different network connection.

While the system 100 is described herein with reference to particular blocks, it should be understood that the blocks are defined for convenience of description and are not intended to require a particular physical arrangement of component parts. Further, the blocks need not correspond to physically distinct components. To the extent that physically distinct components are used, connections between the components can be wired and/or wireless as desired. Different elements or components can be combined into single software modules, and multiple software modules can run on the same processor.

Despite the best attempts of malicious actors, the content and appearance of phishing websites provide features that the disclosed deep learning models can exploit to detect phishing websites reliably. In the disclosed systems described next, we use transfer learning, leveraging recent deep learning architectures for multilingual natural language understanding and computer vision to embed the textual and visual content of web pages.

FIG. 2 illustrates a high-level block diagram 200 of the disclosed phishing detection engine 202, which utilizes ML/DL with URL feature hashes, encodings of natural language (NL) words, and embeddings of captured website images to detect phishing sites. The disclosed phishing classifier layer 275 generates a likelihood score 285 that represents how likely it is that a particular website is a phishing website. In one embodiment, the phishing detection engine 202 utilizes a multilingual Bidirectional Encoder Representations from Transformers (BERT) model, which supports over 100 languages, as the encoder 264, and a residual neural network for images (ResNet50) as the embedder 256. The URL feature hashes 242, word encodings 265, and image embeddings 257 are then passed to the neural network phishing classifier layer 275 for final training and inference, as described below.

An encoder can be trained by pairing it with a decoder. The encoder and decoder can be trained to compress the input into an embedding space and then to reconstruct the input from the embedding. Once the encoder is trained, it can be reused as described herein. The phishing classifier layer 275 utilizes URL feature hashes 242 of the URL n-grams, word encodings 265 of words extracted from the content page, and image embeddings 257 of an image captured from the content page 216 at the web address of the URL 214.

In one embodiment, the phishing detection engine 202 utilizes feature hashing of the content of web pages and of the security information present in the response headers to complement the features available in both benign and phishing web pages. The content is expressed in JavaScript in one implementation. In another embodiment, a different language, such as Python, can be used. The URL feature hasher 222 receives the URL 214, parses the URL into features, and hashes the features to generate the URL feature hashes 242, which results in dimensionality reduction of the URL n-grams. An example of the domain features for a URL with header and security information is listed next.
{
  "scanned_url": [
    "http://alfabeek.com/"
  ],
  "header": {
    "date": "Tue, 02 Mar 2021 15:30:27 GMT",
    "server": "Apache",
    "last-modified": "Tue, 08 Sep 2020 02:09:49 GMT",
    "accept-ranges": "bytes",
    "vary": "Accept-Encoding",
    "content-encoding": "gzip",
    "content-length": "23859",
    "content-type": "text/html"
  },
  "security_info": [
    {
      "_subjectName": "alfabeek.com",
      "_issuer": "Sectigo RSA Domain Validation Secure Server CA",
      "_validFrom": 1583107200,
      "_validTo": 1614729599,
      "_protocol": "TLS 1.3",
      "_sanList": [
        "alfabeek.com",
        "www.alfabeek.com"
      ]
    }
  ]
}
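The URL feature hashing step described above can be illustrated with a short sketch. The text does not specify the n-gram length or the hash function, so the character trigrams, the SHA-256 based bucket index, and the function names below are assumptions chosen only to show how feature hashing maps a variable number of n-grams into a fixed-size vector (1024 buckets here, matching the size given elsewhere in the description).

```python
import hashlib

def char_ngrams(text: str, n: int = 3) -> list:
    """Return all overlapping character n-grams of the text."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]

def hash_features(features: list, size: int = 1024) -> list:
    """Feature hashing: count each feature in the bucket chosen by its hash."""
    vector = [0] * size
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % size
        vector[index] += 1
    return vector

# Example: hash the trigrams of the URL from the listing above.
url = "http://alfabeek.com/"
url_hash = hash_features(char_ngrams(url, n=3), size=1024)
```

The output length is fixed regardless of how long the URL is, which is the dimensionality reduction the paragraph refers to; distinct n-grams can collide in one bucket, which is the usual trade-off of the technique.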

Continuing with the description of FIG. 2, the headless browser 226 is configured to access the content at a URL, internally render the content page, extract words from the rendering of the content page, and capture an image of at least part of the rendering. The headless browser 226 receives the URL 214, which is the web address of the content page 216, and extracts words from the content page 216. The headless browser 226 provides the extracted words 246 to the natural language encoder 264, which generates an encoding from the extracted words: the word encodings 265 of block diagram 200. The natural language (NL) encoder 264 is pre-trained on natural language and generates an encoding of the words extracted from the content page. In an exemplary embodiment, the encoder 264 utilizes BERT, a standard encoder for natural language. An encoder embeds the input it processes in a relatively low-dimensional embedding space. BERT embeds a natural language passage in an embedding space of 400 to 800 dimensions. The Transformer logic accepts natural language input and, in one example, generates a 768-dimensional vector that encodes and embeds the input. The dashed outline of the decoder 266 block distinguishes it as pre-trained; that is, BERT is trained before it is used to detect phishing for the URL 214. The encoder 264 generates the word encodings 265 of the words extracted from the content page being screened, for use by the phishing classifier layer 275 in detecting phishing. Different implementations can utilize a different ML/DL encoder, such as the Universal Sentence Encoder. Other embodiments can utilize a long short-term memory (LSTM) model.

Continuing further with the description of FIG. 2, the headless browser 226 receives the URL 214, which is the web address of the content page 216, and captures an image of the web page by mimicking a real user visiting the page and taking a snapshot of the rendered page. The headless browser 226 takes the snapshot and provides the captured image 248 to the image embedder 256, which is pre-trained on images and generates an embedding of the image captured from the content page. Image embedding can increase efficiency and improve phishing detection for obfuscated cases. The embedder 256 encodes the captured image 248 as the image embedding 257. In one embodiment, the embedder 256 utilizes a residual neural network (ResNet50), a standard embedder, with a classifier 258 pre-trained on images. Different implementations can utilize a different pre-trained ML/DL image embedder, such as Inception-v3, VGG-16, ResNet34, or ResNet-101. Continuing with the exemplary embodiment, ResNet50 embeds an image, such as an RGB 224x224 pixel image, and generates a 2048-dimensional embedding vector that maps the image into an embedding space. The embedding space is much more compact than the original input. The pre-trained ResNet50 embedder 256 generates the image embedding 257 of the snapshot of the content page being screened, which is used to detect phishing websites.

The phishing classifier layer 275 of the disclosed phishing detection engine 202 is trained on URL feature hashes, encodings of words extracted from content pages, and embeddings of images captured from the content pages of example URLs, with each example URL accompanied by a ground truth classification as phishing or not phishing. The phishing classifier layer 275 processes the URL feature hashes, word encodings, and image embeddings to generate at least one likelihood score that the URL and the content accessed via the URL present a phishing risk. The likelihood score 285 represents how likely it is that a particular website is a phishing website. In one embodiment, the input size to the phishing classifier layer 275 is 2048+768+1024, where the output of BERT is 768, the ResNet50 embedding size is 2048, and the size of the feature hash over the n-grams of the URL is 1024. The phishing detection engine 202 is well suited for semantically meaningful detection of phishing websites, regardless of the language of the website. The disclosed near-real-time crawling pipeline quickly captures the content of new suspicious web pages before they are invalidated, thus addressing the short-lifecycle nature of phishing attacks, and this helps accumulate larger training datasets for continuous retraining of the given deep learning architecture.
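The input-size arithmetic above (2048 + 768 + 1024 = 3840) can be made concrete with a small sketch. The concatenation order, the helper names, and the single logistic unit standing in for the trained classifier layer are assumptions made for illustration; the description does not give the internal structure of the classifier layer, and the random vectors below are placeholders rather than real embeddings.

```python
import math
import random

IMAGE_EMB_DIM, WORD_ENC_DIM, URL_HASH_DIM = 2048, 768, 1024

def concat_features(image_emb, word_enc, url_hash):
    """Concatenate the three feature vectors after checking their sizes."""
    expected = (IMAGE_EMB_DIM, WORD_ENC_DIM, URL_HASH_DIM)
    actual = (len(image_emb), len(word_enc), len(url_hash))
    if actual != expected:
        raise ValueError(f"expected sizes {expected}, got {actual}")
    return list(image_emb) + list(word_enc) + list(url_hash)

def likelihood_score(features, weights, bias=0.0):
    """A single logistic unit (a stand-in for the trained classifier layer)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder inputs; a real system would use the embedder, encoder and hasher outputs.
rng = random.Random(0)
features = concat_features(
    [rng.random() for _ in range(IMAGE_EMB_DIM)],
    [rng.random() for _ in range(WORD_ENC_DIM)],
    [0.0] * URL_HASH_DIM,
)
weights = [rng.uniform(-0.01, 0.01) for _ in features]
score = likelihood_score(features, weights)
```

The sigmoid keeps the output in the open interval (0, 1), so it can be read as a likelihood score in the sense used above.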

FIG. 3 illustrates a block diagram of a reference Bidirectional Encoder Representations from Transformers (BERT) model that can be utilized for natural language classification of the words extracted from a web content page, as described in connection with the block diagram shown in FIG. 2 above.

A second system for applying ML/DL to phishing detection utilizes image transfer learning and also uses generative pre-training (GPT) to learn HTML embeddings. This addresses the problem of having a limited phishing dataset and also provides a better representation of the HTML content. Unlike the first approach, there is no need for BERT text encoding. The HTML embedding network learns to represent the entire multimodal content of the HTML (text, JS, CSS, etc.) by a vector of 256 numbers. The theoretical foundation of this HTML embedding network is inspired by OpenAI's "Generative Pretraining From Pixels," published in the Proceedings of the 37th International Conference on Machine Learning, PMLR 119:1691-1703, 2020.

FIG. 4 illustrates a high-level block diagram 400 of the disclosed phishing detection engine 402, which utilizes ML/DL with URL feature hashes, encodings of HTML extracted from content pages, and embeddings of images captured from the content pages of example URLs to detect phishing sites, with each example URL accompanied by a ground truth classification as phishing or not phishing. The disclosed phishing classifier layer 475 generates at least one likelihood score 485 that the URL and the content accessed via the URL present a phishing risk.

The phishing detection engine 402 uses the URL feature hasher 422, which parses the URL 414 into features and hashes the features to generate the URL feature hashes 442, resulting in dimensionality reduction of the URL n-grams. An example of the domain features for a URL with header and security information is listed above.

The headless browser 426 extracts the HTML tokens 446 and provides them to the HTML encoder 464. The phishing detection engine 402 utilizes the disclosed HTML encoder 464, which is trained on HTML tokens 446 that are extracted from the content pages 416 of example URLs, encoded, and then decoded to reproduce the images captured from the renderings of those content pages. Dashed lines, including the dashed block outline of the generative training decoder 466, which generates a rendered image of the page from the encoder embedding, and the dashed line from the captured image 448, distinguish the training path from the later processing of active URLs. Learning the data distribution P(X) is very beneficial for the subsequent supervised modeling of P(Y|X), where Y is the binary class of phishing versus non-phishing and X is the HTML content. The HTML encoder 464 is pre-trained using generative pre-training (GPT), in which unsupervised pre-training over a large volume of unlabeled data is used to learn the data distribution P(X) for subsequent supervised decision making with P(Y|X). Once the HTML encoder 464 is trained, it can be reused. The HTML encoder 464 generates the HTML encoding 465 of the HTML tokens 446 extracted from the content page 416.

The HTML is tokenized based on rules, and the HTML tokens 446 are passed to the HTML encoder 464. Examples of community phishing URL lists, which are open-source collaborative clearinghouses for data and information about phishing on the Internet, include PhishTank, OpenPhish, MalwarePatrol, and Kaspersky; these serve as sources of HTML files that have been identified as phishing websites. Negative samples, which do not contain phishing, balance the dataset in a proportion that represents current trends in phishing websites. The HTML encoder 464 is trained using a large unlabeled dataset of HTML and page snapshots gathered by the in-house active scanner 154. Because website users fall victim to attacks particularly when a malicious rendered page mimics the look of a legitimate login page, the training objective forces the HTML encoder to learn to represent HTML content with respect to the rendered image of the page.

For training, the HTML encoder 464 is initialized with random initial parameters. In one embodiment, 700K HTML files in the data store 164 were scanned, and the resulting extraction of the top 10K tokens representing content pages was used to configure the phishing detection engine 402 for classification that does not suffer from a significant number of false positive results. In one example content page, 800 valid tokens were extracted. In another example, 2K valid tokens were recognized, and in a third example, approximately 1K tokens were collected.

In another embodiment, an HTML parser can be used to extract the HTML tokens from a content page accessed via the URL. Both the headless browser and the HTML parser can be configured to extract HTML tokens that belong to a predetermined token vocabulary and to ignore portions of the content that do not belong to that vocabulary. In one embodiment, the phishing detection engine 402 includes a headless browser configured to extract up to 64 HTML tokens for generating the HTML encoding. The value 64 is a specific, configurable system parameter. The extraction can take up to 10 milliseconds to render in some cases. In other embodiments, the headless browser can be configured to extract up to 128, 256, 1024, or 4096 HTML tokens for generating the HTML encoding. Using more tokens slows training. Implementations with up to 2K tokens have been achieved. Training can be used to learn which sequential patterns of HTML tokens give rise to a particular page view. Mathematical approximation can then be used to learn which distribution of tokens to use later for better classification.
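The vocabulary-restricted, length-capped token extraction described above can be sketched with Python's standard `html.parser` module. The tokenization rules of the disclosed system are not given in the text, so the token format (tags written as `<tag>` and `</tag>`, text split on whitespace and lowercased), the tiny vocabulary, and the names `TokenExtractor` and `extract_tokens` are assumptions for illustration only.

```python
from html.parser import HTMLParser

class TokenExtractor(HTMLParser):
    """Collect tag and word tokens that belong to a fixed vocabulary."""

    def __init__(self, vocabulary, max_tokens=64):
        super().__init__()
        self.vocabulary = vocabulary
        self.max_tokens = max_tokens
        self.tokens = []

    def _add(self, token):
        # Ignore out-of-vocabulary content and stop once the cap is reached.
        if token in self.vocabulary and len(self.tokens) < self.max_tokens:
            self.tokens.append(token)

    def handle_starttag(self, tag, attrs):
        self._add(f"<{tag}>")

    def handle_endtag(self, tag):
        self._add(f"</{tag}>")

    def handle_data(self, data):
        for word in data.lower().split():
            self._add(word)

# Hypothetical vocabulary; a real one would be derived from a large HTML corpus.
VOCAB = {"<form>", "</form>", "<input>", "password", "login"}

def extract_tokens(html, vocabulary=VOCAB, max_tokens=64):
    """Return at most max_tokens in-vocabulary tokens, in document order."""
    parser = TokenExtractor(vocabulary, max_tokens)
    parser.feed(html)
    parser.close()
    return parser.tokens
```

For example, `extract_tokens("<form>Login <input type='password'></form>")` keeps the form and input tags and the word "login" while dropping everything outside the vocabulary; raising `max_tokens` trades longer sequences for slower training, as noted above.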

Continuing with the description of FIG. 4, an image embedder pre-trained on images generates an image embedding of the image captured from the content page. The pre-trained embedder coefficients enable cost-effective embedding in near real time. The headless browser 426 is configured to access the content at a URL and internally render the content page. The headless browser 426 receives the URL 414, which is the web address of the content page 416, and captures an image of the web page by mimicking a real user visiting the page and taking a snapshot of the rendered page. The headless browser 426 takes the snapshot and provides the captured image 448 to the rendered image embedder 456, which is pre-trained on images and generates an embedding of the image captured from the content page. Image embedding can increase efficiency and improve phishing detection, which is particularly useful for obfuscated cases. The embedder 456 encodes the captured image 448 as the image embedding 457. In one embodiment, the rendered image embedder 456 utilizes a residual neural network (ResNet50), a standard embedder, with a classifier 458 pre-trained on images. Different implementations can utilize a different pre-trained ML/DL image embedder, such as Inception-v3, VGG-16, ResNet34, or ResNet-101. Continuing with the exemplary embodiment, ResNet50 embeds an image, such as an RGB 224x224 pixel image, and generates a 2048-dimensional embedding vector that maps the image into an embedding space. The embedding space is much more compact than the original input. The pre-trained ResNet50 embedder 456 generates the image embedding 457 of the image captured from the content page, for use in detecting phishing websites. The URL feature hashes 442, HTML encoding 465, and image embedding 457 are passed to the neural network phishing classifier layer 475 for final training and inference, as described below. In one embodiment, the input size of the final classifier is 2048 (the ResNet50 embedding size) + 256 (the encoding size of the HTML encoder) + 1024 (the size of the feature hash over the n-grams of the URL). In one production system, new phishing websites are submitted hourly by the security team. In one example phishing website, the HTML script starts, then a blank section is detected, and then the HTML script ends. The disclosed technology supports timely detection of new phishing websites.

The phishing classifier layer 475 of the disclosed phishing detection engine 402 is trained on URL feature hashes, HTML encodings of HTML tokens extracted from content pages, and embeddings of images captured from the content pages of example URLs, with each example URL accompanied by a ground truth 472 classification as phishing or not phishing. After training, the phishing classifier layer 475 processes the URL feature hashes 442, HTML encoding 465, and image embedding 457 to generate at least one likelihood score 485 that the URL 414 and the content accessed via the URL present a phishing risk. The likelihood score 485 represents how likely it is that the URL and the content accessed via the URL present a phishing risk. Example pseudocode for training the model is listed next. It combines a classification loss, clf_loss, for the binary phishing-or-not decision, with a generative loss, gen_loss, which is the difference between what the model expects to see and what it sees.
def training_step(self, batch, batch_idx):
    html_tokens, snapshot, label, resnet_embed, domain_features = batch
    # Fine-tuning and classification
    if self.classify:
        # Encode the HTML tokens; logits are the generated rendering of the page
        embedding, logits = self.gpt(html_tokens, classify=True)
        gen_loss = self.criterion(logits, snapshot)
        # Concatenate HTML encoding, image embedding and URL feature hashes
        clf_logits = self.concat_layer(torch.cat([embedding, resnet_embed, domain_features], dim=1))
        clf_loss = self.clf_criterion(clf_logits, label)
        # Joint loss for classification
        loss = clf_loss + gen_loss
    # Generative pre-training
    else:
        generated_img = self.gpt(html_tokens)
        loss = self.criterion(generated_img, snapshot)
    return loss

FIG. 5 illustrates a block diagram of a reference residual neural network (ResNet) that is pre-trained for image classification prior to use in the phishing detection engine 402.

In inline phishing detection, the web page is rendered on the user side at the user endpoint 166, so a snapshot of the page is not available. ResNet is therefore not utilized, and the phishing detection classifier does not have access to the header information of the content page. Next, a disclosed classifier system is described that utilizes the URL and the HTML tokens extracted from the page to classify the URL and the content page accessed via the URL as phishing or not. This third system is particularly useful in production environments in which access to a snapshot of the visited content and to the header information is not available, and it can operate in real time in a network security system.

Another disclosed classifier system applies ML/DL to classify a URL and the content accessed via the URL as phishing or not when access to a snapshot of the visited content and to the header information is not available. FIG. 6 illustrates a high-level block diagram 600 of a disclosed phishing detection engine 602 that utilizes ML/DL with a URL embedder and an HTML encoder. The HTML encoding is extracted from the content page pointed to by the URL. The disclosed phishing classifier layer 675 generates at least one likelihood score 685 that the URL and the content accessed via the URL present a phishing risk.

The phishing detection engine 602 uses a URL link sequence extractor 622 that extracts characters in a predetermined character set from a URL 614 to generate a URL character sequence 642. A one-dimensional convolutional neural network (Conv1D) URL embedder 652 generates a URL embedding 653. Before being used for classification, the URL embedder 652 and the URL classifier 654 are trained using example URLs accompanied by ground truth 632 that classifies each URL as phishing or not phishing. The dashed block outline of the trained URL classifier 654 distinguishes training from the later processing of active URLs. During training of the URL embedder 652, differences are backpropagated past the phishing classifier layer to the embedding layers used to generate the URL embedding.
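A minimal sketch of the character-sequence extraction step described above. The character set, the lowercasing, the zero padding, and the function name `url_to_char_sequence` are hypothetical, because the text says only that characters in a predetermined character set are extracted. The 100-character limit follows the FIG. 10 description later in this document.

```python
import string

# Hypothetical character set; the source says only "a predetermined character set".
CHARSET = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(CHARSET)}  # 0 is reserved for padding
MAX_URL_CHARS = 100  # FIG. 10 description: first 100 characters of the URL


def url_to_char_sequence(url, max_len=MAX_URL_CHARS):
    """Keep only in-set characters, truncate to max_len, and right-pad with zeros."""
    indices = [CHAR_TO_INDEX[ch] for ch in url.lower() if ch in CHAR_TO_INDEX]
    indices = indices[:max_len]
    return indices + [0] * (max_len - len(indices))


seq = url_to_char_sequence("https://example.com/Login?id=1")
print(len(seq), seq[:8])
```

The fixed-length index sequence is the kind of input a Conv1D embedder can consume after one-hot encoding.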

Continuing with the description of the system 600 illustrated in FIG. 6, the phishing detection engine 602 also utilizes the disclosed HTML encoder 664. The encoder is trained using HTML tokens 646 parsed by the HTML parser 636 from the content page at the example URL 616; the tokens are encoded and then decoded to reproduce the image captured from the rendering of the content page. Parsing extracts meaning from the available metadata. In one implementation, tokenization operates as the first step of parsing to identify HTML tokens within the stream of metadata, and parsing then uses the context in which each token is found to determine the meaning and/or the kind of information being referenced. The HTML encoder 664 generates an HTML encoding 665 of the HTML tokens 646 extracted from the content page 616.

During training, the headless browser 628 captures an image of the content page of the URL for use in pre-training. The dashed line from the captured image 648 to the dashed block outline of the generative training decoder 668, which generates the rendered image of the page from the encoder embedding, distinguishes training from the later processing of active URLs. For training, the HTML encoder 664 is initialized with random initial parameters and parameters of the HTML. The HTML encoder 664 is pre-trained using generative pre-training, in which unsupervised pre-training over a large amount of unlabeled data is used to learn the data distribution P(X) for subsequent supervised decision making with P(Y|X). Once the HTML encoder 664 is trained, it is reused in production. During training of the encoder 664, differences are backpropagated past the phishing classifier layer to the encoding layers used to generate the HTML encoding. In one embodiment, the training data includes 206,224 benign pages and 69,808 phishing pages.

Continuing further with the description of the system 600, the phishing classifier layer 675 is trained on the URL embeddings and HTML encodings of example URLs, with each example URL accompanied by a ground truth classification 632 as phishing or not phishing. During training of the URL embedder 652, differences are backpropagated past the phishing classifier layer to the encoding layers used to generate the URL embedding 653. That is, once the HTML encoder 664 has been pre-trained, the URL embedding 653 network is trained together with the rest of the network (the classification layer 675 and a fine-tuning step for the HTML encoder 664), using the loss function and example URLs that have ground truth 632 for the input information.

After training, the phishing classifier layer 675 processes the concatenated input of the URL embedding 653 and the HTML encoding 665 to generate at least one likelihood score that the URL and the content accessed via the URL present a phishing risk. The phishing detection engine 602 applies the phishing classifier layer 675 to the concatenated input of the URL embedding and the HTML encoding to generate at least one likelihood score 685 that the URL and the content accessed via the URL present a phishing risk.

The HTML parser 636 extracts HTML tokens from the content page accessed via the URL. In one example, the HTML parser 636 is configurable to extract from the content the HTML tokens that belong to a predetermined token vocabulary and to ignore parts of the content that do not belong to the vocabulary, such as line breaks and line feeds. In one embodiment, to determine how many HTML tokens to specify for training the HTML encoder 664, 700K HTML files in the data store 164 were scanned and the resulting top 10K tokens representing the content pages were extracted. These were used to configure the phishing detection engine 602 for classification that does not suffer from a significant number of false positive results. In one example content page, 800 valid tokens were extracted; in another example, 2K valid tokens were recognized; and in a third example, approximately 1K tokens were collected. Training is used to learn which ordering patterns of HTML tokens give rise to a particular content page.
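The vocabulary step described above can be sketched as follows. The token pattern (tag names and alphanumeric words), the lowercasing, and the function names are hypothetical. The text does not define what counts as an HTML token, only that a top-K vocabulary is derived from a corpus and that out-of-vocabulary content is ignored.

```python
import re
from collections import Counter

# Assumed token pattern: opening/closing tag names, or alphanumeric words.
TOKEN_RE = re.compile(r"</?[A-Za-z][A-Za-z0-9]*|[A-Za-z0-9_]+")


def tokenize(html):
    """Split raw HTML into lowercase tokens; whitespace and line breaks are dropped."""
    return [tok.lower() for tok in TOKEN_RE.findall(html)]


def build_vocabulary(documents, top_k):
    """Keep the top_k most frequent tokens across a corpus of HTML documents."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    return {tok: idx for idx, (tok, _) in enumerate(counts.most_common(top_k))}


def extract_known_tokens(html, vocabulary):
    """Return only the tokens that belong to the vocabulary, in page order."""
    return [tok for tok in tokenize(html) if tok in vocabulary]


corpus = [
    "<html><body>\n<form><input></form>\n</body></html>",
    "<html><body><form>password</form></body></html>",
]
vocab = build_vocabulary(corpus, top_k=4)
tokens = extract_known_tokens("<html>\n<form>\r\nrare</form></html>", vocab)
print(tokens)
```

In the embodiment above, the corpus would be the 700K scanned files and `top_k` would be 10K. The two-document corpus here is only for demonstration.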

For real-time phishing detection systems that utilize an inline implementation, care is taken to minimize the size of the vocabulary, for the sake of speed. For the phishing detection engine 602, in one embodiment the HTML parser 636 is configured to extract up to 64 HTML tokens for generating the HTML encoding. Different implementations can use a different number of HTML tokens for the encoding. In another embodiment, the headless browser can be configured to extract up to 128, 256, 1024, or 4096 HTML tokens for generating the HTML encoding.
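The token limit can be illustrated with a short sketch that maps tokens to ids, truncates to the configured maximum, and pads short pages. The padding id, the example vocabulary, and the function name `encode_token_ids` are hypothetical. The text states only the maximum token counts.

```python
MAX_HTML_TOKENS = 64  # one embodiment in the text; 128, 256, 1024, or 4096 are also mentioned
PAD_ID = 0            # hypothetical padding id


def encode_token_ids(tokens, vocabulary, max_tokens=MAX_HTML_TOKENS):
    """Map tokens to ids, drop unknown tokens, truncate, and right-pad."""
    ids = [vocabulary[tok] for tok in tokens if tok in vocabulary][:max_tokens]
    return ids + [PAD_ID] * (max_tokens - len(ids))


vocabulary = {"<html": 1, "<body": 2, "<form": 3, "<input": 4}
short_page = ["<html", "<body", "<form", "<input", "unknown"]
long_page = ["<input"] * 500

print(encode_token_ids(short_page, vocabulary)[:6])  # [1, 2, 3, 4, 0, 0]
print(len(encode_token_ids(long_page, vocabulary)))  # 64
```

A fixed sequence length bounds the per-page cost of the encoder, which matters for the inline, real-time case.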

Phishing patterns evolve constantly, and it is often difficult for a detection method to achieve a high true positive rate (TPR) while maintaining a low false positive rate (FPR). A precision-recall curve shows the relationship between precision (= positive predictive value) and recall (= sensitivity) across the possible cutoffs.
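The metrics named above follow directly from confusion counts at a given cutoff. The sketch below uses made-up labels and scores purely for illustration; the function names are hypothetical.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are flagged as phishing (label 1)."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (precision, recall, fpr); recall is the same quantity as TPR."""
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, fpr


# Illustrative data only: 3 phishing pages (1) and 5 benign pages (0).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.40, 0.70, 0.30, 0.20, 0.10, 0.05]

for cutoff in (0.5, 0.75):
    print(cutoff, rates(*confusion_counts(labels, scores, cutoff)))
```

Raising the cutoff from 0.5 to 0.75 removes the single false positive in this toy data without changing recall, which is the kind of trade-off a precision-recall curve summarizes.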

FIG. 7 shows a precision-recall graph for several of the disclosed phishing detection systems. The region of interest is near the top of the graph, where precision is close to 1.000, because very few false positive detections can be tolerated. The curve for HTML+URL+header content is represented by the dotted line 746. Snapshot (ResNet)+BERT, represented as the curve with long dashes 736, is more precise than HTML+URL+header content, and Snapshot, represented as the solid curve 726, is the most precise phishing detection technique. BERT is computationally expensive, and the graph shows that BERT is not needed to obtain precise phishing detection results.

FIG. 8 illustrates receiver operating characteristic (ROC) curves for the phishing website detection described above. A ROC curve is a plot of the true positive rate (TPR) as a function of the false positive rate (FPR) at various threshold settings. The region of interest is the area under the curve at very low FPR, because false positive identification of content pages as phishing websites is not sustainable. ROC curves are useful for comparing the systems described above for detecting phishing websites. For the phishing detection engine 202, the ROC curve labeled +Snapshot+Bert 836 and illustrated with long dashes shows the results of a system that utilizes ML/DL with a URL feature hash, an encoding of NL words, and an embedding of the captured website image. In the second system, the phishing detection engine 402 utilizes ML/DL with a URL feature hash, an encoding of HTML tokens extracted from the content page, and an embedding of the image captured from the content page to detect phishing sites. The curve labeled +Snapshot 826 shows higher system accuracy, with fewer false positives, than HTML-URL-Header 846, whose curve is illustrated with dots.
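A ROC curve of the kind described above can be traced by sweeping the threshold over the observed scores. The sketch below uses made-up labels and scores; the helper `best_tpr_at_fpr`, which reads off the best operating point within a false positive budget, is a hypothetical illustration of focusing on the low-FPR region.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs, one per distinct score threshold, starting at (0, 0)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def best_tpr_at_fpr(points, max_fpr):
    """Highest TPR among operating points whose FPR does not exceed max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


# Illustrative data only: 3 phishing pages (1) and 5 benign pages (0).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.40, 0.70, 0.30, 0.20, 0.10, 0.05]

points = roc_points(labels, scores)
print(points)
print(best_tpr_at_fpr(points, max_fpr=0.0))
```

Comparing two classifiers by their TPR at a fixed, very small FPR is one way to express the comparison the figure makes visually.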

Continuing with the description of the ROC curves of FIG. 8, which cover different combinations of features including multilingual BERT embeddings, the comparison illustrates that text embeddings such as BERT can in some cases impair the effectiveness of the model. The HTML encoder already takes the text content into account, so including more text encodings can lead to overfitting to the text of the HTML page. +Snapshot 826 has higher accuracy and leads to the lowest number of false positives (FPs) in production. Furthermore, a version of the model without the image embedding yields a system that can bypass the active scanner/headless browser functionality in environments, such as runtime environments, in which snapshots are not available. In one example, running a headless browser may be too expensive and may not scale to the vast number of URLs encountered in a production environment. Furthermore, attackers can evade detection in such environments.

FIG. 9 illustrates a receiver operating characteristic (ROC) curve for phishing website detection by the phishing detection engine 602, which utilizes ML/DL with a URL embedder and an HTML encoder. The ROC curve 936 is a plot of the true positive rate (TPR) as a function of the false positive rate (FPR) at various threshold settings. The ROC curve 936 illustrates that the phishing detection engine 602 has a higher true positive rate than the phishing detection systems whose ROC curves are shown in FIG. 8.

The URL embedder 652 and the HTML encoder 664 of the phishing detection engine 602 comprise high-level programs that have irregular memory access patterns or data-dependent flow control. A high-level program is source code written in a programming language such as C, C++, Java, Python, or Spatial. The high-level program can implement the computational structures and algorithms of machine learning models such as AlexNet, VGGNet, GoogLeNet, ResNet, ResNeXt, RCNN, YOLO, SqueezeNet, SegNet, GAN, BERT, ELMo, USE, Transformer, and Transformer-XL. In one example, the high-level program can implement a convolutional neural network with several processing layers, such that each processing layer can include one or more nested loops. The high-level program can execute irregular memory operations that involve accessing inputs and weights and performing matrix multiplications between the inputs and the weights. The high-level program can include nested loops with high iteration counts, and loop bodies that load input values from a preceding processing layer and multiply them by the weights of a succeeding processing layer to produce the output of the succeeding processing layer. The high-level program can have loop-level parallelism in the outermost loop body, which can be exploited using coarse-grained pipelining. The high-level program can have instruction-level parallelism in the innermost loop body, which can be exploited using loop unrolling, single instruction, multiple data (SIMD) vectorization, and pipelining.
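The loop structure described above can be made concrete with a plain nested-loop dense layer. This is a generic illustration of a layer written as a loop nest, not code from the disclosed system; the function name `dense_layer` and the toy values are hypothetical.

```python
def dense_layer(inputs, weights):
    """Compute outputs[b][j] = sum_k inputs[b][k] * weights[k][j] with explicit loops."""
    batch, in_features = len(inputs), len(inputs[0])
    out_features = len(weights[0])
    outputs = [[0.0] * out_features for _ in range(batch)]
    for b in range(batch):                # outermost loop: samples are independent
        for j in range(out_features):     # middle loop: one output feature at a time
            acc = 0.0
            for k in range(in_features):  # innermost loop: multiply-accumulate
                acc += inputs[b][k] * weights[k][j]
            outputs[b][j] = acc
    return outputs


inputs = [[1.0, 2.0], [3.0, 4.0]]
weights = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
print(dense_layer(inputs, weights))  # [[1.0, 2.0, 8.0], [3.0, 4.0, 18.0]]
```

Iterations of the outermost loop do not depend on each other, which is the property that coarse-grained pipelining exploits. The innermost multiply-accumulate loop is the natural target for unrolling and SIMD vectorization.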

FIG. 10 illustrates a computational dataflow graph of the functionality of the one-dimensional convolutional neural network (Conv1D) URL embedder 652, which generates the URL embedding 653, with C++ code expressed in Open Neural Network Exchange (ONNX) format. The input is the first 100 characters of the URL 1014, which in one example embodiment are one-hot encoded. This leads to a convolution block 1024 with weights of 256x56x7 (kernel size = 7, applied like a sliding window) that generates a binary output of multi-dimensional features. The output 1064, with dimensions 1x32x8 (1x256), represents the final URL embedding 653 generated as input to the phishing classifier layer 675.
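A small sketch of one-hot encoding followed by a sliding-window 1D convolution. The reading of 256x56x7 as (output channels, input channels, kernel size) is an assumption, as are the stride of 1 and zero padding used here. The toy example uses 3 input channels and 2 filters so that it runs instantly in pure Python, and all function names are hypothetical.

```python
def one_hot(indices, num_classes):
    """Return a channels-first one-hot matrix of shape (num_classes, len(indices))."""
    matrix = [[0.0] * len(indices) for _ in range(num_classes)]
    for position, index in enumerate(indices):
        matrix[index][position] = 1.0
    return matrix


def conv1d(x, weights):
    """Stride-1 1D convolution without padding (cross-correlation, as in DL frameworks).

    x: (in_channels, length); weights: (out_channels, in_channels, kernel_size).
    """
    in_channels, length = len(x), len(x[0])
    kernel_size = len(weights[0][0])
    out_length = length - kernel_size + 1
    out = []
    for w_out in weights:
        row = []
        for start in range(out_length):
            row.append(sum(
                w_out[c][k] * x[c][start + k]
                for c in range(in_channels)
                for k in range(kernel_size)
            ))
        out.append(row)
    return out


def conv1d_output_length(length, kernel_size, stride=1, padding=0):
    """Standard output-length formula for a 1D convolution."""
    return (length + 2 * padding - kernel_size) // stride + 1


x = one_hot([0, 1, 2, 1, 0], num_classes=3)      # toy input: 5 characters, 3 symbols
filters = [
    [[1.0] * 3, [1.0] * 3, [1.0] * 3],           # counts every character in the window
    [[0.0] * 3, [1.0] * 3, [0.0] * 3],           # counts occurrences of symbol 1
]
print(conv1d(x, filters))
print(conv1d_output_length(100, 7))  # 94 positions for 100 characters, under the assumptions above
```

With one-hot input, each filter effectively scores a 7-character pattern at every window position. The source does not say how the convolution output is reduced to the final 1x32x8 embedding, so that step is not sketched.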

FIG. 11 shows a block diagram of the disclosed HTML encoder 664, which generates the HTML encoding 665 as input to the phishing classifier layer 675. The HTML encoder architecture is pre-trained with the help of a convolutional decoder that reconstructs the images seen in the training data. The decoder is typically a convolutional neural network (CNN). The training forces the HTML encoder to learn to represent HTML content in terms of its rendered image, thereby skipping irrelevant parts of the HTML. This training is aligned with the way phishing attacks are launched: as long as a rendered page keeps mimicking the appearance of a legitimate page, users will fall victim to the phishing attack.

An overview of the functionality of the blocks follows. Input embedding 1112 takes in the 64 HTML tokens extracted by the HTML parser 636, as described earlier, and maps the HTML tokens to the vocabulary. Positional encoding 1122 adds context information for the vector of HTML tokens. Multi-head attention 1132 generates multiple self-attention vectors for identifying which elements of the input to focus on. Abstract vectors Q, K, and V extract different components of the input and are used to compute the attention vectors. The multiple attention vectors represent the relationships between the HTML vectors. Multi-head attention 1132 is repeated four times in the illustrated computational dataflow graph described with respect to FIG. 12A to FIG. 12D below, which represents second-order multi-head attention. In different embodiments, the number of heads can be doubled or made even larger. Multi-head attention 1132 passes the attention vectors, one vector at a time, to the feedforward network 1162. The feedforward network 1162 transforms the vectors for the next block. Each block ends with an addition and normalization operation, indicated by addition and normalization 1172, that smooths and normalizes the layer across each feature, and the HTML representation is compressed to the 256 numbers that reproduce the images seen in the training data. The output represents the final HTML encoding 665, generated as a 256-number input to the phishing classifier layer 675.
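The role of Q, K, and V can be illustrated with the standard scaled dot-product attention used in Transformer models, softmax(QK^T / sqrt(d))V. This is a generic sketch, not the patented encoder: the learned projection matrices are replaced by the identity to keep it short, and all names and the toy input are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    output = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        output.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return output


def multi_head_self_attention(x, num_heads):
    """Split features into heads, attend within each head, then concatenate.

    Learned Q/K/V and output projections are omitted (identity) in this sketch;
    a trained encoder would apply weight matrices at those points.
    """
    head_dim = len(x[0]) // num_heads
    heads = []
    for h in range(num_heads):
        part = [row[h * head_dim:(h + 1) * head_dim] for row in x]
        heads.append(attention(part, part, part))
    return [sum((head[i] for head in heads), []) for i in range(len(x))]


token_vectors = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
attended = multi_head_self_attention(token_vectors, num_heads=2)
print(len(attended), len(attended[0]))  # 3 4
```

Each output row is a weighted average of the value rows, with weights that depend on how strongly the query matches each key. That is the sense in which attention vectors encode relationships between token vectors.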

FIG. 12A shows an overview block diagram of the disclosed HTML encoder 664, which produces the HTML encoding 665 that is input to the phishing classifier layer 675. Input encoding and positional embedding 1205 is described in connection with FIG. 11 above, and a detailed ONNX listing is illustrated in FIG. 12B. Multi-head attention 1225 is described in connection with FIG. 11 above, and FIG. 12C shows a detailed ONNX image, with the input from the input encoding and positional embedding shown as input to the operator schemas that implement the block from FIG. 12B. Addition and normalization and feedforward 1245 is also described in connection with FIG. 11, with a detailed ONNX listing in FIG. 12D. FIG. 12A also includes a ReduceMean 1265 operator, which reduces the dimensionality of the input tensor by computing the mean of the elements of the input tensor along the provided axes. The HTML encoding 665 output is a 1x256 vector 1285 that is generated and mapped as input to the phishing classifier layer 675. Details of the inputs, outputs, and operations performed by the ONNX operators are well known to those skilled in the art.

FIG. 12B, FIG. 12C, and FIG. 12D together illustrate a computational dataflow graph of the functionality of the HTML encoder 664, which produces the HTML encoding 665 that is input to the phishing classifier layer 675, with C++ code expressed in Open Neural Network Exchange (ONNX) format. In another embodiment, the ONNX code can express a different programming language.

FIG. 12B shows one section of the dataflow graph in two columns separated by a dotted line, with the connector at the bottom of the left column flowing into the top of the right column. The results at the bottom of the right column of FIG. 12B flow into FIG. 12C and FIG. 12D. FIG. 12B illustrates the input encoding and positional embedding, as shown by the gather operator 1264. For the input embedding 1112, the gather block 1264 gathers data of dimensions 64x256.
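A gather over an embedding table is a row lookup, which is how 64 token ids become a 64x256 block of data. The sketch below assumes a hypothetical vocabulary size of 1000 and random table values; only the 64 and 256 dimensions come from the text.

```python
import random

VOCAB_SIZE = 1000   # hypothetical; the source does not state the table size here
EMBED_DIM = 256     # per-token feature width (from the 64x256 dimensions in the text)
NUM_TOKENS = 64     # tokens per page (from the text)

rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)
]


def gather(table, indices):
    """Select rows of `table` by index, like an ONNX Gather along axis 0."""
    return [table[i] for i in indices]


token_ids = [rng.randrange(VOCAB_SIZE) for _ in range(NUM_TOKENS)]
embedded = gather(embedding_table, token_ids)
print(len(embedded), len(embedded[0]))  # 64 256
```

Because the lookup only indexes rows, repeated token ids yield identical embedding rows; the positional embedding is what lets the encoder tell those positions apart.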

FIG. 12C illustrates a single iteration of multi-head attention 1225, showing an example of a dataflow graph with compute nodes that asynchronously transmit data along data connections. The dataflow graph represents the so-called multi-head attention module of the Transformer model. In one embodiment, the dataflow graph shows multiple loops executing in parallel as separate processing pipelines to process input tensors across the multiple processing pipelines, with a loop nest in which the loops are arranged in a hierarchy of levels, such that a loop at a second level is within a loop at a first level, and with gather, unsqueeze, and concat operations. The gather operations (three in total) refer to the use of the query, key, and value vectors in the multi-head attention layer. In this example of the disclosed model, two heads were utilized, which results in a concat operation for each of these vectors. In the illustrated embodiment, the respective outputs of each of the processing pipelines are concatenated to produce the concatenated outputs A2, B2, C2, D2. As illustrated by A2, B2, C2, D2 at the bottom of FIG. 12C and at the top of FIG. 12D, the outputs from the multi-head attention functionality flow into addition and normalization and feedforward.

FIG. 12D shows the addition and normalization and feedforward 1245 functionality using ONNX operations. FIG. 12D is illustrated using three columns separated by dotted lines, with the connector at the bottom of the left column flowing into the top of the middle column, and the connector at the bottom of the middle column flowing into the operations in the right column. The output of multi-head attention (shown in FIG. 12C) flows into the addition and normalization and feedforward operations, and the softmax operation 1232 transforms the input vector, normalizing it into a probability distribution that flows into the matrix multiplier MatMul 1242. The outputs Ax, Bx, Cx, Dx of addition and normalization and feedforward 1245 (shown in the lower right corner of FIG. 12D) flow into the ReduceMean 1265 operator (FIG. 12A). The ReduceMean 1265 operator reduces the dimensionality of the input tensor (Ax, Bx, Cx, Dx) by computing the mean of the elements of the input tensor along the provided axes. The output of the HTML encoding 665 is a 1x256 vector 1285.
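The mean reduction can be sketched as an average over one axis of a matrix. Treating the reduced axis as the token axis, so that a 64x256 block becomes 1x256, is an assumption consistent with the stated shapes. The source says only that the mean is taken along the provided axes, and the function name is hypothetical.

```python
def reduce_mean_axis0(matrix):
    """Average over rows and keep a leading dimension of 1 (like keepdims=1)."""
    rows, cols = len(matrix), len(matrix[0])
    return [[sum(matrix[r][c] for r in range(rows)) / rows for c in range(cols)]]


small = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(reduce_mean_axis0(small))  # [[3.0, 4.0]]

# Shape check with the dimensions from the text: 64 token rows of 256 features.
page = [[float(r)] * 256 for r in range(64)]
pooled = reduce_mean_axis0(page)
print(len(pooled), len(pooled[0]))  # 1 256
```

Mean pooling makes the encoding size independent of how many token positions feed it, which is what lets the classifier take a fixed 1x256 vector.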

FIG. 13 illustrates a computational dataflow graph of the functionality of the phishing classifier layer 675, which generates the likelihood score(s) 685 that represent how likely it is that a particular website is a phishing website, with C++ code expressed in Open Neural Network Exchange (ONNX) format. The input 1314 to the phishing classifier layer 675, of size 1x512, is formed by concatenating the URL embedding and the HTML encoding. The two concatenated vectors are, as described earlier, the 1x256 HTML encoding vector and the URL embedding with dimensions 1x32x8 (1x256). Batch normalization 1324, applied to the activations of the previous layer, standardizes the input and accelerates training. The General Matrix Multiplication (GEMM) operators 1346, 1366 represent linear algebra routines that are fundamental operators in DL. The output 1374 of the final two-layer feedforward classifier, of size 1x2, is the likelihood that the web page is a phishing site and the likelihood that the web page is not a phishing site, for classifying the site as phishing or not phishing.
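The classifier head described above can be sketched as concatenation, batch normalization, and two GEMM layers. The hidden width of 16, the ReLU between the layers, the final softmax, the output ordering, and the random weights are all assumptions made for illustration. The text specifies only the 1x512 input, the batch normalization, the two GEMM operators, and the 1x2 output.

```python
import math
import random


def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    """Inference-time batch normalization of a single feature vector."""
    return [
        g * (v - m) / math.sqrt(s + eps) + b
        for v, m, s, g, b in zip(x, mean, var, gamma, beta)
    ]


def gemm(x, weights, bias):
    """y = x W + b for one row vector x; weights has shape (len(x), len(bias))."""
    return [
        sum(xi * row[j] for xi, row in zip(x, weights)) + bias[j]
        for j in range(len(bias))
    ]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_params(in_dim, hidden_dim, out_dim, seed=0):
    """Random stand-in parameters; a real system would load trained values."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "bn": {"mean": [0.0] * in_dim, "var": [1.0] * in_dim,
               "gamma": [1.0] * in_dim, "beta": [0.0] * in_dim},
        "fc1": (matrix(in_dim, hidden_dim), [0.0] * hidden_dim),
        "fc2": (matrix(hidden_dim, out_dim), [0.0] * out_dim),
    }


def classify(url_embedding, html_encoding, params):
    x = list(url_embedding) + list(html_encoding)            # 1 x 512 concatenated input
    x = batch_norm(x, **params["bn"])
    hidden = [max(0.0, v) for v in gemm(x, *params["fc1"])]  # ReLU is an assumption
    return softmax(gemm(hidden, *params["fc2"]))             # assumed order: [phishing, not phishing]


params = make_params(in_dim=512, hidden_dim=16, out_dim=2)
scores = classify([0.5] * 256, [-0.25] * 256, params)
print(scores, sum(scores))
```

With trained parameters in place of the random ones, the first output would play the role of the likelihood score 685 under the ordering assumed here.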

Computer System
FIG. 14 is a simplified block diagram of a computer system 1400 that can be used to classify a URL and the content page accessed via the URL as phishing or not phishing. The computer system 1400 includes at least one central processing unit (CPU) 1472 that communicates with a number of peripheral devices via a bus subsystem 1455, and a network security system 112 for providing the network security services described herein. These peripheral devices can include, for example, a storage subsystem 1410 that includes memory devices and a file storage subsystem 1436, user interface input devices 1438, user interface output devices 1476, and a network interface subsystem 1474. The input and output devices allow user interaction with the computer system 1400. The network interface subsystem 1474 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.

In one implementation, the cloud-based security system 153 of FIG. 1 is communicably linked to the storage subsystem 1410 and the user interface input devices 1438.

The user interface input devices 1438 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term "input device" is intended to include all possible types of devices and ways to input information into the computer system 1400.

The user interface output devices 1476 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as an audio output device. In general, use of the term "output device" is intended to include all possible types of devices and ways to output information from the computer system 1400 to the user or to another machine or computer system.

The storage subsystem 1410 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. The subsystem 1478 can be a graphics processing unit (GPU) or a field-programmable gate array (FPGA).

The memory subsystem 1422 used in the storage subsystem 1410 can include a number of memories, including a main random access memory (RAM) 1432 for storing instructions and data during program execution and a read-only memory (ROM) 1434 in which fixed instructions are stored. The file storage subsystem 1436 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by the file storage subsystem 1436 in the storage subsystem 1410, or in other machines accessible by the processor.

The bus subsystem 1455 provides a mechanism for letting the various components and subsystems of the computer system 1400 communicate with each other as intended. Although the bus subsystem 1455 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple buses.

The computer system 1400 itself can be of varying types, including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of the computer system 1400 depicted in FIG. 14 is intended only as a specific example for purposes of illustrating the preferred embodiments of the present invention. Many other configurations of the computer system 1400 are possible, having more or fewer components than the computer system depicted in FIG. 14.

Specific Implementations
Some specific implementations and features for classifying a URL, and content accessed via the URL, as phishing or not phishing are described in the following discussion.

In one disclosed implementation, a phishing classifier that classifies a URL and content accessed via the URL as phishing or not phishing includes a URL feature hasher that parses the URL into features and hashes the features to produce URL feature hashes, and a headless browser configured to extract words from a rendered content page and capture an image of at least part of the rendered content page. The disclosed implementation also includes a natural language encoder that produces a word encoding of the extracted words, an image embedder that produces an image embedding of the captured image, and a phishing classifier layer that processes a concatenated input of the URL feature hashes, the word encoding, and the image embedding to produce at least one likelihood score that the URL and the content accessed via the URL present a phishing risk.

In another disclosed implementation, a phishing classifier that classifies a URL and content accessed via the URL as phishing or not phishing includes a URL feature hasher that parses the URL into features and hashes the features to produce URL feature hashes, and a headless browser configured to access the content at the URL and internally render a content page, extract words from the rendering of the content page, and capture an image of at least part of the rendering of the content page. The disclosed implementation also includes a natural language encoder, pre-trained on natural language, that produces a word encoding of the words extracted from the content page, and an image embedder, pre-trained on images, that produces an image embedding of the image captured from the content page. The implementation further includes a phishing classifier layer trained on the URL feature hashes, word encodings, and image embeddings of example URLs, each example URL accompanied by a ground truth classification as phishing or not phishing. The phishing classifier layer processes a concatenated input of the URL feature hashes, the word encoding, and the image embedding of the URL to produce at least one likelihood score that the URL and the content accessed via the URL present a phishing risk.

In some disclosed implementations of the phishing classifier, the natural language encoder is one of Bidirectional Encoder Representations from Transformers (BERT) and Universal Sentence Encoder. The image embedder is one of a residual neural network (ResNet), Inception-v3, and VGG-16.

In one implementation, a disclosed computer-implemented method of classifying a URL and content accessed via the URL as phishing or not phishing includes applying a URL feature hasher, extracting features from the URL, and hashing the features to produce URL feature hashes. The disclosed method also includes applying a natural language encoder, pre-trained on natural language, that produces a word encoding of words parsed from a rendering of the content, and applying an image encoder, pre-trained on images, that produces an image embedding of an image captured from at least part of the rendering. The disclosed method further includes applying a phishing classifier layer trained on concatenations of the URL feature hashes, word encodings, and image embeddings of example URLs accompanied by ground truth classifications as phishing or not phishing, and processing the URL feature hashes, the word encoding, and the image embedding to produce at least one likelihood score that the URL and the content accessed via the URL present a phishing risk.
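The URL feature hashing step can be illustrated with the standard hashing trick. The sketch below is a minimal example under stated assumptions: the particular features (host labels, path and query tokens, scheme), the 1024-bucket vector, the signed update, and the use of SHA-256 are hypothetical choices for illustration and are not taken from the description, which states only that the URL is parsed into features and the features are hashed.

```python
import hashlib
import re
from urllib.parse import urlsplit

_TOKEN_SPLIT = re.compile(r"[^A-Za-z0-9]+")


def parse_url_features(url):
    """Parse a URL into string features (the feature set is an assumption)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    features = [f"scheme:{parts.scheme}"]
    features += [f"host:{label}" for label in host.split(".") if label]
    features += [f"path:{tok}" for tok in _TOKEN_SPLIT.split(parts.path) if tok]
    features += [f"query:{tok}" for tok in _TOKEN_SPLIT.split(parts.query) if tok]
    return features


def hash_features(features, num_buckets=1024):
    """Map features of arbitrary number into a fixed-size vector (hashing trick)."""
    vector = [0.0] * num_buckets
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % num_buckets
        sign = 1.0 if digest[4] & 1 else -1.0   # signed update reduces collision bias
        vector[index] += sign
    return vector


example_features = parse_url_features("https://login.example.com/secure/update?id=42")
example_hash = hash_features(example_features)
```

Because the bucket index depends only on the feature string, the same URL always produces the same vector, and the vector width stays fixed no matter how many features a URL yields.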

The methods described in this section and other sections of the disclosed technology can include one or more of the following features and/or features described in connection with additional methods disclosed. In the interest of conciseness, the combinations of features disclosed in this application are not individually enumerated and are not repeated with each base set of features. The reader will understand how the features identified for these methods can readily be combined with the sets of base features identified as implementations.

One disclosed computer-implemented method further includes applying a headless browser to access the content via the URL and internally render the content, parsing words from the rendered content, and capturing an image of at least part of the rendered content.

One embodiment of the disclosed computer-implemented method includes the natural language encoder being one of Bidirectional Encoder Representations from Transformers (BERT) and Universal Sentence Encoder. Some embodiments of the disclosed computer-implemented method also include the image embedder being one of a residual neural network (ResNet), Inception-v3, and VGG-16.

One disclosed computer-implemented method of training a phishing classifier layer to classify a URL and content accessed via the URL as phishing or not phishing includes receiving and processing, for example URLs, URL feature hashes, word encodings of words extracted from the content pages, and image embeddings of images captured from renderings of the content, to produce at least one likelihood score that each example URL and the content accessed via the URL present a phishing risk. The method also includes calculating a difference between the likelihood score for each example URL and the corresponding ground truth of whether the example URL and content page are phishing or not phishing, and using the differences for the example URLs to train coefficients of the phishing classifier layer. The method further includes saving the trained coefficients for use in classifying production URLs, and content pages accessed via the production URLs, as phishing or not phishing.
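The training steps above (score, difference from ground truth, coefficient update, save) can be sketched with a single logistic unit standing in for the phishing classifier layer. Everything specific here is an assumption made for illustration: the three-element toy feature vectors, the learning rate, the epoch count, the JSON file format, and the file name `classifier_coefficients.json`.

```python
import json
import math
import os
import tempfile


def sigmoid(z):
    # Numerically stable logistic function.
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))


def predict(weights, bias, features):
    """Likelihood score that the example presents a phishing risk."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train(examples, dim, epochs=200, lr=0.5):
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for features, label in examples:
            # Difference between the likelihood score and the ground truth.
            diff = predict(weights, bias, features) - label
            # Use the difference to adjust the classifier coefficients.
            for i, x in enumerate(features):
                weights[i] -= lr * diff * x
            bias -= lr * diff
    return weights, bias


def save_coefficients(path, weights, bias):
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"weights": weights, "bias": bias}, handle)


def load_coefficients(path):
    with open(path, "r", encoding="utf-8") as handle:
        data = json.load(handle)
    return data["weights"], data["bias"]


# Toy stand-ins for concatenated inputs; label 1.0 = phishing, 0.0 = not phishing.
EXAMPLES = [([1.0, 0.0, 1.0], 1.0), ([1.0, 1.0, 1.0], 1.0),
            ([0.0, 1.0, 0.0], 0.0), ([0.0, 0.0, 0.0], 0.0)]
weights, bias = train(EXAMPLES, dim=3)

# Save the trained coefficients, then reload them as a production system would.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "classifier_coefficients.json")
    save_coefficients(path, weights, bias)
    restored_weights, restored_bias = load_coefficients(path)
```

The reloaded coefficients can then be passed to `predict` to score production inputs without retraining.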

The disclosed computer-implemented method further includes not backpropagating the differences beyond the phishing classifier layer to the encoding layers used to produce the word encodings, and not backpropagating the differences beyond the phishing classifier layer to the embedding layers used to produce the image embeddings.

The disclosed computer-implemented method also includes producing the URL feature hashes for each of the example URLs, producing the word encodings of words extracted from the rendering of the content page, and producing the image embeddings of the images captured from the rendering.

For many disclosed computer-implemented implementations, the disclosed method of training a phishing classifier layer to classify a URL and content accessed via the URL as phishing or not phishing includes producing the word encodings using a Bidirectional Encoder Representations from Transformers (BERT) encoder or a variation on a BERT encoder, and producing the image embeddings using one of a residual neural network (ResNet), Inception-v3, and VGG-16.

In one disclosed implementation, a phishing classifier that classifies a URL and content accessed via the URL as phishing or not phishing includes a URL feature hasher that parses the URL into features and hashes the features to produce URL feature hashes, and a headless browser configured to extract HTML tokens from a rendered content page and capture an image of at least part of the rendered content page. The disclosed classifier also includes an HTML encoder that produces an HTML encoding of the extracted HTML tokens, an image embedder that produces an image embedding of the captured image, and a phishing classifier layer that processes the URL feature hashes, the HTML encoding, and the image embedding to produce at least one likelihood score that the URL and the content accessed via the URL present a phishing risk. In some implementations, the HTML tokens belong to a recognized vocabulary of HTML tokens.

In one implementation, a disclosed phishing classifier that classifies a URL and content accessed via the URL as phishing or not phishing includes a URL feature hasher that parses the URL into features and hashes the features to produce URL feature hashes, and a headless browser configured to access the content at the URL and internally render a content page, extract HTML tokens from the content page, and capture an image of at least part of the rendering of the content page. The disclosed phishing classifier also includes an HTML encoder that produces an HTML encoding of the HTML tokens extracted from the content page. The HTML encoder is trained on HTML tokens that are extracted from the content pages of example URLs, encoded, and then decoded to reproduce the images captured from the renderings of those content pages. Also included are an image embedder, pre-trained on images, that produces an image embedding of the image captured from the content page, and a phishing classifier layer trained on the URL feature hashes, HTML encodings, and image embeddings of example URLs, each example URL accompanied by a ground truth classification as phishing or not phishing. The phishing classifier layer processes the URL feature hashes, the HTML encoding, and the image embedding of the URL to produce at least one likelihood score that the URL and the content page accessed via the URL present a phishing risk.

Some implementations of the disclosed method further include the headless browser configured to extract from the content the HTML tokens that belong to a predetermined token vocabulary and to ignore portions of the content that do not belong to the predetermined token vocabulary. Some disclosed implementations further include the headless browser configured to extract up to 64 of the HTML tokens for production of the HTML encoding.
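Vocabulary-restricted token extraction with a cap of 64 tokens can be sketched with Python's standard `html.parser` module. The contents of `TOKEN_VOCABULARY` and the decision to treat start-tag names as the tokens are assumptions made for illustration; the description specifies only that tokens outside a predetermined vocabulary are ignored and that up to 64 tokens are extracted. A static HTML string stands in here for the page a headless browser would render.

```python
from html.parser import HTMLParser

# Hypothetical vocabulary; the actual predetermined token vocabulary is not given here.
TOKEN_VOCABULARY = frozenset({
    "html", "head", "title", "meta", "link", "script", "body", "div",
    "form", "input", "button", "a", "img", "iframe",
})
MAX_TOKENS = 64


class VocabularyTokenExtractor(HTMLParser):
    """Collect start-tag names that belong to the vocabulary, up to a fixed cap."""

    def __init__(self, vocabulary=TOKEN_VOCABULARY, max_tokens=MAX_TOKENS):
        super().__init__(convert_charrefs=True)
        self.vocabulary = vocabulary
        self.max_tokens = max_tokens
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Tags outside the vocabulary, text, and comments are simply ignored.
        if tag in self.vocabulary and len(self.tokens) < self.max_tokens:
            self.tokens.append(tag)


def extract_html_tokens(html_text, vocabulary=TOKEN_VOCABULARY, max_tokens=MAX_TOKENS):
    parser = VocabularyTokenExtractor(vocabulary, max_tokens)
    parser.feed(html_text)
    parser.close()
    return parser.tokens
```

For example, `extract_html_tokens("<html><body><form><input type='password'><blink>x</blink></form></body></html>")` keeps `html`, `body`, `form`, and `input` and drops the out-of-vocabulary `blink` tag.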

One implementation of a disclosed method of classifying a URL and a content page accessed via the URL as phishing or not phishing includes applying a URL feature hasher, extracting features from the URL, and hashing the features to produce URL feature hashes. The method also includes applying an HTML encoder, trained on natural language, that produces an HTML encoding of HTML tokens extracted from a rendered content page. The method further includes applying an image embedder, pre-trained on images, that produces an image embedding of an image captured from at least part of the rendered content page; applying a phishing classifier layer trained on the URL feature hashes, HTML encodings, and image embeddings of example URLs classified with ground truth classifications as phishing or not phishing; and processing the URL feature hashes, the HTML encoding, and the image embedding of the URL to produce at least one likelihood score that the URL and the content accessed via the URL present a phishing risk. In one disclosed implementation, the method further includes applying a headless browser, accessing the content page via the URL and internally rendering the content page, parsing HTML tokens from the rendered content, and capturing an image of at least part of the rendered content.

Some disclosed implementations further include the headless browser parsing from the content the HTML tokens that belong to a predetermined token vocabulary and ignoring portions of the content that do not belong to the predetermined token vocabulary. Some implementations also include the headless browser parsing up to 64 of the HTML tokens for production of the HTML encoding.

One implementation of a disclosed computer-implemented method of training a phishing classifier layer to classify a URL and content accessed via the URL as phishing or not phishing includes receiving and processing, for example URLs, URL feature hashes, HTML encodings of HTML tokens extracted from the content pages, and image embeddings of images captured from renderings of the content pages, to produce at least one likelihood score that each example URL and the content accessed via the URL present a phishing risk. The method includes calculating a difference between the likelihood score for each example URL and the corresponding ground truth as to whether the example URL and content page are phishing or not phishing, using the calculated differences for the example URLs to train coefficients of the phishing classifier layer, and saving the trained coefficients for use in classifying production URLs, and content pages accessed via the production URLs, as phishing or not phishing.

Some implementations of the disclosed method include backpropagating the differences beyond the phishing classifier layer to the encoding layers used to produce the HTML encodings. Some implementations further include not backpropagating the differences beyond the phishing classifier layer to the embedding layers used to produce the image embeddings.
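The distinction between backpropagating into one upstream layer while stopping at another can be shown with a toy model. In this sketch the `Encoder` class is a hypothetical one-layer stand-in for an HTML encoder or image embedder, the classifier is a single logistic unit, and the `trainable` flag, the tanh activation, and the learning rate are assumptions chosen only to make the gradient flow visible.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))


class Encoder:
    """Toy upstream layer: h = tanh(W x). Frozen when trainable is False."""

    def __init__(self, weights, trainable):
        self.weights = [row[:] for row in weights]   # private copy of the matrix
        self.trainable = trainable
        self._x = None
        self._h = None

    def forward(self, x):
        self._x = list(x)
        self._h = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in self.weights]
        return self._h

    def backward(self, grad_h, lr):
        if not self.trainable:
            return                                   # the difference stops here
        for j, row in enumerate(self.weights):
            local = grad_h[j] * (1.0 - self._h[j] ** 2)   # derivative of tanh
            for i in range(len(row)):
                row[i] -= lr * local * self._x[i]


def train_step(encoder, clf_weights, x, label, lr=0.1):
    """One update; clf_weights is modified in place. Returns score minus label."""
    h = encoder.forward(x)
    score = sigmoid(sum(w * v for w, v in zip(clf_weights, h)))
    diff = score - label
    grad_h = [diff * w for w in clf_weights]         # gradient passed upstream
    for j in range(len(clf_weights)):
        clf_weights[j] -= lr * diff * h[j]           # classifier layer always trains
    encoder.backward(grad_h, lr)
    return diff
```

With `trainable=False` the encoder's weights are identical before and after `train_step`, which corresponds to a pre-trained image embedder left untouched; with `trainable=True` the same call also adjusts the encoder, which corresponds to backpropagating into the HTML encoding layers.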

Some implementations also include producing the URL feature hashes for each of the example URLs, producing the HTML encodings of HTML tokens extracted from the rendering of the content page, and producing the image embeddings of the images captured from the rendering. Some implementations further include training an HTML encoder-decoder to produce the HTML encodings using, for second example URLs, HTML tokens that are extracted from the content pages of the second example URLs, encoded, and then decoded to reproduce images captured from the content pages of the second example URLs. Some implementations of the disclosed method also include producing the image embeddings using a ResNet embedder, or a variation on a ResNet embedder, that is pre-trained to embed images in an embedding space.

One implementation of a disclosed phishing classifier that classifies a URL and a content page accessed via the URL as phishing or not phishing includes an input processor that accepts a URL for classification, a URL embedder that produces a URL embedding of the URL, an HTML parser that extracts HTML tokens from the content page accessed via the URL, an HTML encoder that produces an HTML encoding from the HTML tokens, and a phishing classifier layer that operates on the URL embedding and the HTML encoding to classify the URL and the content accessed via the URL as phishing or not phishing.

Some implementations of the disclosed phishing classifier also include a URL embedder that extracts characters in a predetermined character set from the URL to produce a character string and produces a URL embedding, the URL embedder having been trained using ground truth classifications of URLs as phishing or not phishing. The classifier further includes an HTML parser configured to access the content at the URL and extract HTML tokens from the content page. Also included is a disclosed HTML encoder that produces an HTML encoding of the HTML tokens extracted from the content page, the HTML encoder having been trained on HTML tokens extracted from the content pages of example URLs, each example URL accompanied by a ground truth image captured from the content page accessed via that example URL. Further included is a phishing classifier layer trained on the URL embeddings and HTML encodings of example URLs, each example URL accompanied by a ground truth classification as phishing or not phishing. The phishing classifier layer processes a concatenated input of the URL embedding and the HTML encoding to produce at least one likelihood score that the URL and the content accessed via the URL present a phishing risk.

For some implementations of the disclosed phishing classifier, the input processor accepts the URL for classification in real time. In many implementations of the disclosed phishing classifier, the phishing classifier layer operates to classify the URL and the content accessed via the URL as phishing or not phishing in real time. In some implementations, the disclosed phishing classifier further includes the HTML parser configured to extract from the content page the HTML tokens that belong to a predetermined token vocabulary and to ignore portions of the content page that do not belong to the predetermined token vocabulary, and can further include the HTML parser configured to extract up to 64 of the HTML tokens for production of the HTML encoding.

One implementation of a disclosed computer-implemented method of classifying a URL and content accessed via the URL as phishing or not phishing includes applying a URL embedder that extracts characters in a predetermined character set from the URL to produce a character string and produces a URL embedding, the URL embedder having been trained using ground truth classifications of URLs as phishing or not phishing. The method also includes applying an HTML parser to access the content at the URL and extract HTML tokens from the content page, applying an HTML encoder to produce an HTML encoding of the extracted HTML tokens, and applying a phishing classifier layer to a concatenated input of the URL embedding and the HTML encoding to produce at least one likelihood score that the URL and the content accessed via the URL present a phishing risk. Some implementations also include the HTML parser extracting from the content page the HTML tokens that belong to a predetermined token vocabulary and ignoring portions of the content page that do not belong to the predetermined token vocabulary, and can further include the HTML parser extracting up to 64 of the HTML tokens for production of the HTML encoding. The disclosed method can also include applying the URL embedder, the HTML parser, the HTML encoder, and the phishing classifier layer in real time. In some cases, the phishing classifier layer operates to produce, in real time, at least one likelihood score that the URL and the content accessed via the URL present a phishing risk.
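The character-extraction step that feeds the URL embedder can be sketched as a filter followed by an index lookup. The character set `URL_CHARSET`, the lowercasing, the fixed length of 128, and the use of index 0 for padding are all assumptions made for illustration; the description states only that characters in a predetermined character set are extracted from the URL to produce a character string.

```python
import string

# Hypothetical predetermined character set: lowercase letters, digits, URL punctuation.
URL_CHARSET = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(URL_CHARSET)}  # 0 is reserved for padding
MAX_URL_LENGTH = 128


def extract_characters(url):
    """Keep only the characters that belong to the predetermined character set."""
    return "".join(ch for ch in url.lower() if ch in CHAR_TO_INDEX)


def encode_characters(text, max_length=MAX_URL_LENGTH):
    """Turn the character string into a fixed-length list of integer indices."""
    indices = [CHAR_TO_INDEX[ch] for ch in text[:max_length]]
    return indices + [0] * (max_length - len(indices))


character_string = extract_characters("HTTP://Ex ample.com/<a>")
encoded = encode_characters(character_string)
```

A trained embedding layer would then map each index to a vector; that layer is outside the scope of this sketch.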

One implementation of a disclosed computer-implemented method of training a phishing classifier layer to classify a URL and a content page accessed via the URL as phishing or not phishing includes receiving and processing, for example URLs and content pages accessed via the URLs, URL embeddings of characters extracted from the URLs and HTML encodings of HTML tokens extracted from the content pages, to produce at least one likelihood score that each example URL and the content page accessed via the URL present a phishing risk. The disclosed method also includes calculating a difference between the likelihood score for each example URL and the corresponding ground truth as to whether the example URL and content page are phishing or not phishing, using the differences for the example URLs to train coefficients of the phishing classifier layer, and saving the trained coefficients for use in classifying production URLs, and content pages accessed via the production URLs, as phishing or not phishing.

Some implementations of the disclosed method further include applying a headless browser, accessing the content of the URL and internally rendering the content page, and capturing an image of at least a portion of the content page. The disclosed method can further include backpropagating the differences beyond the phishing classifier layer to an encoding layer used to generate the HTML encoding, and can further include backpropagating the differences beyond the phishing classifier layer to an embedding layer used to generate the URL embedding.
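To make the distinction concrete, the toy model below contrasts backpropagating the difference beyond the classifier layer into an upstream layer with stopping at the classifier layer. It is only a sketch: the upstream encoding or embedding layer is reduced to a single scalar weight `u` with a tanh activation, and the names `train`, `train_encoder`, and `INITIAL_ENCODER_WEIGHT`, as well as the learning rate and epoch count, are assumptions for illustration.

```python
import math
from typing import Sequence, Tuple

INITIAL_ENCODER_WEIGHT = 0.5  # assumed starting value for the upstream layer


def train(data: Sequence[Tuple[float, int]], train_encoder: bool,
          epochs: int = 50, lr: float = 0.1) -> Tuple[float, float, float]:
    """Return (u, w, b) after training; u only changes when train_encoder is True."""
    u = INITIAL_ENCODER_WEIGHT  # upstream encoding/embedding layer (one scalar weight)
    w, b = 0.1, 0.0             # phishing classifier layer coefficients
    for _ in range(epochs):
        for x, label in data:
            encoding = math.tanh(u * x)                        # upstream layer output
            score = 1.0 / (1.0 + math.exp(-(w * encoding + b)))
            diff = score - label                               # score minus ground truth
            grad_w = diff * encoding
            grad_b = diff
            # Chain rule through the classifier layer into the upstream layer.
            grad_u = diff * w * (1.0 - encoding ** 2) * x
            w -= lr * grad_w
            b -= lr * grad_b
            if train_encoder:
                u -= lr * grad_u  # difference propagated beyond the classifier layer
    return u, w, b
```

With `train_encoder=False` the upstream weight keeps its initial value, which corresponds to leaving a pre-trained layer untouched; with `train_encoder=True` the same difference also adjusts the upstream layer.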

Some implementations of the disclosed method further include generating the URL embedding of characters extracted from each example URL, generating the HTML encoding of HTML tokens extracted from the content page accessed via the example URL, and generating an image embedding of the image captured from the rendering.

Some implementations of the disclosed method further include training an HTML encoder-decoder to generate the HTML encoding using, for second example URLs, HTML tokens that are extracted from the content page of the second example URL, encoded, and then decoded to reproduce an image captured from the content page of the second example URL. The disclosed method can further include extracting from the content page HTML tokens that belong to a predetermined token vocabulary and ignoring portions of the content page that do not belong to the predetermined token vocabulary. The method can also include limiting the extraction to a predetermined number of HTML tokens, and can further include generating the HTML encoding of up to 64 HTML tokens.
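The vocabulary-restricted extraction with a token limit can be sketched using Python's standard `html.parser` module. This sketch assumes that an HTML token is a tag name; the description does not fix that definition, and the vocabulary contents and the names `TOKEN_VOCABULARY`, `VocabularyTokenParser`, and `extract_html_tokens` are hypothetical. Only the limit of 64 tokens comes from the text.

```python
from html.parser import HTMLParser
from typing import List, Set

# Hypothetical token vocabulary; the description does not enumerate it.
TOKEN_VOCABULARY = {"html", "head", "body", "form", "input", "a", "script", "img", "div"}
MAX_TOKENS = 64  # limit stated in the description


class VocabularyTokenParser(HTMLParser):
    """Collect start-tag names that belong to the vocabulary, up to a fixed limit."""

    def __init__(self, vocabulary: Set[str], max_tokens: int = MAX_TOKENS) -> None:
        super().__init__()
        self.vocabulary = vocabulary
        self.max_tokens = max_tokens
        self.tokens: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Tags outside the vocabulary, and all text content, are ignored.
        if tag in self.vocabulary and len(self.tokens) < self.max_tokens:
            self.tokens.append(tag)


def extract_html_tokens(html_text: str, vocabulary: Set[str] = TOKEN_VOCABULARY,
                        max_tokens: int = MAX_TOKENS) -> List[str]:
    parser = VocabularyTokenParser(vocabulary, max_tokens)
    parser.feed(html_text)
    parser.close()
    return parser.tokens
```

For a page such as `<html><body><p>hi</p><form><input></form></body></html>`, the `p` element and its text are skipped because they fall outside the assumed vocabulary.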

Other implementations of the methods described in this section can include a tangible, non-transitory computer-readable storage medium storing computer program instructions that, when executed on a processor, cause the processor to perform any of the methods described above. Yet another implementation of the methods described in this section can include a device including memory and one or more processors operable to execute computer instructions, stored in the memory, to perform any of the methods described above.

Any data structures and code described or referenced above are stored, according to many implementations, on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. This includes, but is not limited to, volatile memory, non-volatile memory, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), and DVDs (digital versatile discs or digital video discs), or other media capable of storing computer-readable media now known or later developed.

The foregoing description is presented to enable the making and use of the disclosed technology. Various modifications to the disclosed implementations will be apparent, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the disclosed technology. Thus, the disclosed technology is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein. The scope of the disclosed technology is defined by the appended claims.

Clauses
The following clauses are disclosed.

Clause Set 1
1. A phishing classifier that classifies a URL and a content page accessed via the URL as phishing or non-phishing, comprising:
a URL feature hasher that parses the URL into features and hashes the features to generate URL feature hashes;
a headless browser configured to access the content page of the URL and internally render the content page, extract words from the rendering of the content page, and capture an image of at least a portion of the rendering of the content page;
a natural language encoder that is pre-trained on natural language and that generates a word encoding of the words extracted from the content page;
an image embedder that is pre-trained on images and that generates an image embedding of the image captured from the content page; and
a phishing classifier layer that is trained on URL feature hashes, word encodings, and image embeddings of example URLs, each example URL accompanied by a ground truth classification as phishing or non-phishing, and that processes a concatenated input of the URL feature hashes, word encoding, and image embedding of the URL to generate at least one likelihood score that the URL and the content accessed via the URL present a phishing risk.
2. The phishing classifier of clause 1, wherein the natural language encoder is one of Bidirectional Encoder Representations from Transformers (abbreviated BERT) and Universal Sentence Encoder.
3. The phishing classifier of clause 1, wherein the image embedder is one of a residual neural network (abbreviated ResNet), Inception-v3, and VGG-16.
4. A computer-implemented method for classifying URLs and content pages accessed via the URLs as phishing or non-phishing, comprising:
applying a URL feature hasher to extract features from the URL and hash the features to generate a URL feature hash;
applying a natural language encoder that is pre-trained on a natural language and that generates word encodings for words parsed from a rendering of a content page;
applying an image encoder that is pre-trained on images and that generates an image embedding of an image captured from at least a portion of the rendering;
applying a phishing classifier layer trained on a concatenation of URL feature hashes, word encodings, and image embeddings for example URLs with ground truth classifications as phishing or not phishing;
and processing the URL feature hashes, word encodings, and image embeddings to generate at least one likelihood score that the URL and content accessed via the URL present a phishing risk.
5. The computer-implemented method of clause 4, further comprising:
applying a headless browser;
accessing the content page via the URL and internally rendering the content page;
parsing words from the rendered content page; and
capturing an image of at least a portion of the rendered content page.
6. The computer-implemented method of clause 4, wherein the natural language encoder is one of Bidirectional Encoder Representations from Transformers (abbreviated BERT) and Universal Sentence Encoder.
7. The computer-implemented method of clause 4, wherein the image embedder is one of a residual neural network (abbreviated ResNet), Inception-v3, and VGG-16.
8. A non-transitory computer-readable storage medium storing computer program instructions for classifying URLs and content pages accessed via the URLs as phishing or non-phishing, the instructions, when executed on a processor, implementing a method comprising:
applying a URL feature hasher to extract features from the URL and hash the features to generate a URL feature hash;
applying a natural language encoder that is pre-trained on natural language and that generates word encodings for words parsed from a rendering of a content page;
applying an image encoder that is pre-trained on images and that generates an image embedding of an image captured from at least a portion of the rendering;
applying a phishing classifier layer trained on a concatenation of URL feature hashes, word encodings, and image embeddings for example URLs with ground truth classifications as phishing or non-phishing; and
processing the URL feature hashes, word encodings, and image embeddings to generate at least one likelihood score that the URL and content accessed via the URL present a phishing risk.
9. The non-transitory computer-readable storage medium of clause 8, wherein the method further comprises:
applying a headless browser;
accessing the content page via the URL and internally rendering the content page;
parsing words from the rendered content page; and
capturing an image of at least a portion of the rendered content page.
10. The non-transitory computer-readable storage medium of clause 8, wherein the natural language encoder is one of Bidirectional Encoder Representations from Transformers (abbreviated BERT) and Universal Sentence Encoder.
11. The non-transitory computer-readable storage medium of clause 8, wherein the image embedder is one of a residual neural network (abbreviated ResNet), Inception-v3, and VGG-16.
12. A computer-implemented method for training a phishing classifier layer to classify URLs and content pages accessed via the URLs as phishing or non-phishing, comprising:
for example URLs, receiving and processing URL feature hashes, word encodings of words extracted from the content pages, and image embeddings of images captured from renderings of the content pages, to generate at least one likelihood score that each example URL and the content page accessed via the URL present a phishing risk;
calculating a difference between the likelihood score for each example URL and each corresponding ground truth as to whether the example URL and the content page are phishing or non-phishing;
training the coefficients of a phishing classifier layer using the differences for the example URLs;
and storing the trained coefficients for use in classifying production URLs and content pages accessed via the production URLs as phishing or non-phishing.
13. The computer-implemented method of clause 12, further comprising not backpropagating the difference beyond the phishing classifier layer to an encoding layer used to generate the word encodings.
14. The computer-implemented method of clause 12, further comprising not backpropagating the difference beyond the phishing classifier layer to an embedding layer used to generate the image embedding.
15. The computer-implemented method of clause 12, further including generating a URL feature hash for each example URL, generating word encodings for words extracted from the rendering of the content page, and generating image embeddings for images captured from the rendering.
16. The computer-implemented method of clause 15, further comprising generating the word encodings using a Bidirectional Encoder Representations from Transformers (abbreviated BERT) encoder or a variant of a BERT encoder.
17. The computer-implemented method of clause 15, further comprising generating the image embeddings using one of a residual neural network (abbreviated ResNet), Inception-v3, and VGG-16.
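The URL feature hasher recited in this clause set parses a URL into features and hashes them. One way to picture this is the feature-hashing sketch below, which is offered under explicit assumptions: the feature definitions (scheme, host labels, path tokens), the bucket count of 1024, the use of SHA-256 as the hash, and the names `parse_url_features` and `hash_url_features` are all hypothetical and are not specified by the clauses.

```python
import hashlib
import re
from typing import List
from urllib.parse import urlparse

NUM_BUCKETS = 1024  # assumed size of the hashed feature vector


def parse_url_features(url: str) -> List[str]:
    """Split a URL into simple string features (an assumed feature scheme)."""
    parts = urlparse(url)
    features = [f"scheme={parts.scheme}"]
    features += [f"host={label}" for label in parts.netloc.lower().split(".") if label]
    features += [f"path={tok}" for tok in re.split(r"[/\-_.]+", parts.path) if tok]
    return features


def hash_url_features(url: str, num_buckets: int = NUM_BUCKETS) -> List[float]:
    """Map each feature to a bucket by hashing and count occurrences per bucket."""
    vector = [0.0] * num_buckets
    for feature in parse_url_features(url):
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % num_buckets
        vector[bucket] += 1.0
    return vector
```

Hashing yields a fixed-length vector regardless of how many distinct features appear across URLs, at the cost of occasional collisions between features that land in the same bucket.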

Clause Set 2
1. A phishing classifier that classifies a URL and a content page accessed via the URL as phishing or non-phishing, comprising:
a URL feature hasher that parses the URL into features and hashes the features to generate URL feature hashes;
a headless browser configured to access the content page of the URL and internally render the content page, extract HTML tokens from the content page, and capture an image of at least a portion of the rendering of the content page;
an HTML encoder that is trained on HTML tokens that are extracted from content pages of example URLs, encoded, and then decoded to reproduce images captured from renderings of the content pages, and that generates an HTML encoding of the HTML tokens extracted from the content page;
an image embedder that is pre-trained on images and that generates an image embedding of the image captured from the content page; and
a phishing classifier layer that is trained on URL feature hashes, HTML encodings, and image embeddings of example URLs, each example URL accompanied by a ground truth classification as phishing or non-phishing, and that processes the URL feature hashes, HTML encoding, and image embedding of the URL to generate at least one likelihood score that the URL and the content page accessed via the URL present a phishing risk.
2. The phishing classifier of clause 1, further comprising a headless browser configured to extract HTML tokens from the content page that belong to a predefined token vocabulary, and to ignore portions of the content page that do not belong to the predefined token vocabulary.
3. The phishing classifier of clause 1, further comprising the headless browser configured to extract up to 64 of the HTML tokens for generation of the HTML encoding.
4. A computer-implemented method for classifying URLs and content pages accessed via the URLs as phishing or non-phishing, comprising:
applying a URL feature hasher to extract features from the URL and hash the features to generate a URL feature hash;
applying an HTML encoder trained on a natural language to generate an HTML encoding of HTML tokens extracted from a rendered content page;
applying an image embedder that is pre-trained on images and that generates an image embedding of an image captured from at least a portion of the rendered content page;
applying a phishing classifier layer trained on URL feature hashes, HTML encodings, and image embeddings for example URLs classified with ground truth classifications as phishing or not phishing;
and processing the URL feature hashes, HTML encoding, and image embedding of the URL to generate at least one likelihood score that the URL and the content page accessed via the URL present a phishing risk.
5. The computer-implemented method of clause 4, further comprising:
applying a headless browser;
accessing the content page via the URL and internally rendering the content page;
parsing HTML tokens from the rendered content; and
capturing an image of at least a portion of the rendered content.
6. The computer-implemented method of clause 5, further comprising the headless browser parsing HTML tokens from the content that belong to a pre-defined token vocabulary and ignoring portions of the content that do not belong to the pre-defined token vocabulary.
7. The computer-implemented method of clause 5, further comprising the headless browser parsing up to 64 HTML tokens for generation of the HTML encoding.
8. A non-transitory computer-readable storage medium storing computer program instructions for classifying URLs and content pages accessed via the URLs as phishing or non-phishing, the instructions, when executed on a processor, implementing a method comprising:
applying a URL feature hasher to extract features from the URL and hash the features to generate a URL feature hash;
applying an HTML encoder trained on a natural language to generate an HTML encoding of HTML tokens extracted from a rendered content page;
applying an image embedder to generate an image embedding of the captured image from at least a portion of the rendered content;
and applying a phishing classifier layer that processes the URL feature hashes, HTML encoding, and image embedding of the URL to generate at least one likelihood score that the URL and the content page accessed via the URL present a phishing risk.
9. The non-transitory computer-readable storage medium of clause 8, wherein the instructions, when executed on a processor, implement the method further comprising training the phishing classifier layer on URL feature hashes, HTML encodings, and image embeddings for example URLs classified with ground truth classifications as phishing or non-phishing.
10. The non-transitory computer-readable storage medium of clause 8, wherein the instructions, when executed on a processor, implement the method further comprising training the HTML encoder, which generates the HTML encoding of the HTML tokens extracted from the content page, on HTML tokens that are extracted from content pages of example URLs, encoded, and then decoded to reproduce images captured from renderings of the content pages.
11. The non-transitory computer-readable storage medium of clause 8, wherein the instructions, when executed on a processor, implement the method in which the image embedder is a ResNet embedder, or a variant of a ResNet embedder, that is pre-trained to embed images in an embedding space.
12. The non-transitory computer-readable storage medium of clause 8, wherein the instructions, when executed on a processor, implement the method further comprising:
applying a headless browser;
accessing the content via the URL and internally rendering the content;
parsing HTML tokens from the rendered content; and
capturing an image of at least a portion of the rendered content.
13. The non-transitory computer-readable storage medium of clause 12, implementing a method wherein the instructions, when executed on a processor, further include the headless browser parsing HTML tokens from the content that belong to a pre-defined token vocabulary and ignoring portions of the content that do not belong to the pre-defined token vocabulary.
14. The non-transitory computer-readable storage medium of clause 12, wherein the instructions, when executed on a processor, implement a method further comprising the headless browser parsing up to 64 HTML tokens for generating an HTML encoding.
15. A computer-implemented method for training a phishing classifier layer to classify URLs and content pages accessed via the URLs as phishing or non-phishing, comprising:
for example URLs, receiving and processing URL feature hashes, HTML encodings of HTML tokens extracted from the content pages, and image embeddings of images captured from renderings of the content pages, to generate at least one likelihood score that each example URL and the content accessed via the URL present a phishing risk;
calculating a difference between the likelihood score for each example URL and each corresponding ground truth as to whether the example URL and the content page are phishing or non-phishing;
training the coefficients of a phishing classifier layer using the calculated differences for the example URLs;
and storing the trained coefficients for use in classifying production URLs and content pages accessed via the production URLs as phishing or non-phishing.
16. The computer-implemented method of clause 15, further comprising backpropagating the differences beyond the phishing classifier layer to an encoding layer used to generate the HTML encoding.
17. The computer-implemented method of clause 15, further comprising not backpropagating the difference beyond the phishing classifier layer to an embedding layer used to generate the image embedding.
18. The computer-implemented method of clause 15, further including generating a URL feature hash for each example URL, generating an HTML encoding of HTML tokens extracted from the rendering of the content page, and generating image embeddings of images captured from the rendering.
19. The computer-implemented method of clause 18, further comprising training an HTML encoder-decoder to generate an HTML encoding for a second exemplary URL using HTML tokens extracted from the content page of the second exemplary URL, encoded, and then decoded to generate an image captured from the content page of the second exemplary URL.
20. The computer-implemented method of clause 19, further comprising generating the image embedding using a ResNet embedder or a variant of the ResNet embedder.
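The classifier layer in this clause set receives three inputs per URL: the URL feature hashes, the HTML encoding, and the image embedding. A minimal sketch of combining them and scoring the result is shown below. It assumes the three inputs are joined by simple concatenation (as stated for other implementations in this document) and that the layer is a single sigmoid unit; the function names and the dimension check are hypothetical.

```python
import math
from typing import List, Sequence


def concatenate_inputs(url_feature_hash: Sequence[float],
                       html_encoding: Sequence[float],
                       image_embedding: Sequence[float]) -> List[float]:
    """Join the three per-URL representations into one input vector."""
    return list(url_feature_hash) + list(html_encoding) + list(image_embedding)


def phishing_likelihood(inputs: Sequence[float], weights: Sequence[float],
                        bias: float) -> float:
    """Score the concatenated input with trained coefficients (sigmoid output)."""
    if len(inputs) != len(weights):
        raise ValueError("input and coefficient dimensions must match")
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))
```

The dimension check reflects that the coefficient vector must be sized to the combined length of the three inputs, so changing any one encoder's output size requires retraining the layer.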

Clause Set 3
1. A phishing classifier that classifies URLs and content pages accessed via the URLs as phishing or non-phishing, comprising:
an input processor that accepts a URL for classification;
a URL embedder that generates a URL embedding of the URL;
an HTML parser that extracts HTML tokens from a content page accessed via the URL;
an HTML encoder that generates an HTML encoding from the HTML tokens;
and a phishing classifier layer that operates on the URL embedding and the HTML encoding to classify the URL and the content accessed via the URL as phishing or non-phishing.
2. A URL embedder that extracts characters in a predefined character set from a URL to generate a string, is trained using a ground truth classification of URLs as phishing or non-phishing, and generates a URL embedding;
an HTML parser configured to access the content of the URL and extract HTML tokens from a content page;
an HTML encoder, trained on HTML tokens extracted from the content pages of example URLs, each example URL accompanied by a ground truth image captured from the content page accessed via the example URL, that generates an HTML encoding of the HTML tokens extracted from the content page;
and a phishing classifier layer, trained on URL embeddings and HTML encodings of example URLs, each example URL accompanied by a ground truth classification as phishing or non-phishing, that processes the concatenated input of the URL embedding and the HTML encoding to generate at least one likelihood score that the URL and the content accessed via the URL present a phishing risk.
3. The phishing classifier of clause 1, wherein the input processor accepts URLs for classification in real time.
4. The phishing classifier of clause 1, wherein the phishing classifier layer operates to classify URLs and content accessed via the URLs as phishing or non-phishing in real-time.
5. The phishing classifier of clause 1, further comprising an HTML parser configured to extract HTML tokens from the content page that belong to a predefined token vocabulary, and to ignore portions of the content page that do not belong to the predefined token vocabulary.
6. The phishing classifier of clause 1, further comprising an HTML parser configured to extract up to 64 HTML tokens for generation of the HTML encoding.
7. A computer-implemented method for classifying URLs and content pages accessed via the URLs as phishing or non-phishing, comprising:
applying a URL embedder that extracts characters in a predefined character set from a URL to generate a string, is trained using a ground truth classification of URLs as phishing or non-phishing, and generates a URL embedding;
applying an HTML parser to access the content of the URL and extract HTML tokens from the content page;
applying an HTML encoder to generate an HTML encoding of the extracted HTML tokens;
and applying a phishing classifier layer to a concatenated input of the URL embedding and the HTML encoding to generate at least one likelihood score that the URL and content accessed via the URL present a phishing risk.
8. The computer-implemented method of clause 7, further comprising the HTML parser extracting from the content page HTML tokens that belong to a predefined token vocabulary and ignoring portions of the content page that do not belong to the predefined token vocabulary.
9. The computer-implemented method of clause 7, further comprising the HTML parser extracting up to 64 HTML tokens for generation of the HTML encoding.
10. The computer-implemented method of clause 7, further comprising applying in real-time a URL embedder, an HTML parser, an HTML encoder, and a phishing classifier layer.
11. The computer-implemented method of clause 7, wherein the phishing classifier layer operates to generate at least one likelihood score that the URL and the content accessed via the URL present a phishing risk in real time.
12. A non-transitory computer-readable storage medium impressed with computer program instructions for classifying URLs and content pages accessed via the URLs as phishing or non-phishing, the instructions, when executed on a processor, implementing a method comprising:
applying a URL embedder that extracts characters in a predefined character set from the URL to generate a string and generates a URL embedding;
applying an HTML parser to extract HTML tokens from the content page;
applying an HTML encoder to generate an HTML encoding of the extracted HTML tokens;
and applying a phishing classifier layer to a concatenated input of the URL embedding and the HTML encoding to generate at least one likelihood score that the URL and the content accessed via the URL present a phishing risk.
13. The non-transitory computer-readable storage medium of clause 12, wherein the instructions, when executed on a processor, implement the method further comprising the HTML parser extracting from the content page HTML tokens that belong to a predefined token vocabulary and ignoring portions of the content page that do not belong to the predefined token vocabulary.
14. The non-transitory computer-readable storage medium of clause 12, wherein the instructions, when executed on a processor, implement the method further comprising the HTML parser extracting up to 64 HTML tokens for generation of the HTML encoding.
15. The non-transitory computer-readable storage medium of clause 12, wherein the instructions, when executed on a processor, implement the method further comprising training the HTML encoder on HTML tokens extracted from content pages of example URLs, each example URL accompanied by a ground truth image captured from the content page accessed via the example URL.
16. The non-transitory computer-readable storage medium of clause 12, wherein the instructions, when executed on a processor, implement the method further comprising training the phishing classifier layer on URL embeddings and HTML encodings of example URLs, each example URL accompanied by a ground truth classification as phishing or non-phishing.
17. A computer-implemented method for training a phishing classifier layer to classify URLs and content pages accessed via the URLs as phishing or non-phishing, comprising:
for example URLs and content pages accessed via the URLs:
receiving and processing a URL embedding of characters extracted from the URL and an HTML encoding of HTML tokens extracted from the content page;
generating at least one likelihood score that each example URL and the content page accessed via the URL present a phishing risk;
calculating a difference between the likelihood score for each example URL and each corresponding ground truth that the example URL and the content page are phishing or non-phishing;
training the coefficients of a phishing classifier layer using the differences for the example URLs;
and storing the trained coefficients for use in classifying production URLs and content pages accessed via the production URLs as phishing or non-phishing.
18. The computer-implemented method of clause 17, further comprising:
applying a headless browser;
accessing the content of the URL and internally rendering the content page;
and capturing an image of at least a portion of the content page.
19. The computer-implemented method of clause 17, further comprising backpropagating the differences beyond the phishing classifier layer to an encoding layer used to generate the HTML encoding.
20. The computer-implemented method of clause 17, further comprising backpropagating the differences beyond the phishing classifier layer to an embedding layer used to generate URL embeddings.
21. The computer-implemented method of clause 18, further comprising:
generating a URL embedding of characters extracted from an example URL;
and generating an HTML encoding of the HTML tokens extracted from the content page accessed via the example URL, and generating an image embedding of the image captured from the rendering.
22. The computer-implemented method of clause 18, further comprising training an HTML encoder-decoder to generate HTML encoding for a second exemplary URL using HTML tokens extracted from the content page of the second exemplary URL, encoded, and then decoded to recreate the image captured from the content page of the second exemplary URL.
23. The computer-implemented method of clause 17, further comprising extracting HTML tokens from the content page that belong to a predefined token vocabulary, and ignoring portions of the content page that do not belong to the predefined token vocabulary.
24. The computer-implemented method of clause 23, further comprising limiting the extraction to a predetermined number of HTML tokens.
25. The computer-implemented method of clause 23, further comprising generating an HTML encoding of up to 64 HTML tokens.
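
The token-extraction limitations recited in the clauses above (keeping only HTML tokens that belong to a predefined token vocabulary, ignoring everything else, and stopping at 64 tokens) can be illustrated with a minimal sketch. It is not the disclosed implementation: it assumes that an "HTML token" is a start-tag name, and the vocabulary contents, `VocabularyTokenParser`, and `extract_html_tokens` are hypothetical names chosen for illustration. Only the Python standard library `html.parser` module is used.

```python
from html.parser import HTMLParser

# Hypothetical vocabulary: the clauses do not enumerate the predefined token vocabulary.
TOKEN_VOCABULARY = {"html", "head", "body", "form", "input", "a", "script", "img", "div"}
MAX_TOKENS = 64  # upper bound recited in the clauses


class VocabularyTokenParser(HTMLParser):
    """Collects start-tag names that belong to the vocabulary, up to max_tokens."""

    def __init__(self, vocabulary, max_tokens=MAX_TOKENS):
        super().__init__()
        self.vocabulary = vocabulary
        self.max_tokens = max_tokens
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Tags outside the vocabulary are ignored, as is everything after the cap.
        if tag in self.vocabulary and len(self.tokens) < self.max_tokens:
            self.tokens.append(tag)


def extract_html_tokens(html_text, vocabulary=TOKEN_VOCABULARY, max_tokens=MAX_TOKENS):
    """Return the in-vocabulary tokens of a content page in document order."""
    parser = VocabularyTokenParser(vocabulary, max_tokens)
    parser.feed(html_text)
    parser.close()
    return parser.tokens
```

In this sketch the resulting token list would be the input to an HTML encoder; text content, attributes, and out-of-vocabulary tags never reach it.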

Claims

A phishing classifier for classifying a URL and a content page accessed via the URL as phishing or non-phishing, comprising:
a URL feature hasher that parses the URL into features and hashes the features to generate a URL feature hash;
a headless browser configured to access the content page at the URL and internally render the content page, extract words from the rendering of the content page, and capture an image of at least a portion of the rendering of the content page;
a natural language encoder that is pre-trained on a natural language and that generates word encodings of the words extracted from the content page;
an image embedder that is pre-trained on images and that generates an image embedding of the image captured from the content page;
and a phishing classifier layer that processes a concatenated input of the URL feature hash, the word encodings, and the image embedding of the URL to generate at least one score that indicates a phishing risk of the URL and the content accessed via the URL;
the phishing classifier layer being trained on the URL feature hashes, the word encodings, and the image embeddings of example URLs, each example URL being accompanied by a ground truth classification as either phishing or not phishing.

The phishing classifier of claim 1, wherein the natural language encoder is one of Bidirectional Encoder Representations from Transformers (abbreviated BERT) and Universal Sentence Encoder.

The phishing classifier of claim 1, wherein the image embedder is one of: Residual Neural Network (abbreviated as ResNet), Inception-v3, and VGG-16.

A computer-implemented method for classifying URLs and content pages accessed via the URLs as phishing or non-phishing, comprising:
applying a URL feature hasher to extract features from the URL and hash the features to generate a URL feature hash;
applying a natural language encoder that is pre-trained on a natural language and that generates word encodings for words parsed from the rendering of the content page;
applying an image embedder that is pre-trained on images and that generates an image embedding of an image captured from at least a portion of the rendering;
applying a phishing classifier layer trained on the concatenation of the URL feature hashes, the word encodings, and the image embeddings for example URLs with ground truth classifications as phishing or not phishing;
and processing a concatenated input of the URL feature hash, the word encodings, and the image embeddings to generate at least one score that indicates a phishing risk of the URL and the content accessed via the URL.

The computer-implemented method of claim 4, further comprising:
applying a headless browser;
accessing the content page via the URL and internally rendering the content page;
parsing words from the rendered content page;
and capturing an image of at least a portion of the rendered content page.

The computer-implemented method of claim 4, wherein the natural language encoder is one of Bidirectional Encoder Representations from Transformers (abbreviated BERT) and Universal Sentence Encoder.

The computer-implemented method of claim 4, wherein the image embedder is one of: Residual Neural Network (abbreviated as ResNet), Inception-v3, and VGG-16.

A non-transitory computer-readable storage medium impressed with computer program instructions for classifying URLs and content pages accessed via the URLs as phishing or non-phishing, the instructions, when executed on a processor, implementing a method comprising:
applying a URL feature hasher to extract features from the URL and hash the features to generate a URL feature hash;
applying a natural language encoder that is pre-trained on a natural language and that generates word encodings for words parsed from the rendering of the content page;
applying an image embedder that is pre-trained on images and that generates an image embedding of an image captured from at least a portion of the rendering;
applying a phishing classifier layer trained on the concatenation of the URL feature hashes, the word encodings, and the image embeddings for example URLs with ground truth classifications as phishing or not phishing;
and processing a concatenated input of the URL feature hash, the word encodings, and the image embeddings to generate at least one score that indicates a phishing risk of the URL and the content accessed via the URL.

The non-transitory computer-readable storage medium of claim 8, wherein the method further comprises:
applying a headless browser;
accessing the content page via the URL and internally rendering the content page;
parsing words from the rendered content page;
and capturing an image of at least a portion of the rendered content page.

The non-transitory computer-readable storage medium of claim 8, wherein the natural language encoder is one of Bidirectional Encoder Representations from Transformers (abbreviated BERT) and Universal Sentence Encoder.

The non-transitory computer-readable storage medium of claim 8, wherein the image embedder is one of: Residual Neural Network (abbreviated as ResNet), Inception-v3, and VGG-16.

A computer-implemented method for training a phishing classifier layer to classify URLs and content pages accessed via the URLs as phishing or non-phishing, comprising:
for example URLs:
receiving and processing a concatenated input of URL feature hashes, word encodings of words extracted from the content page, and image embeddings of images captured from a rendering of the content page;
generating at least one score representing a phishing risk for each example URL and the content page accessed via said URL;
calculating a difference between the score for each example URL and each corresponding ground truth that the example URL and the content page are phishing or non-phishing;
training coefficients of the phishing classifier layer using the differences for the example URLs;
and storing the trained coefficients for use in classifying production URLs and content pages accessed via the production URLs as phishing or non-phishing.

The computer-implemented method of claim 12, further comprising not backpropagating the differences beyond the phishing classifier layer to an encoding layer used to generate the word encodings.

The computer-implemented method of claim 12, further comprising not backpropagating the difference beyond the phishing classifier layer to an embedding layer used to generate the image embeddings.

The computer-implemented method of claim 12, further comprising: generating the URL feature hash for each of the example URLs; generating the word encodings of words extracted from the rendering of the content page; and generating the image embeddings of the images captured from the rendering.

The computer-implemented method of claim 15, further comprising generating the word encodings using a Bidirectional Encoder Representations from Transformers (abbreviated BERT) encoder or a variant of a BERT encoder.

The computer-implemented method of claim 15, further comprising generating the image embeddings using one of: Residual Neural Network (abbreviated as ResNet), Inception-v3, and VGG-16.
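
The URL feature hashing and classifier-layer training recited in the claims above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: it assumes that URL features are alphanumeric substrings, that SHA-256 maps each feature to one of `HASH_DIM` buckets, and that the phishing classifier layer is a single logistic unit whose coefficients are updated from the difference between score and ground truth. The names `url_feature_hash`, `score`, `train_classifier_layer`, and `HASH_DIM` are hypothetical, and the word encodings and image embeddings are stand-in lists rather than outputs of BERT or ResNet.

```python
import hashlib
import math
import re

HASH_DIM = 32  # hypothetical width of the URL feature hash


def url_feature_hash(url, dim=HASH_DIM):
    """Parse a URL into features and hash them into a fixed-length count vector."""
    # Assumption: features are the alphanumeric runs of the lower-cased URL.
    features = [f for f in re.split(r"[^A-Za-z0-9]+", url.lower()) if f]
    vector = [0.0] * dim
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        vector[int.from_bytes(digest[:4], "big") % dim] += 1.0
    return vector


def score(weights, bias, inputs):
    """Logistic score in (0, 1); higher values indicate higher phishing risk."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))


def train_classifier_layer(examples, epochs=200, lr=0.1):
    """Train coefficients from (concatenated_input, label) pairs; label 1.0 is phishing."""
    weights, bias = [0.0] * len(examples[0][0]), 0.0
    for _ in range(epochs):
        for inputs, label in examples:
            diff = score(weights, bias, inputs) - label  # score minus ground truth
            # Only the classifier layer is updated; the difference is not
            # propagated back into whatever produced the input vectors.
            weights = [w - lr * diff * x for w, x in zip(weights, inputs)]
            bias -= lr * diff
    return weights, bias
```

A concatenated input in this sketch is simply `url_feature_hash(url) + word_encoding + image_embedding`, with the last two supplied as plain lists of floats. The returned weights and bias correspond to the trained coefficients that would be stored for classifying production URLs.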