
WO2014000537A1 - Phishing website search system and method - Google Patents

Phishing website search system and method

Info

Publication number
WO2014000537A1
Authority
WO
WIPO (PCT)
Prior art keywords
seed
webpage
link
suspicious
phishing
Prior art date
Application number
PCT/CN2013/075950
Other languages
English (en)
French (fr)
Inventor
陈营营
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京奇虎科技有限公司, 奇智软件(北京)有限公司
Priority to US14/411,089 (US20150128272A1)
Publication of WO2014000537A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic: service impersonation, e.g. phishing, pharming or web spoofing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • The present invention relates to the field of network security technologies, and in particular to a phishing website search system and method.
  • Existing phishing website discovery techniques mostly adopt one of two methods: monitoring search engine result pages for specific keywords; or, in cooperation with client software, monitoring and identifying URLs that few Internet users have visited.
  • The present invention has been made to provide a phishing website search system and method that overcome the above problems or at least partially solve or alleviate them.
  • A phishing website search system, comprising: a seed bank establishing unit, adapted to put the original link of a target webpage that hits more than a predetermined threshold number of known phishing websites into a seed bank as a seed link; a seed extractor, adapted to extract a seed link from the seed bank; a seed webpage analyzer, adapted to look up the corresponding seed webpage according to the extracted seed link and to analyze the seed webpage to obtain the suspicious links present in it; a judging unit, adapted to look up the suspicious webpage corresponding to a suspicious link and determine whether the suspicious webpage is a phishing website; and an output interface, adapted to output the corresponding phishing website when the suspicious webpage is a phishing website.
  • A phishing website search method, comprising the steps of: A: putting the original link of a target webpage that hits more than a predetermined threshold number of known phishing websites into a seed bank as a seed link; B: extracting a seed link from the seed bank and collecting the suspicious links appearing in the seed webpage corresponding to the seed link; C: outputting the corresponding phishing website when the suspicious webpage corresponding to a suspicious link is a phishing website.
  • A computer program comprising computer readable code which, when run on a server, causes the server to perform the phishing website search method according to any one of claims 6-11.
  • A computer readable medium in which the computer program according to claim 12 is stored.
  • Because phishing websites are frequently spread through advertisements and hidden-link SEO, the phishing website search system and method of the present invention use a blacklist library of known phishing websites to obtain seed webpages and find new phishing websites by periodically checking those seed webpages, which greatly improves the speed at which phishing websites are found and reduces the security risk for Internet users.
  • FIG. 1 is a block diagram of the module structure of the phishing website search system according to Embodiment 1 of the present invention.
  • FIG. 2 is a block diagram of the module structure of the seed bank establishing unit.
  • FIG. 3 is a block diagram of the module structure of the phishing website search system according to Embodiment 2 of the present invention.
  • FIG. 4 is a flowchart of the phishing website search method according to Embodiment 3 of the present invention.
  • FIG. 5 is a flowchart of step A.
  • FIG. 6 is a flowchart of step B.
  • FIG. 7 is a flowchart of step C.
  • FIG. 8 schematically shows a block diagram of a server for carrying out the method according to the invention.
  • FIG. 9 schematically shows a storage unit for holding or carrying program code implementing the method according to the invention.
  • FIG. 1 is a block diagram of the module structure of the phishing website search system according to Embodiment 1 of the present invention.
  • The system includes: a seed bank establishing unit 100, a seed bank 200, a seed extractor 300, a seed webpage analyzer 400, a judging unit 500 and an output interface 600.
  • The seed bank establishing unit 100 is adapted to put the original link of a target webpage that hits more than a predetermined threshold number of known phishing websites into the seed bank as a seed link.
  • The seed bank establishing unit 100 further includes: a blacklist module 110 and a selection module 120.
  • The blacklist module 110 is adapted to establish a blacklist library from known phishing websites.
  • To keep the search accurate, the blacklist library should contain as many of the known phishing websites as possible, and it is continuously updated in actual use as phishing websites are added.
  • The selection module 120 is adapted to put the original link of the target webpage into the seed bank as a seed link when the number of known phishing websites in the blacklist library hit by the target webpage is greater than a predetermined threshold. That is, all the links in the target webpage form a first set, the domain names of the known phishing websites in the blacklist library form a second set, and the intersection of the two sets is calculated. The number of elements in the intersection is taken as the number of known phishing websites in the blacklist library hit by the target webpage and is compared with the predetermined threshold. If it is greater than the threshold, the original link of the target webpage is put into the seed bank as a seed link; otherwise, the target webpage is discarded.
  • The number of hits is computed as $N = |M|$ with $M = W \cap D$, where W represents the set of links contained in the target webpage; D represents the set of domain names of known phishing websites in the blacklist library; M represents the intersection of W and D; |M| represents the number of elements in M; and N represents the number of known phishing websites in the blacklist library hit by the target webpage.
  • The predetermined threshold may be set and adjusted according to actual use; it may generally be set to 3, 4 or 5, and is preferably set to 3 in this embodiment.
  • The seed bank 200 is adapted to store the seed links.
  • The seed bank 200 holds at least one seed link, and the number of seed links in the seed bank 200 should be increased continuously in actual use to improve the efficiency of finding phishing websites.
  • The seed extractor 300 is adapted to extract a seed link from the seed bank 200.
  • The seed webpage analyzer 400 is adapted to look up the corresponding seed webpage according to the extracted seed link and to analyze the seed webpage to obtain the suspicious links present in it.
  • A suspicious link is typically a new, unknown link that appears on the seed webpage.
  • The judging unit 500 is adapted to look up the suspicious webpage corresponding to the suspicious link and determine whether the suspicious webpage is a phishing website. The determination relies on well-known discrimination techniques, which are not the focus of the present invention and are not described further here.
  • The output interface 600 is adapted to output the corresponding phishing website when the suspicious webpage is a phishing website.
  • The output interface 600 is further adapted to update the blacklist library after outputting the corresponding phishing website, that is, to put the newly found phishing website into the blacklist library.
  • FIG. 3 is a block diagram of the module structure of the phishing website search system according to Embodiment 2 of the present invention.
  • The system in this embodiment is basically the same as the system in Embodiment 1; the only difference is described below.
  • The system of this embodiment further includes: a webpage crawler 000.
  • The webpage crawler 000 is adapted to crawl the target webpage for use by the seed bank establishing unit 100.
  • The webpage crawler 000 can generally be a web spider, a web crawler, a search robot, or a web crawling script.
  • FIG. 4 is a flowchart of the phishing website search method according to Embodiment 3 of the present invention. As shown in FIG. 4, the method includes the following steps:
  • A: Put the original link of a target webpage that hits more than a predetermined threshold number of known phishing websites into the seed bank as a seed link.
  • FIG. 5 is a flowchart of step A. As shown in FIG. 5, step A further includes the following steps: A1: Establish a blacklist library from known phishing websites.
  • A2: Crawl a target webpage and determine, according to the blacklist library, whether the number of known phishing websites hit by the target webpage is greater than a predetermined threshold. If so, put the original link of the target webpage into the seed bank as a seed link and then perform step A3; otherwise, perform step A3 directly.
  • A3: Determine whether the number of seed links in the seed bank is greater than a predetermined number of seeds. If so, perform step B; otherwise, return to step A2.
  • FIG. 6 is a flowchart of step B. As shown in FIG. 6, step B further includes the following steps: B1: Extract a seed link from the seed bank and download the seed webpage corresponding to the seed link.
  • B2: Analyze the seed webpage to obtain the suspicious links appearing in it.
  • C: When the suspicious webpage corresponding to a suspicious link is a phishing website, output the corresponding phishing website.
  • FIG. 7 is a flowchart of step C. As shown in FIG. 7, step C further includes the following steps:
  • C1: Determine whether the suspicious webpage is a phishing website. If so, output the corresponding phishing website, update the blacklist library, and then perform step C2; otherwise, perform step C2 directly.
  • C2: Determine whether all the seed links in the seed bank have been extracted. If so, the process ends; otherwise, return to step B.
  • Because phishing websites are often spread through advertisements and hidden-link SEO (Search Engine Optimization), the phishing website search system and method of this embodiment use the blacklist library of known phishing websites to obtain seed webpages and find new phishing websites by periodically checking those seed webpages, which greatly improves the speed at which phishing websites are found and reduces the security risk for Internet users.
  • The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof.
  • A microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components of the phishing website search system according to embodiments of the present invention.
  • The invention can also be implemented as a device or apparatus program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein.
  • A program implementing the present invention may be stored on a computer readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
  • FIG. 8 illustrates a server, such as an application server, that can implement the phishing website search method according to the present invention.
  • The server conventionally includes a processor 810 and a computer program product or computer readable medium in the form of a memory 820.
  • The memory 820 can be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM.
  • The memory 820 has a storage space 830 for program code 831 for performing any of the method steps described above.
  • For example, the storage space 830 for program code may include individual program codes 831, each implementing one of the steps of the above method.
  • The program code can be read from or written to one or more computer program products.
  • Such computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks.
  • Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 9.
  • The storage unit may have storage sections, storage spaces and the like arranged similarly to the memory 820 in the server of FIG. 8.
  • The program code can, for example, be compressed in an appropriate form.
  • Typically, the storage unit includes computer readable code 831', i.e., code that can be read by a processor such as 810 and that, when run by the server, causes the server to perform the steps of the methods described above.
  • “One embodiment”, “an embodiment” or “one or more embodiments” as used herein means that the particular features, structures, or characteristics described in connection with the embodiments are included in at least one embodiment of the invention.
  • The phrase “in one embodiment” herein does not necessarily always refer to the same embodiment.
  • Any reference signs placed between parentheses in the claims shall not be construed as limiting the claims.
  • The word “comprising” does not exclude the presence of elements or steps not listed in a claim.
  • The word “a” or “an” preceding an element does not exclude the presence of a plurality of such elements.
  • The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware.
  • The use of the words first, second, and third does not indicate any order. These words can be interpreted as names.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

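The present invention discloses a phishing website search system and method and relates to the field of network security. The system comprises: a seed bank establishing unit, adapted to put the original link of a target webpage that hits more than a predetermined threshold number of known phishing websites into a seed bank as a seed link; a seed extractor, adapted to extract a seed link from the seed bank; a seed webpage analyzer, adapted to look up the corresponding seed webpage according to the extracted seed link and to analyze the seed webpage to obtain the suspicious links present in it; a judging unit, adapted to look up the suspicious webpage corresponding to a suspicious link and determine whether the suspicious webpage is a phishing website; and an output interface, adapted to output the corresponding phishing website when the suspicious webpage is a phishing website. The system and method greatly improve the speed at which phishing websites are found and reduce the security risk for Internet users.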

Description

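Phishing website search system and method
Technical Field
The present invention relates to the field of network security technologies, and in particular to a phishing website search system and method.
Background
With the development of the Internet, the number of Internet users grows year by year. In addition to the traditional threats of trojans and viruses, the number of phishing websites has risen sharply over the past two years. More than one hundred thousand new sites and billions of new URLs appear on the Internet every day. Therefore, besides identifying phishing websites accurately, the speed at which they are discovered is becoming increasingly important. Many Internet companies are working on this problem: how to discover a phishing website before it spreads widely, or even before it starts to spread.
Existing phishing website discovery techniques mostly adopt one of two methods: monitoring search engine result pages for specific keywords; or, in cooperation with client software, monitoring and identifying URLs that few Internet users have visited.
Both methods lag behind the spread of the phishing website. The second method in particular can discover a URL only after Internet users have visited it, and by then the first users to visit the phishing website may already have been defrauded.
Summary of the Invention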
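In view of the above problems, the present invention is proposed to provide a phishing website search system and method that overcome the above problems or at least partially solve or alleviate them.
According to one aspect of the present invention, a phishing website search system is provided, comprising: a seed bank establishing unit, adapted to put the original link of a target webpage that hits more than a predetermined threshold number of known phishing websites into a seed bank as a seed link; a seed extractor, adapted to extract a seed link from the seed bank; a seed webpage analyzer, adapted to look up the corresponding seed webpage according to the extracted seed link and to analyze the seed webpage to obtain the suspicious links present in it; a judging unit, adapted to look up the suspicious webpage corresponding to a suspicious link and determine whether the suspicious webpage is a phishing website; and an output interface, adapted to output the corresponding phishing website when the suspicious webpage is a phishing website.
According to another aspect of the present invention, a phishing website search method is provided, comprising the steps of: A: putting the original link of a target webpage that hits more than a predetermined threshold number of known phishing websites into a seed bank as a seed link; B: extracting a seed link from the seed bank and collecting the suspicious links appearing in the seed webpage corresponding to the seed link; C: outputting the corresponding phishing website when the suspicious webpage corresponding to a suspicious link is a phishing website.
According to yet another aspect of the present invention, a computer program is provided, comprising computer readable code which, when run on a server, causes the server to perform the phishing website search method according to any one of claims 6-11.
According to a further aspect of the present invention, a computer readable medium is provided, in which the computer program according to claim 12 is stored.
The beneficial effects of the present invention are as follows:
Because phishing websites are often spread through advertisements and hidden-link SEO, the phishing website search system and method of the present invention use a blacklist library of known phishing websites to obtain seed webpages and find new phishing websites by periodically checking those seed webpages, which greatly improves the speed at which phishing websites are found and reduces the security risk for Internet users.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention are easier to understand, specific embodiments of the invention are set out below.
Brief Description of the Drawings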
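Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 is a block diagram of the module structure of the phishing website search system according to Embodiment 1 of the present invention;
FIG. 2 is a block diagram of the module structure of the seed bank establishing unit;
FIG. 3 is a block diagram of the module structure of the phishing website search system according to Embodiment 2 of the present invention;
FIG. 4 is a flowchart of the phishing website search method according to Embodiment 3 of the present invention;
FIG. 5 is a flowchart of step A;
FIG. 6 is a flowchart of step B;
FIG. 7 is a flowchart of step C;
FIG. 8 schematically shows a block diagram of a server for carrying out the method according to the invention; and
FIG. 9 schematically shows a storage unit for holding or carrying program code implementing the method according to the invention.
Detailed Description of the Embodiments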
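The present invention is described further below with reference to the drawings and specific embodiments.
FIG. 1 is a block diagram of the module structure of the phishing website search system according to Embodiment 1 of the present invention. As shown in FIG. 1, the system includes: a seed bank establishing unit 100, a seed bank 200, a seed extractor 300, a seed webpage analyzer 400, a judging unit 500 and an output interface 600.
The seed bank establishing unit 100 is adapted to put the original link of a target webpage that hits more than a predetermined threshold number of known phishing websites into the seed bank as a seed link.
FIG. 2 is a block diagram of the module structure of the seed bank establishing unit. As shown in FIG. 2, the seed bank establishing unit 100 further includes: a blacklist module 110 and a selection module 120.
The blacklist module 110 is adapted to establish a blacklist library from known phishing websites. To keep the search accurate, the blacklist library should contain as many of the known phishing websites as possible, and it is continuously updated in actual use as phishing websites are added.
The selection module 120 is adapted to put the original link of the target webpage into the seed bank as a seed link when the number of known phishing websites in the blacklist library hit by the target webpage is greater than a predetermined threshold. That is, all the links in the target webpage form a first set, the domain names of the known phishing websites in the blacklist library form a second set, and the intersection of the two sets is calculated. The number of elements in the intersection is taken as the number of known phishing websites in the blacklist library hit by the target webpage and is compared with the predetermined threshold. If it is greater than the threshold, the original link of the target webpage is put into the seed bank as a seed link; otherwise, the target webpage is discarded.
The number of known phishing websites in the blacklist library hit by the target webpage is calculated as follows: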
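$N = |M|$
$M = W \cap D$
Here, W denotes the set of links contained in the target webpage; D denotes the set of domain names of known phishing websites in the blacklist library; M denotes the intersection of W and D; |M| denotes the number of elements in M; and N denotes the number of known phishing websites in the blacklist library hit by the target webpage.
The predetermined threshold can be set and adjusted according to actual use; it can generally be set to 3, 4 or 5, and is preferably set to 3 in this embodiment.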
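The hit count and threshold test described above can be sketched in Python. This is a minimal illustration rather than the patented implementation: the function names `link_domain`, `hit_count` and `is_seed` are hypothetical, and reducing each link in W to its host name before intersecting with the domain set D is an assumption, because the text does not say how links and domain names are compared.

```python
from urllib.parse import urlsplit


def link_domain(link: str) -> str:
    """Return the lower-cased host name of a link, or "" if it has none."""
    # urlsplit only recognises a host when the link contains "//",
    # so scheme-less links such as "evil.example/login" get a "//" prefix.
    try:
        host = urlsplit(link if "//" in link else "//" + link).hostname
    except ValueError:  # malformed link, e.g. a broken IPv6 literal
        return ""
    return (host or "").lower()


def hit_count(page_links, blacklist_domains) -> int:
    """N = |M| with M = W ∩ D, comparing links by host name (assumption)."""
    w = {link_domain(link) for link in page_links} - {""}
    d = {domain.lower() for domain in blacklist_domains}
    return len(w & d)


def is_seed(page_links, blacklist_domains, threshold: int = 3) -> bool:
    """A page qualifies as a seed when its hit count exceeds the threshold."""
    return hit_count(page_links, blacklist_domains) > threshold
```

With the preferred threshold of 3, a page must link to at least four distinct blacklisted domains before its original link is placed in the seed bank.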
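The seed bank 200 is adapted to store the seed links. The seed bank 200 holds at least one seed link, and the number of seed links in the seed bank 200 should be increased continuously in actual use to improve the efficiency of finding phishing websites.
The seed extractor 300 is adapted to extract a seed link from the seed bank 200.
The seed webpage analyzer 400 is adapted to look up the corresponding seed webpage according to the extracted seed link and to analyze the seed webpage to obtain the suspicious links present in it. A suspicious link is generally a new, unknown link that appears on the seed webpage.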
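One way the seed webpage analyzer could collect "new, unknown" links is sketched below using only the Python standard library. The names `suspicious_links` and `known_domains` are hypothetical, and treating a link as unknown when its host is absent from a caller-supplied set of already-classified domains is an assumption; the text does not define how novelty is decided.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class _AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def suspicious_links(seed_url, html, known_domains):
    """Return the first absolute link seen for each host not in known_domains."""
    collector = _AnchorCollector()
    collector.feed(html)
    found, seen_hosts = [], set()
    for href in collector.hrefs:
        absolute = urljoin(seed_url, href)  # resolve relative links
        parts = urlsplit(absolute)
        host = (parts.hostname or "").lower()
        if parts.scheme not in ("http", "https") or not host:
            continue  # skip mailto:, javascript: and similar
        if host in known_domains or host in seen_hosts:
            continue  # already classified, or already reported for this page
        seen_hosts.add(host)
        found.append(absolute)
    return found
```

In this sketch the caller would normally include the seed page's own host and the blacklisted domains in `known_domains`, so that only links to previously unseen hosts are passed on for judgment.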
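The judging unit 500 is adapted to look up the suspicious webpage corresponding to the suspicious link and determine whether the suspicious webpage is a phishing website. The determination relies on well-known discrimination techniques, which are not the focus of the present invention and are not described further here.
The output interface 600 is adapted to output the corresponding phishing website when the suspicious webpage is a phishing website. The output interface 600 is further adapted to update the blacklist library after outputting the corresponding phishing website, that is, to put the newly found phishing website into the blacklist library.
FIG. 3 is a block diagram of the module structure of the phishing website search system according to Embodiment 2 of the present invention. As shown in FIG. 3, the system of this embodiment is basically the same as the system of Embodiment 1; the only difference is that the system of this embodiment further includes a webpage crawler 000. The webpage crawler 000 is adapted to crawl the target webpage for use by the seed bank establishing unit 100. The webpage crawler 000 can generally be a web spider, a web crawler, a search robot, a web crawling script, or the like.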
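FIG. 4 is a flowchart of the phishing website search method according to Embodiment 3 of the present invention. As shown in FIG. 4, the method includes the steps of:
A: Put the original link of a target webpage that hits more than a predetermined threshold number of known phishing websites into the seed bank as a seed link.
FIG. 5 is a flowchart of step A. As shown in FIG. 5, step A further includes the steps of:
A1: Establish a blacklist library from known phishing websites.
A2: Crawl a target webpage and determine, according to the blacklist library, whether the number of known phishing websites hit by the target webpage is greater than a predetermined threshold. If so, put the original link of the target webpage into the seed bank as a seed link and then perform step A3; otherwise, perform step A3 directly.
A3: Determine whether the number of seed links in the seed bank is greater than a predetermined number of seeds. If so, perform step B; otherwise, return to step A2.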
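Steps A2 and A3 form a loop that keeps crawling until the seed bank holds more than a predetermined number of seeds. The sketch below is illustrative only: `build_seed_bank` and its parameters are hypothetical names, the crawler is replaced by an iterable of already-fetched `(original_link, links_on_page)` pairs, and links are assumed to be well-formed absolute URLs.

```python
from urllib.parse import urlsplit


def build_seed_bank(target_pages, blacklist_domains, threshold=3, min_seeds=1):
    """Collect seed links until the bank holds more than min_seeds entries.

    target_pages: iterable of (original_link, links_on_page) pairs, standing
    in for the output of a crawler (an assumption of this sketch).
    """
    blacklist = {domain.lower() for domain in blacklist_domains}  # step A1
    seed_bank = []
    for original_link, page_links in target_pages:  # step A2: next crawled page
        hosts = {(urlsplit(link).hostname or "").lower() for link in page_links}
        hits = len(hosts & blacklist)
        if hits > threshold:
            seed_bank.append(original_link)
        if len(seed_bank) > min_seeds:  # step A3: enough seeds, go on to step B
            break
    return seed_bank
```

If the supply of target pages runs out before the seed count is exceeded, this sketch simply returns whatever seeds were found; the text does not describe that case.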
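B: Extract a seed link from the seed bank and collect the suspicious links appearing in the seed webpage corresponding to the seed link.
FIG. 6 is a flowchart of step B. As shown in FIG. 6, step B further includes the steps of:
B1: Extract a seed link from the seed bank and download the seed webpage corresponding to the seed link;
B2: Analyze the seed webpage to obtain the suspicious links appearing in it.
C: When the suspicious webpage corresponding to a suspicious link is a phishing website, output the corresponding phishing website.
FIG. 7 is a flowchart of step C. As shown in FIG. 7, step C further includes the steps of:
C1: Determine whether the suspicious webpage is a phishing website. If so, output the corresponding phishing website, update the blacklist library, and then perform step C2; otherwise, perform step C2 directly.
C2: Determine whether all the seed links in the seed bank have been extracted. If so, end the process; otherwise, return to step B.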
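The loop formed by steps B, C1 and C2 can be sketched as follows. All names here are hypothetical. Because the text leaves the phishing judgment to well-known techniques, the classifier is passed in as a callable `is_phishing`, and downloading plus analysis of a seed webpage is abstracted as a callable `fetch_suspicious_links`; both are assumptions of this sketch.

```python
def find_phishing_sites(seed_bank, fetch_suspicious_links, is_phishing, blacklist):
    """Run steps B and C over every seed link and return newly found sites.

    fetch_suspicious_links(seed_link) -> iterable of suspicious links (B1 + B2).
    is_phishing(link) -> bool, any existing phishing classifier (C1).
    blacklist: mutable set that is updated with each newly found site.
    """
    found = []
    for seed_link in list(seed_bank):  # C2: stop once every seed is extracted
        for link in fetch_suspicious_links(seed_link):
            if link in blacklist:
                continue  # already known, nothing new to report
            if is_phishing(link):  # C1: judge the suspicious page
                found.append(link)  # output the phishing website
                blacklist.add(link)  # update the blacklist library
    return found
```

Passing the classifier and the page analysis in as callables keeps the loop independent of any particular detection technique, which matches the text's statement that the judgment method is not the focus of the invention.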
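Because phishing websites are often spread through advertisements and hidden-link SEO (Search Engine Optimization), the phishing website search system and method of the embodiments of the present invention use the blacklist library of known phishing websites to obtain seed webpages and find new phishing websites by periodically checking those seed webpages, which greatly improves the speed at which phishing websites are found and reduces the security risk for Internet users.
The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the phishing website search system according to embodiments of the present invention. The invention can also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing some or all of the methods described herein. Such a program implementing the invention may be stored on a computer readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, FIG. 8 shows a server, such as an application server, that can implement the phishing website search method according to the present invention. The server conventionally includes a processor 810 and a computer program product or computer readable medium in the form of a memory 820. The memory 820 can be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM. The memory 820 has a storage space 830 for program code 831 for performing any of the method steps described above. For example, the storage space 830 for program code may include individual program codes 831, each implementing one of the steps of the above method. The program code can be read from or written to one or more computer program products. These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 9. The storage unit may have storage sections, storage spaces and the like arranged similarly to the memory 820 in the server of FIG. 8. The program code can, for example, be compressed in an appropriate form. Typically, the storage unit includes computer readable code 831', i.e., code that can be read by a processor such as 810 and that, when run by a server, causes the server to perform the steps of the methods described above.
"One embodiment", "an embodiment" or "one or more embodiments" as used herein means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Note also that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment.
Numerous specific details are set forth in the description provided here. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order. These words can be interpreted as names.
Furthermore, it should be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made here is illustrative rather than restrictive, and the scope of the invention is defined by the appended claims.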

Claims

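1. A phishing website search system, comprising:
a seed bank establishing unit, adapted to put the original link of a target webpage that hits more than a predetermined threshold number of known phishing websites into a seed bank as a seed link;
a seed extractor, adapted to extract a seed link from the seed bank;
a seed webpage analyzer, adapted to look up the corresponding seed webpage according to the extracted seed link and to analyze the seed webpage to obtain the suspicious links present in it;
a judging unit, adapted to look up the suspicious webpage corresponding to a suspicious link and determine whether the suspicious webpage is a phishing website;
an output interface, adapted to output the corresponding phishing website when the suspicious webpage is a phishing website.
2. The system according to claim 1, wherein the system further comprises a webpage crawler; the webpage crawler is adapted to crawl the target webpage.
3. The system according to claim 1 or 2, wherein the seed bank establishing unit comprises:
a blacklist module, adapted to establish a blacklist library from known phishing websites;
a selection module, adapted to put the original link of the target webpage into the seed bank as a seed link when the number of known phishing websites in the blacklist library hit by the target webpage is greater than a predetermined threshold.
4. The system according to claim 3, wherein the output interface is further adapted to update the blacklist library after outputting the corresponding phishing website.
5. The system according to claim 3, wherein the number of known phishing websites in the blacklist library hit by the target webpage is calculated as follows:
$N = |M|$
$M = W \cap D$
where W denotes the set of links contained in the target webpage; D denotes the set of domain names of known phishing websites in the blacklist library; M denotes the intersection of W and D; |M| denotes the number of elements in M; and N denotes the number of known phishing websites in the blacklist library hit by the target webpage.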
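6. A phishing website search method, comprising the steps of:
A: putting the original link of a target webpage that hits more than a predetermined threshold number of known phishing websites into a seed bank as a seed link;
B: extracting a seed link from the seed bank and collecting the suspicious links appearing in the seed webpage corresponding to the seed link;
C: outputting the corresponding phishing website when the suspicious webpage corresponding to a suspicious link is a phishing website.
7. The method according to claim 6, wherein the step of putting the original link of a target webpage that hits more than a predetermined threshold number of known phishing websites into a seed bank as a seed link further comprises:
A2: crawling a target webpage and determining whether the number of known phishing websites hit by the target webpage is greater than a predetermined threshold; if so, putting the original link of the target webpage into the seed bank as a seed link and then performing step A3; otherwise, performing step A3 directly;
A3: determining whether the number of seed links in the seed bank is greater than a predetermined number of seeds; if so, performing step B; otherwise, returning to step A2.
8. The method according to claim 7, further comprising, before step A2, step A1: establishing a blacklist library from known phishing websites;
and wherein, in step A2, determining whether the number of known phishing websites hit by the target webpage is greater than a predetermined threshold is further: determining whether the number of known phishing websites in the blacklist library hit by the target webpage is greater than the predetermined threshold.
9. The method according to claim 8, wherein the number of known phishing websites in the blacklist library hit by the target webpage is calculated as follows:
$N = |M|$
$M = W \cap D$
where W denotes the set of links contained in the target webpage; D denotes the set of domain names of known phishing websites in the blacklist library; M denotes the intersection of W and D; |M| denotes the number of elements in M; and N denotes the number of known phishing websites in the blacklist library hit by the target webpage.
10. The method according to claim 8, wherein outputting the corresponding phishing website when the suspicious webpage corresponding to the suspicious link is a phishing website further comprises the steps of:
C1: determining whether the suspicious webpage is a phishing website; if so, outputting the corresponding phishing website, updating the blacklist library, and then performing step C2; otherwise, performing step C2 directly;
C2: determining whether all the seed links in the seed bank have been extracted; if so, ending the process; otherwise, returning to step B.
11. The method according to claim 6, wherein extracting a seed link from the seed bank and collecting the suspicious links appearing in the seed webpage corresponding to the seed link further comprises the steps of:
B1: extracting a seed link from the seed bank and downloading the seed webpage corresponding to the seed link;
B2: analyzing the seed webpage to obtain the suspicious links appearing in it.
12. A computer program, comprising computer readable code which, when run on a server, causes the server to perform the phishing website search method according to any one of claims 6-11.
13. A computer readable medium, in which the computer program according to claim 12 is stored.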
PCT/CN2013/075950 2012-06-28 2013-05-21 Phishing website search system and method WO2014000537A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/411,089 US20150128272A1 (en) 2012-06-28 2013-05-21 System and method for finding phishing website

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210220826.X 2012-06-28
CN201210220826.XA CN102799814B (zh) 2012-06-28 2012-06-28 Phishing website search system and method

Publications (1)

Publication Number Publication Date
WO2014000537A1 (zh) 2014-01-03

Family

ID=47198920

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/075950 WO2014000537A1 (zh) 2012-06-28 2013-05-21 Phishing website search system and method

Country Status (3)

Country Link
US (1) US20150128272A1 (zh)
CN (1) CN102799814B (zh)
WO (1) WO2014000537A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11252174B2 (en) * 2016-12-16 2022-02-15 Worldpay, Llc Systems and methods for detecting security risks in network pages

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799814B (zh) * 2012-06-28 2015-11-25 北京奇虎科技有限公司 Phishing website search system and method
CN103020188A (zh) * 2012-11-30 2013-04-03 北京网秦天下科技有限公司 Multi-platform application search method and server
CN103152355A (zh) * 2013-03-19 2013-06-12 北京奇虎科技有限公司 Method, system and client device for warning about dangerous websites
CN104978523A (zh) * 2014-11-06 2015-10-14 哈尔滨安天科技股份有限公司 Malicious sample capture method and system based on recognition of trending Internet words
US9473531B2 (en) * 2014-11-17 2016-10-18 International Business Machines Corporation Endpoint traffic profiling for early detection of malware spread
EP3125147B1 (en) * 2015-07-27 2020-06-03 Swisscom AG System and method for identifying a phishing website
CN105577676A (zh) * 2015-12-30 2016-05-11 广东欧珀移动通信有限公司 Method and device for identifying phishing websites
WO2018085732A1 (en) 2016-11-03 2018-05-11 RiskIQ, Inc. Techniques for detecting malicious behavior using an accomplice model
CN107743128A (zh) * 2017-10-31 2018-02-27 哈尔滨工业大学(威海) Method for mining illegal websites based on homepage-associated domain names and shared service IPs
CN109756467B (zh) * 2017-11-07 2021-04-27 中国移动通信集团广东有限公司 Method and device for identifying phishing websites
CN107977575B (zh) * 2017-12-20 2021-03-09 北京关键科技股份有限公司 Code composition analysis system and method based on a private cloud platform
CN109246074A (zh) * 2018-07-23 2019-01-18 北京奇虎科技有限公司 Method, device, server and readable storage medium for identifying suspicious domain names
US10785260B2 (en) * 2018-08-09 2020-09-22 Morgan Stanley Services Group Inc. Optically analyzing domain names
CN109218332B (zh) * 2018-10-19 2020-11-13 杭州安恒信息技术股份有限公司 Phishing website monitoring method based on embedded tracking points
US11443004B1 (en) 2019-01-02 2022-09-13 Foundrydc, Llc Data extraction and optimization using artificial intelligence models
CN110909291A (zh) * 2019-12-31 2020-03-24 徐州八方网络科技有限公司 Website information collection and publishing platform system
CN112968875B (zh) * 2021-01-29 2022-11-01 上海安恒时代信息技术有限公司 Network relationship construction method and system
US12105761B2 (en) * 2022-11-10 2024-10-01 Palo Psifiakes Technologie Epe System and method for web crawling and content summarization

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080244715A1 (en) * 2007-03-27 2008-10-02 Tim Pedone Method and apparatus for detecting and reporting phishing attempts
CN101820366A (zh) * 2010-01-27 2010-09-01 南京邮电大学 Prefetch-based phishing webpage detection method
CN102523210A (zh) * 2011-12-06 2012-06-27 中国科学院计算机网络信息中心 Phishing website detection method and device
CN102799814A (zh) * 2012-06-28 2012-11-28 北京奇虎科技有限公司 Phishing website search system and method

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060095955A1 (en) * 2004-11-01 2006-05-04 Vong Jeffrey C V Jurisdiction-wide anti-phishing network service
US7630987B1 (en) * 2004-11-24 2009-12-08 Bank Of America Corporation System and method for detecting phishers by analyzing website referrals
US8726369B1 (en) * 2005-08-11 2014-05-13 Aaron T. Emigh Trusted path, authentication and data security
US8839418B2 (en) * 2006-01-18 2014-09-16 Microsoft Corporation Finding phishing sites
US7854001B1 (en) * 2007-06-29 2010-12-14 Trend Micro Incorporated Aggregation-based phishing site detection
AU2011201043A1 (en) * 2010-03-11 2011-09-29 Mailguard Pty Ltd Web site analysis system and method
US8521667B2 (en) * 2010-12-15 2013-08-27 Microsoft Corporation Detection and categorization of malicious URLs
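CN102279875B (zh) * 2011-06-24 2013-04-24 华为数字技术(成都)有限公司 Method and device for identifying phishing websites
CN102299918A (zh) * 2011-07-08 2011-12-28 盛大计算机(上海)有限公司 Network transaction security system and method
CN102375952B (zh) * 2011-10-31 2014-12-24 北龙中网(北京)科技有限责任公司 Method for displaying in search engine results whether a website is trusted and verified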

Also Published As

Publication number Publication date
CN102799814B (zh) 2015-11-25
US20150128272A1 (en) 2015-05-07
CN102799814A (zh) 2012-11-28

Similar Documents

Publication Publication Date Title
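WO2014000537A1 (zh) Phishing website search system and method
CN105184159B (zh) Method and device for identifying webpage tampering
EP2998884B1 (en) Security information management system and security information management method
US9544316B2 (en) Method, device and system for detecting security of download link
CN103685174B (zh) Sample-independent phishing website detection method
WO2013044744A1 (zh) Method and device for providing download resources
CN103279710B (zh) Method and system for detecting malicious code in Internet information systems
US20090287641A1 (en) Method and system for crawling the world wide web
US20120304287A1 (en) Automatic detection of search results poisoning attacks
CN105760379B (zh) Method and device for detecting webshell pages based on association relationships between pages within a domain
CN103632084A (zh) Method for building a malicious feature database, malicious object detection method, and devices therefor
Kim et al. Detecting fake anti-virus software distribution webpages
CN109104421B (zh) Website content tampering detection method, apparatus, device and readable storage medium
CN106021418B (zh) Method and device for clustering news events
WO2014000538A1 (zh) Cloud URL recommendation method and system based on terminal access statistics, and related devices
CN107437026B (zh) Malicious webpage advertisement detection method based on advertising network topology
CN107463844B (zh) Web trojan detection method and system
CN106022126B (zh) Webpage feature extraction method for web trojan detection
JP5752642B2 (ja) Monitoring device and monitoring method
CN112532624B (zh) Hidden-link detection method, apparatus, electronic device and readable storage medium
CN103440454B (zh) Active honeypot detection method based on search engine keywords
CN110135153A (zh) Method and device for trusted detection of software
CN108959930A (zh) Malicious PDF detection method, system, data storage device and detection program
JP6823205B2 (ja) Collection device, collection method and collection program
CN113132340B (zh) Phishing website identification method and electronic device based on visual and host features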

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13809093

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 14411089

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13809093

Country of ref document: EP

Kind code of ref document: A1