KR20220053549A

KR20220053549A - Inline malware detection

Info

Publication number: KR20220053549A
Application number: KR1020227001606A
Authority: KR
Inventors: William Redington Hewlett; Suiqiang Deng; Sheng Yang; Ho Yu Lam
Original assignee: Palo Alto Networks, Inc.
Priority date: 2019-07-19
Filing date: 2020-07-06
Publication date: 2022-04-29
Also published as: EP3999985A1; KR102676386B1; JP2022541250A; JP7411775B2; EP3999985A4; CN114072798A; JP2024023875A; WO2021015941A1

Abstract

Detection of malicious files is disclosed. A set comprising one or more sample classification models is stored on a networking device. N-gram analysis is performed on a sequence of received packets associated with a received file. Performing the n-gram analysis includes using at least one stored sample classification model. A determination of whether the received file is malicious is made based at least in part on the n-gram analysis of the sequence of received packets. In response to determining that the file is malicious, propagation of the received file is prevented.
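The n-gram analysis summarized above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a stored sample classification model is a table of per-n-gram weights and that a file is treated as malicious when the accumulated score crosses a threshold; the names `score_stream`, `is_malicious`, `weights`, and `threshold` are hypothetical and are not taken from the disclosure. The point it shows is that n-grams can be evaluated packet by packet, carrying over only the last n-1 bytes, so the file never has to be buffered in full.

```python
from typing import Dict, Iterable


def score_stream(packets: Iterable[bytes], weights: Dict[bytes, float], n: int = 4) -> float:
    """Accumulate model weights for every n-gram seen across a packet sequence."""
    tail = b""      # last n-1 bytes of the stream seen so far
    score = 0.0
    for payload in packets:
        window = tail + payload
        # Every n-gram in `window` ends inside the new payload, so nothing
        # is counted twice even though `tail` was already seen.
        for i in range(len(window) - n + 1):
            score += weights.get(window[i:i + n], 0.0)
        tail = window[-(n - 1):] if n > 1 else b""
    return score


def is_malicious(packets: Iterable[bytes], weights: Dict[bytes, float],
                 threshold: float, n: int = 4) -> bool:
    """Hypothetical verdict rule: compare the accumulated score to a threshold."""
    return score_stream(packets, weights, n) >= threshold
```

Because only n-1 bytes are retained between packets, an n-gram that straddles a packet boundary is still scored exactly once.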

Description

Inline malware detection

Malware is a general term commonly used to refer to malicious software (e.g., including a variety of hostile, intrusive, and/or otherwise unwanted software). Malware can be in the form of code, scripts, active content, and/or other software. Example uses of malware include disrupting computer and/or network operations, stealing proprietary information (e.g., confidential information, such as identity, financial, and/or intellectual property related information), and/or gaining access to private/proprietary computer systems and/or computer networks. Unfortunately, as techniques are developed to help detect and mitigate malware, unscrupulous authors find ways to circumvent such efforts. Accordingly, there is an ongoing need for improvements to techniques for identifying and mitigating malware.

Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
FIG. 1 illustrates an example of an environment in which malicious applications are detected and prevented from causing harm.
FIG. 2A illustrates an embodiment of a data appliance.
FIG. 2B is a functional diagram of logical components of an embodiment of a data appliance.
FIG. 3 illustrates an example of logical components that can be included in a system for analyzing samples.
FIG. 4 illustrates portions of an example embodiment of a threat engine.
FIG. 5 illustrates an example of a portion of a tree.
FIG. 6 illustrates an example of a process for performing inline malware detection on a data appliance.
FIG. 7A illustrates an example hash table for a file.
FIG. 7B illustrates an example threat signature for a sample.
FIG. 8A illustrates an example of a process for performing feature extraction.
FIG. 8B illustrates an example of a process for generating a model.

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer-readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or as a specific component that is manufactured to perform the task. As used herein, the term 'processor' refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

I. Overview

A firewall generally protects networks from unauthorized access while permitting authorized communications to pass through the firewall. A firewall is typically a device, a set of devices, or software executed on a device that provides a firewall function for network access. For example, a firewall can be integrated into operating systems of devices (e.g., computers, smartphones, or other types of network communication capable devices). A firewall can also be integrated into or executed as one or more software applications on various types of devices, such as computer servers, gateways, network/routing devices (e.g., network routers), and data appliances (e.g., security appliances or other types of special purpose devices), and in various implementations, certain operations can be implemented in special purpose hardware, such as an ASIC or FPGA.

Firewalls typically deny or permit network transmission based on a set of rules. These sets of rules are often referred to as policies (e.g., network policies or network security policies). For example, a firewall can filter inbound traffic by applying a set of rules or policies to prevent unwanted outside traffic from reaching protected devices. A firewall can also filter outbound traffic by applying a set of rules or policies (e.g., allow, block, monitor, notify or log, and/or other actions can be specified in firewall rules or firewall policies, which can be triggered based on various criteria, as described herein). A firewall can also filter local network (e.g., intranet) traffic by similarly applying a set of rules or policies.

Security devices (e.g., security appliances, security gateways, security services, and/or other security devices) can include various security functions (e.g., firewall, anti-malware, intrusion prevention/detection, Data Loss Prevention (DLP), and/or other security functions), networking functions (e.g., routing, Quality of Service (QoS), workload balancing of network related resources, and/or other networking functions), and/or other functions. For example, routing functions can be based on source information (e.g., IP address and port), destination information (e.g., IP address and port), and protocol information.

A basic packet filtering firewall filters network communication traffic by inspecting individual packets transmitted over a network (e.g., packet filtering firewalls or first generation firewalls, which are stateless packet filtering firewalls). Stateless packet filtering firewalls typically inspect the individual packets themselves and apply rules based on the inspected packets (e.g., using a combination of a packet's source and destination address information, protocol information, and a port number).
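As an illustration of the stateless filtering just described, the following minimal sketch evaluates each packet in isolation against an ordered rule list. The `Packet` and `Rule` types, the first-match-wins ordering, and the default-deny fallback are assumptions made for the example and are not details taken from the specification.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Iterable, Optional


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    protocol: str
    dst_port: int


@dataclass(frozen=True)
class Rule:
    action: str                      # "allow" or "deny"
    src_net: str = "0.0.0.0/0"       # matches any IPv4 source by default
    dst_net: str = "0.0.0.0/0"       # matches any IPv4 destination by default
    protocol: Optional[str] = None   # None matches any protocol
    dst_port: Optional[int] = None   # None matches any port

    def matches(self, pkt: Packet) -> bool:
        return (
            ip_address(pkt.src_ip) in ip_network(self.src_net)
            and ip_address(pkt.dst_ip) in ip_network(self.dst_net)
            and self.protocol in (None, pkt.protocol)
            and self.dst_port in (None, pkt.dst_port)
        )


def filter_packet(pkt: Packet, rules: Iterable[Rule], default: str = "deny") -> str:
    """Return the action of the first matching rule; no per-connection state is kept."""
    for rule in rules:
        if rule.matches(pkt):
            return rule.action
    return default


RULES = [
    Rule("allow", dst_net="10.0.0.0/24", protocol="tcp", dst_port=443),
    Rule("deny", src_net="203.0.113.0/24"),
]
```

Because the decision depends only on fields of the packet at hand, two packets with identical headers always receive the same verdict regardless of what was seen before.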

Application firewalls can also perform application layer filtering (e.g., application layer filtering firewalls or second generation firewalls, which work on the application level of the TCP/IP stack). Application layer filtering firewalls or application firewalls can generally identify certain applications and protocols (e.g., web browsing using HyperText Transfer Protocol (HTTP), a Domain Name System (DNS) request, a file transfer using File Transfer Protocol (FTP), and various other types of applications and other protocols, such as Telnet, DHCP, TCP, UDP, and TFTP (GSS)). For example, application firewalls can block unauthorized protocols that attempt to communicate over a standard port (e.g., an unauthorized/out-of-policy protocol attempting to sneak through by using a non-standard port for that protocol can generally be identified using application firewalls).

Stateful firewalls can also perform state-based packet inspection, in which each packet is examined within the context of a series of packets associated with that network transmission's flow of packets. This firewall technique is generally referred to as stateful packet inspection because it maintains records of all connections passing through the firewall and is able to determine whether a packet is the start of a new connection, a part of an existing connection, or an invalid packet. For example, the state of a connection can itself be one of the criteria that triggers a rule within a policy.
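The three-way decision described above (new connection, existing connection, or invalid packet) can be sketched as follows. This is a deliberately simplified illustration for TCP only: the `ConnectionTable` class, the flag representation, and the rule that only a bare SYN may open a connection are assumptions of the example, and a real implementation would also track sequence numbers, timeouts, and connection teardown.

```python
from typing import Dict, Iterable, Tuple

# (src_ip, src_port, dst_ip, dst_port); a hypothetical flow key for the sketch
FlowKey = Tuple[str, int, str, int]


class ConnectionTable:
    """Keeps a record of connections seen so far."""

    def __init__(self) -> None:
        self._connections: Dict[FlowKey, str] = {}

    def classify(self, key: FlowKey, tcp_flags: Iterable[str]) -> str:
        src_ip, src_port, dst_ip, dst_port = key
        reverse = (dst_ip, dst_port, src_ip, src_port)
        # Traffic in either direction of a recorded connection is "existing".
        if key in self._connections or reverse in self._connections:
            return "existing"
        # Only a bare SYN may start a new connection in this sketch.
        if set(tcp_flags) == {"SYN"}:
            self._connections[key] = "SYN_SEEN"
            return "new"
        # Anything else has no recorded connection to belong to.
        return "invalid"
```

A policy rule could then use the returned state as one of its match criteria, for example dropping every packet classified as "invalid".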

Advanced or next-generation firewalls can perform stateless and stateful packet filtering and application layer filtering as discussed above. Next-generation firewalls can also perform additional firewall techniques. For example, certain newer firewalls, sometimes referred to as advanced or next-generation firewalls, can also identify users and content. In particular, certain next-generation firewalls are expanding the list of applications that these firewalls can automatically identify to thousands of applications. Examples of such next-generation firewalls are commercially available from Palo Alto Networks, Inc. (e.g., Palo Alto Networks' PA Series firewalls). For example, Palo Alto Networks' next-generation firewalls enable enterprises to identify and control applications, users, and content (not just ports, IP addresses, and packets) using various identification technologies, such as the following: App-ID for accurate application identification, User-ID for user identification (e.g., by user or user group), and Content-ID for real-time content scanning (e.g., controlling web surfing and limiting data and file transfers). These identification technologies allow enterprises to securely enable application usage using business-relevant concepts, instead of following the traditional approach offered by traditional port-blocking firewalls. Also, special purpose hardware for next-generation firewalls (implemented, for example, as dedicated appliances) generally provides higher performance levels for application inspection than software executed on general purpose hardware (e.g., such as security appliances provided by Palo Alto Networks, Inc., which use dedicated, function-specific processing that is tightly integrated with a single-pass software engine to maximize network throughput while minimizing latency).

Advanced or next-generation firewalls can also be implemented using virtualized firewalls. Examples of such next-generation firewalls are commercially available from Palo Alto Networks, Inc. (e.g., Palo Alto Networks' VM Series firewalls, which support various commercial virtualized environments, including, for example, VMware® ESXi™ and NSX™, Citrix® Netscaler SDX™, KVM/OpenStack (Centos/RHEL, Ubuntu®), and Amazon Web Services (AWS)). For example, virtualized firewalls can support similar or the exact same next-generation firewall and advanced threat prevention features available in physical form factor appliances, allowing enterprises to safely enable applications flowing into, and across, their private, public, and hybrid cloud computing environments. Automation features such as VM monitoring, dynamic address groups, and a REST-based API allow enterprises to proactively monitor VM changes, dynamically feeding that context into security policies and thereby eliminating the policy lag that may occur when VMs change.

II. Example Environment

FIG. 1 illustrates an example of an environment in which malicious applications ("malware") are detected and prevented from causing harm. As will be described in more detail below, malware classifications (e.g., as made by security platform 122) can be variously shared and/or refined among various entities included in the environment shown in FIG. 1. Using the techniques described herein, devices, such as endpoint client devices 104-110, can be protected from such malware.

The term "application" is used throughout the specification to collectively refer to programs, bundles of programs, manifests, packages, etc., irrespective of form/platform. An "application" (also referred to herein as a "sample") can be a standalone file (e.g., a calculator application having the filename "calculator.apk" or "calculator.exe") and can also be an independent component of another application (e.g., a mobile advertisement SDK or library embedded within the calculator app).

"Malware" as used herein refers to an application that engages in behaviors, whether clandestinely or not (and whether illegal or not), of which a user does not approve or would not approve if fully informed. Examples of malware include Trojans, viruses, rootkits, spyware, hacking tools, keyloggers, etc. One example of malware is a desktop application that collects and reports to a remote server the end user's location (but does not provide the user with location-based services, such as a mapping service). Another example of malware is a malicious Android Application Package .apk (APK) file that appears to an end user to be a free game, but stealthily sends SMS premium messages (e.g., costing $10 each), running up the end user's phone bill. Another example of malware is an Apple iOS flashlight application that stealthily collects the user's contacts and sends those contacts to a spammer. Other forms of malware can also be detected/thwarted using the techniques described herein (e.g., ransomware). Further, while n-grams/feature vectors/output accumulation variables are described herein as being generated for malicious applications, the techniques described herein can also be used in various embodiments to generate profiles for other kinds of applications (e.g., adware profiles, goodware profiles, etc.).

The techniques described herein can be used in conjunction with a variety of platforms (e.g., desktops, mobile devices, gaming platforms, embedded systems, etc.) and/or a variety of types of applications (e.g., Android .apk files, iOS applications, Windows PE files, Adobe Acrobat PDF files, etc.). In the example environment shown in FIG. 1, client devices 104-108 are a laptop computer, a desktop computer, and a tablet (respectively) present in an enterprise network 140. Client device 110 is a laptop computer present outside of enterprise network 140.

Data appliance 102 is configured to enforce policies regarding communications between client devices, such as client devices 104 and 106, and nodes outside of enterprise network 140 (e.g., reachable via external network 118). Examples of such policies include ones governing traffic shaping, quality of service, and routing of traffic. Other examples of policies include security policies, such as ones requiring the scanning for threats in incoming (and/or outgoing) email attachments, website content, files exchanged through instant messaging programs, and/or other file transfers. In some embodiments, data appliance 102 is also configured to enforce policies with respect to traffic that stays within enterprise network 140.

An embodiment of a data appliance is shown in FIG. 2A. The example shown is a representation of physical components that are included in data appliance 102, in various embodiments. Specifically, data appliance 102 includes a high performance multi-core Central Processing Unit (CPU) 202 and Random Access Memory (RAM) 204. Data appliance 102 also includes a storage 210 (such as one or more hard disks or solid state storage units). In various embodiments, data appliance 102 stores (whether in RAM 204, storage 210, and/or other appropriate locations) information used in monitoring enterprise network 140 and implementing the disclosed techniques. Examples of such information include application identifiers, content identifiers, user identifiers, requested URLs, IP address mappings, policy and other configuration information, signatures, hostname/URL categorization information, malware profiles, and machine learning models. Data appliance 102 can also include one or more optional hardware accelerators. For example, data appliance 102 can include a cryptographic engine 206 configured to perform encryption and decryption operations, and one or more Field Programmable Gate Arrays (FPGAs) 208 configured to perform matching, act as network processors, and/or perform other tasks.

Functionality described herein as being performed by data appliance 102 can be provided/implemented in a variety of ways. For example, data appliance 102 can be a dedicated device or set of devices. The functionality provided by data appliance 102 can also be integrated into or executed as software on a general purpose computer, a computer server, a gateway, and/or a network/routing device. In some embodiments, at least some services described as being provided by data appliance 102 are instead (or in addition) provided to a client device (e.g., client device 104 or client device 110) by software executing on the client device.

Whenever data appliance 102 is described as performing a task, a single component, a subset of components, or all components of data appliance 102 may cooperate to perform the task. Similarly, whenever a component of data appliance 102 is described as performing a task, a subcomponent may perform the task and/or the component may perform the task in conjunction with other components. In various embodiments, portions of data appliance 102 are provided by one or more third parties. Depending on factors such as the amount of computing resources available to data appliance 102, various logical components and/or features of data appliance 102 may be omitted, and the techniques described herein adapted accordingly. Similarly, additional logical components/features can be included in embodiments of data appliance 102 as applicable. One example of a component included in data appliance 102 in various embodiments is an application identification engine, which is configured to identify an application (e.g., using various application signatures for identifying applications based on packet flow analysis). For example, the application identification engine can determine what type of traffic a session involves, such as Web Browsing - Social Networking; Web Browsing - News; SSH; and so on.

FIG. 2B is a functional diagram of logical components of an embodiment of a data appliance. The example shown is a representation of logical components that can be included in data appliance 102 in various embodiments. Unless otherwise specified, various logical components of data appliance 102 are generally implementable in a variety of ways, including as a set of one or more scripts (e.g., written in Java, Python, etc., as applicable).

As shown, data appliance 102 comprises a firewall, and includes a management plane 232 and a data plane 234. The management plane is responsible for managing user interactions, such as by providing a user interface for configuring policies and viewing log data. The data plane is responsible for managing data, such as by performing packet processing and session handling.

Network processor 236 is configured to receive packets from client devices, such as client device 108, and provide them to data plane 234 for processing. Whenever flow module 238 identifies packets as being part of a new session, it creates a new session flow. Subsequent packets will be identified as belonging to the session based on a flow lookup. If applicable, SSL decryption is applied by SSL decryption engine 240. Otherwise, processing by SSL decryption engine 240 is omitted. Decryption engine 240 can help data appliance 102 inspect and control SSL/TLS and SSH encrypted traffic, and thus help to stop threats that might otherwise remain hidden in encrypted traffic. Decryption engine 240 can also help prevent sensitive content from leaving enterprise network 140. Decryption can be controlled (e.g., enabled or disabled) selectively based on parameters such as: URL category, traffic source, traffic destination, user, user group, and port. In addition to decryption policies (e.g., that specify which sessions to decrypt), decryption profiles can be assigned to control various options for sessions controlled by the policy. For example, the use of specific cipher suites and encryption protocol versions can be required.

Application identification (App-ID) engine 242 is configured to determine what type of traffic a session involves. As one example, application identification engine 242 can recognize a GET request in received data and conclude that the session requires an HTTP decoder. In some cases, e.g., a web browsing session, the identified application can change, and such changes will be noted by data appliance 102. For example, a user may initially browse to a corporate Wiki (classified based on the URL visited as "Web Browsing - Productivity") and then subsequently browse to a social networking site (classified based on the URL visited as "Web Browsing - Social Networking"). Different types of protocols have corresponding decoders.
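The GET-request example above can be sketched as a simple prefix check on the first payload of a session. This is only an illustration: the `pick_decoder` function and its decoder labels are hypothetical, and a production engine would rely on far richer application signatures than leading bytes. The `SSH-` prefix reflects the identification string with which SSH peers open a connection.

```python
# HTTP request methods checked by this sketch; not an exhaustive list.
HTTP_METHODS = (b"GET ", b"POST ", b"HEAD ", b"PUT ")


def pick_decoder(first_payload: bytes) -> str:
    """Guess which protocol decoder a session needs from its first payload bytes."""
    if first_payload.startswith(HTTP_METHODS):
        return "http"
    if first_payload.startswith(b"SSH-"):
        return "ssh"
    return "unknown"
```

For instance, `pick_decoder(b"GET /wiki HTTP/1.1\r\n")` selects the HTTP decoder, while payloads matching no known prefix are left as "unknown" for further analysis.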

Based on the determination made by application identification engine 242, the packets are sent, by threat engine 244, to an appropriate decoder configured to assemble packets (which may be received out of order) into the correct order, perform tokenization, and extract information. Threat engine 244 also performs signature matching to determine what should happen to the packet. As needed, SSL encryption engine 246 can re-encrypt decrypted data. Packets are forwarded using forward module 248 for transmission (e.g., to a destination).
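The reordering step can be sketched as follows, under the assumption that each packet payload arrives tagged with its byte offset in the stream (as TCP sequence numbers would allow). The `reassemble` helper is hypothetical and only illustrates putting out-of-order data back into the correct order.

```python
from typing import Iterable, Tuple


def reassemble(segments: Iterable[Tuple[int, bytes]]) -> bytes:
    """Order (offset, payload) segments and return the contiguous prefix."""
    stream = bytearray()
    for offset, payload in sorted(segments, key=lambda seg: seg[0]):
        if offset > len(stream):
            # Gap in the stream: data past this point cannot be decoded yet.
            break
        # Skip bytes already present (retransmissions or overlapping segments).
        stream.extend(payload[len(stream) - offset:])
    return bytes(stream)
```

For example, `reassemble([(5, b" world"), (0, b"hello")])` yields `b"hello world"`, after which a decoder could tokenize the ordered bytes.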

As also shown in FIG. 2B, policies 252 are received and stored in management plane 232. Policies can include one or more rules, which can be specified using domain and/or host/server names, and rules can apply one or more signatures or other matching criteria or heuristics, such as for security policy enforcement for subscriber/IP flows based on various parameters/information extracted from monitored session traffic flows. An interface (I/F) communicator 250 is provided for management communications (e.g., via (REST) APIs, messages, or network protocol communications or other communication mechanisms).

III. Security Platform

Returning to FIG. 1, suppose a malicious individual (using system 120) has created malware 130. The malicious individual hopes that a client device, such as client device 104, will execute a copy of malware 130, compromising the client device and, for example, causing it to become a bot in a botnet. The compromised client device can then be instructed to perform tasks (e.g., cryptocurrency mining, or participating in denial of service attacks) and to report information to an external entity, such as command and control (C&C) server 150, as well as to receive instructions from C&C server 150, as applicable.

Suppose data device 102 has intercepted an email sent (e.g., by system 120) to a user, "Alice," who operates client device 104. A copy of malware 130 has been attached by system 120 to the message. In an alternate, but similar, scenario, data device 102 could intercept an attempted download by client device 104 of malware 130 (e.g., from a website). In either scenario, data device 102 determines whether a signature for the file (e.g., the email attachment or website download of malware 130) is present on data device 102. A signature, if present, can indicate that a file is known to be safe (e.g., is whitelisted), and can also indicate that the file is known to be malicious (e.g., is blacklisted).

In various embodiments, data device 102 is configured to work in cooperation with security platform 122. As one example, security platform 122 can provide to data device 102 a set of signatures of known-malicious files (e.g., as part of a subscription). If a signature for malware 130 is included in the set (e.g., an MD5 hash of malware 130), data device 102 can prevent the transmission of malware 130 to client device 104 accordingly (e.g., by detecting that an MD5 hash of the email attachment sent to client device 104 matches the MD5 hash of malware 130). Security platform 122 can also provide to data device 102 a list of known malicious domains and/or IP addresses, allowing data device 102 to block traffic between enterprise network 140 and C&C server 150 (e.g., where C&C server 150 is known to be malicious). The list of malicious domains (and/or IP addresses) can also help data device 102 determine when one of its nodes has been compromised. For example, if client device 104 attempts to contact C&C server 150, such an attempt is a strong indicator that client device 104 has been compromised by malware (and remedial actions should be taken accordingly, such as quarantining client device 104 from communicating with other nodes within enterprise network 140). As will be described in more detail below, security platform 122 can also provide other types of information to data device 102 (e.g., as part of a subscription), such as a set of machine learning models usable by data device 102 to perform inline analysis of files.
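The hash-based signature check described above can be sketched in a few lines. This is an illustration only: the helper names `md5_hex` and `should_block` and the sample bytes are hypothetical, and the signature set stands in for whatever the security platform actually distributes.

```python
import hashlib
from typing import Set


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of a file's bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


def should_block(attachment: bytes, known_malicious_md5: Set[str]) -> bool:
    """Block when the attachment's MD5 hash matches a known-malicious signature."""
    return md5_hex(attachment) in known_malicious_md5


# Hypothetical stand-in for a signature set received as part of a subscription.
sample = b"example-malware-bytes"
signatures = {md5_hex(sample)}
```

With these values, `should_block(sample, signatures)` is true, while changing a single byte of the file produces a different hash and no match. That limitation is one reason exact-hash signatures alone miss subtly modified malware.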

In various embodiments, a variety of actions can be taken by data device 102 if no signature for an attachment is found. As a first example, data device 102 can fail-safe, by blocking transmission of any attachments not whitelisted as benign (e.g., not matching signatures of known good files). A drawback of this approach is that many legitimate attachments may be unnecessarily blocked as potential malware when they are in fact benign. As a second example, data device 102 can fail-danger, by allowing transmission of any attachments not blacklisted as malicious (e.g., not matching signatures of known bad files). A drawback of this approach is that newly created malware (previously unseen by platform 122) will not be prevented from causing harm.
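The two fallback behaviors differ only in what happens when neither list contains the file. A minimal sketch, with the hypothetical names `Mode` and `decide` and with files represented by opaque hash strings, might look like this:

```python
from enum import Enum
from typing import Set


class Mode(Enum):
    FAIL_SAFE = "fail-safe"      # block anything not known to be good
    FAIL_DANGER = "fail-danger"  # allow anything not known to be bad


def decide(file_hash: str, whitelist: Set[str], blacklist: Set[str], mode: Mode) -> str:
    """Return "allow" or "block" for a file, given signature lists and a mode."""
    if file_hash in blacklist:
        return "block"
    if file_hash in whitelist:
        return "allow"
    # No signature found: the configured mode decides.
    return "block" if mode is Mode.FAIL_SAFE else "allow"
```

The trade-off in the text shows up directly in the last line: the fail-safe branch blocks unknown benign files, and the fail-danger branch lets unknown malware through.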

As a third example, data device 102 can be configured to provide the file (e.g., malware 130) to security platform 122 for static/dynamic analysis, to determine whether it is malicious and/or to otherwise classify it. A variety of actions can be taken by data device 102 while analysis by security platform 122 of the attachment (for which a signature is not already present) is performed. As a first example, data device 102 can prevent the email (and attachment) from being delivered to Alice until a response is received from security platform 122. Assuming platform 122 takes approximately 15 minutes to thoroughly analyze a sample, this means that the incoming message to Alice will be delayed by 15 minutes. Since, in this example, the attachment is malicious, such a delay will not impact Alice negatively. In an alternate example, suppose someone has sent Alice a time-sensitive message with a benign attachment for which a signature is also not present. Delaying delivery of the message to Alice by 15 minutes will likely be viewed (e.g., by Alice) as unacceptable. As will be described in more detail below, an alternate approach is to perform at least some real-time analysis on the attachment on data device 102 (e.g., while awaiting a verdict from platform 122). If data device 102 can independently determine whether the attachment is malicious or benign, it can take an initial action (e.g., block or allow delivery to Alice), and can adjust/take additional actions once a verdict is received from security platform 122, as applicable.

Security platform 122 stores copies of received samples in storage 142, and analysis is commenced (or scheduled, as applicable). One example of storage 142 is an Apache Hadoop Cluster (HDFS). Results of analysis (and additional information pertaining to the applications) are stored in database 146. In the event an application is determined to be malicious, data devices can be configured to automatically block the file download based on the analysis result. Further, a signature can be generated for the malware and distributed (e.g., to data devices such as data devices 102, 136, and 148) to automatically block future file transfer requests to download the file determined to be malicious.

In various embodiments, security platform 122 comprises one or more dedicated commercially available hardware servers (e.g., having multi-core processor(s), 32G+ of RAM, gigabit network interface adaptor(s), and hard drive(s)) running typical server-class operating systems (e.g., Linux). Security platform 122 can be implemented across a scalable infrastructure comprising multiple such servers, solid state drives, and/or other applicable high-performance hardware. Security platform 122 can comprise several distributed components, including components provided by one or more third parties. For example, portions or all of security platform 122 can be implemented using the Amazon Elastic Compute Cloud (EC2) and/or Amazon Simple Storage Service (S3). Further, as with data device 102, whenever security platform 122 is referred to as performing a task, such as storing data or processing data, it is to be understood that a sub-component or multiple sub-components of security platform 122 (whether individually or in cooperation with third party components) may cooperate to perform that task. As one example, security platform 122 can optionally perform static/dynamic analysis in cooperation with one or more virtual machine (VM) servers, such as VM server 124.

An example of a virtual machine server is a physical machine comprising commercially available server-class hardware (e.g., a multi-core processor, 32+ gigabytes of RAM, and one or more gigabit network interface adapters) that runs commercially available virtualization software, such as VMware ESXi, Citrix XenServer, or Microsoft Hyper-V. In some embodiments, the virtual machine server is omitted. Further, a virtual machine server may be under the control of the same entity that administers security platform 122, but may also be provided by a third party. As one example, the virtual machine server can rely on EC2, with the remainder of security platform 122 provided by dedicated hardware owned by and under the control of the operator of security platform 122. VM server 124 is configured to provide one or more virtual machines 126-128 for emulating client devices. The virtual machines can execute a variety of operating systems and/or versions thereof. Observed behaviors resulting from executing applications in the virtual machines are logged and analyzed (e.g., for indications that the application is malicious). In some embodiments, log analysis is performed by the VM server (e.g., VM server 124). In other embodiments, analysis is performed at least in part by other components of security platform 122, such as coordinator 144.

In various embodiments, security platform 122 makes the results of its analysis of samples available to data device 102, as part of a subscription, via a list of signatures (and/or other identifiers). For example, security platform 122 can periodically send a content package that identifies malware apps (e.g., daily, hourly, or at some other interval, and/or based on an event configured by one or more policies). An example content package includes a listing of identified malware apps, with information such as a package name, a hash value for uniquely identifying the app, and a malware name (and/or malware family name) for each identified malware app. The subscription can cover the analysis of just those files intercepted by data device 102 and sent to security platform 122 by data device 102, and can also cover signatures of all malware known to security platform 122 (or subsets thereof, such as just mobile malware but not other forms of malware (e.g., PDF malware)). As will be described in more detail below, platform 122 can also make available other types of information, such as machine learning models that can help data device 102 detect malware (e.g., through techniques other than hash-based signature matching).
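One way to picture such a content package is as a list of records indexed by hash for fast lookup. The sketch below is an assumption about shape only: the class `MalwareEntry`, its field names, and the sample values are hypothetical, and the source does not specify a wire format or a particular hash algorithm.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class MalwareEntry:
    package_name: str             # e.g., an app package identifier
    hash_value: str               # value that uniquely identifies the app
    malware_name: str
    family: Optional[str] = None  # malware family name, when known


def index_by_hash(entries: List[MalwareEntry]) -> Dict[str, MalwareEntry]:
    """Build a lookup table so an intercepted file's hash can be checked quickly."""
    return {entry.hash_value: entry for entry in entries}


# Hypothetical package contents, for illustration only.
package = [
    MalwareEntry("com.example.flashlight", "a1b2c3", "ExampleStealer", "StealerFamily"),
    MalwareEntry("com.example.game", "d4e5f6", "ExampleMiner"),
]
lookup = index_by_hash(package)
```

A device receiving periodic packages could rebuild or merge this index on each update and consult it when a file's hash is computed.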

In various embodiments, security platform 122 is configured to provide security services to a variety of entities in addition to (or, as applicable, instead of) an operator of data device 102. For example, other enterprises, having their own respective enterprise networks 114 and 116, and their own respective data devices 136 and 148, can contract with the operator of security platform 122. Other types of entities can also make use of the services of security platform 122. For example, an Internet Service Provider (ISP) providing Internet service to client device 110 can contract with security platform 122 to analyze applications which client device 110 attempts to download. As another example, the owner of client device 110 can install software on client device 110 that communicates with security platform 122 (e.g., to receive content packages from security platform 122, use the received content packages to check attachments in accordance with techniques described herein, and transmit applications to security platform 122 for analysis).

IV. Analyzing Samples Using Static/Dynamic Analysis

FIG. 3 illustrates an example of logical components that can be included in a system for analyzing samples. Analysis system 300 can be implemented using a single device. For example, the functionality of analysis system 300 can be implemented in a malware analysis module 112 incorporated into data device 102. Analysis system 300 can also be implemented, collectively, across multiple distinct devices. For example, the functionality of analysis system 300 can be provided by security platform 122.

In various embodiments, analysis system 300 makes use of lists, databases, or other collections of known safe content and/or known bad content (collectively shown in FIG. 3 as collection 314). Collection 314 can be obtained in a variety of ways, including via a subscription service (e.g., provided by a third party) and/or as a result of other processing (e.g., performed by data device 102 and/or security platform 122). Examples of information included in collection 314 are: URLs, domain names, and/or IP addresses of known malicious servers; URLs, domain names, and/or IP addresses of known safe servers; URLs, domain names, and/or IP addresses of known command and control (C&C) domains; signatures, hashes, and/or other identifiers of known malicious applications; signatures, hashes, and/or other identifiers of known safe applications; signatures, hashes, and/or other identifiers of known malicious files (e.g., Android exploit files); signatures, hashes, and/or other identifiers of known safe libraries; and signatures, hashes, and/or other identifiers of known malicious libraries.

A. Collection

In various embodiments, when a new sample is received for analysis (e.g., no existing signature associated with the sample is present in analysis system 300), it is added to queue 302. As shown in FIG. 3, application 130 is received by system 300 and added to queue 302.

B. Static Analysis

Coordinator 304 monitors queue 302, and as resources (e.g., a static analysis worker) become available, coordinator 304 fetches a sample from queue 302 for processing (e.g., fetches a copy of malware 130). In particular, coordinator 304 first provides the sample to static analysis engine 306 for static analysis. In some embodiments, one or more static analysis engines are included within analysis system 300, where analysis system 300 is a single device. In other embodiments, static analysis is performed by a separate static analysis server that includes a plurality of workers (i.e., a plurality of instances of static analysis engine 306).

The static analysis engine obtains general information about the sample and includes it (along with heuristic and other information, as applicable) in a static analysis report 308. The report can be created by the static analysis engine, or by coordinator 304 (or by another appropriate component), which can be configured to receive the information from static analysis engine 306. In some embodiments, the collected information is stored in a database record for the sample (e.g., in database 316), instead of or in addition to a separate static analysis report 308 being created (i.e., portions of the database record form the report 308). In some embodiments, the static analysis engine also forms a verdict with respect to the application (e.g., "safe," "suspicious," or "malicious"). As one example, the verdict can be "malicious" if even one "malicious" static feature is present in the application (e.g., the application includes a hard link to a known malicious domain). As another example, points can be assigned to each of the features (e.g., based on severity if found; based on how reliable the feature is for predicting malice; etc.) and a verdict can be assigned by static analysis engine 306 (or coordinator 304, if applicable) based on the number of points associated with the static analysis results.
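The points-based verdict can be illustrated with a small scoring function. Everything specific here is an assumption made for the example: the feature names, the point values, and the two thresholds are hypothetical and are not taken from the described static analysis engine 306.

```python
from typing import Dict, Iterable

# Hypothetical feature weights (severity / reliability), for illustration only.
FEATURE_POINTS: Dict[str, int] = {
    "hard_link_to_known_malicious_domain": 10,
    "packed_executable_section": 4,
    "requests_sms_permission": 2,
}
SUSPICIOUS_AT = 4   # hypothetical threshold
MALICIOUS_AT = 10   # hypothetical threshold


def static_verdict(found_features: Iterable[str]) -> str:
    """Sum the points of the features found and map the total to a verdict."""
    score = sum(FEATURE_POINTS.get(name, 0) for name in found_features)
    if score >= MALICIOUS_AT:
        return "malicious"
    if score >= SUSPICIOUS_AT:
        return "suspicious"
    return "safe"
```

Giving a single feature a weight at or above the "malicious" threshold reproduces the first example in the text, where one malicious static feature is enough on its own.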

C. Dynamic Analysis

Once static analysis is completed, coordinator 304 locates an available dynamic analysis engine 310 to perform dynamic analysis on the application. As with static analysis engine 306, analysis system 300 can include one or more dynamic analysis engines directly. In other embodiments, dynamic analysis is performed by a separate dynamic analysis server that includes a plurality of workers (i.e., a plurality of instances of dynamic analysis engine 310).

Each dynamic analysis worker manages a virtual machine instance. In some embodiments, results of static analysis (e.g., performed by static analysis engine 306), whether in report form (308) and/or as stored in database 316 or otherwise stored, are provided as input to dynamic analysis engine 310. For example, the static report information can be used to help select/customize the virtual machine instance used by dynamic analysis engine 310 (e.g., Microsoft Windows 7 SP2 vs. Microsoft Windows 10 Enterprise, or iOS 11.0 vs. iOS 12.0). Where multiple virtual machine instances are executed at the same time, a single dynamic analysis engine can manage all of the instances, or multiple dynamic analysis engines can be used (e.g., with each managing its own virtual machine instance), as applicable. As will be described in more detail below, during the dynamic portion of the analysis, actions taken by the application (including network activity) are analyzed.

In various embodiments, static analysis of a sample is omitted or is performed by a separate entity, as applicable. As one example, traditional static and/or dynamic analysis may be performed on files by a first entity. Once it is determined (e.g., by the first entity) that a given file is malicious, the file can be provided to a second entity (e.g., the operator of security platform 122) specifically for additional analysis with respect to the malware's use of network activity (e.g., by dynamic analysis engine 310).

The environment used by analysis system 300 is instrumented/hooked such that behaviors observed while the application is executing are logged as they occur (e.g., using a customized kernel that supports hooking and logcat). Network traffic associated with the emulator is also captured (e.g., using pcap). The log/network data can be stored as a temporary file on analysis system 300, and can also be stored more permanently (e.g., using HDFS or another appropriate storage technology or combination of technologies, such as MongoDB). The dynamic analysis engine (or another appropriate component) can compare the connections made by the sample to lists of domains, IP addresses, etc. (314) and determine whether the sample has communicated (or attempted to communicate) with malicious entities.
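The final comparison step can be sketched as a set intersection between the hosts a sample contacted and a list of known-bad hosts. The helper names and the sample hostnames below are hypothetical, and extracting the contacted hosts from captured traffic is assumed to have happened already.

```python
from typing import Iterable, List, Set


def normalize(host: str) -> str:
    """Lower-case a hostname or IP string and drop any trailing root dot."""
    return host.strip().lower().rstrip(".")


def malicious_contacts(contacted: Iterable[str], known_bad: Set[str]) -> List[str]:
    """Return the contacted hosts that appear on the known-bad list, sorted."""
    bad = {normalize(host) for host in known_bad}
    return sorted({normalize(host) for host in contacted} & bad)
```

A non-empty result is the kind of evidence that could feed a "malicious" verdict, since it shows the sample communicated, or tried to communicate, with a listed entity.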

As with the static analysis engine, the dynamic analysis engine stores the results of its analysis in database 316 in the record associated with the application being tested (and/or includes the results in report 312, as applicable). In some embodiments, the dynamic analysis engine also forms a verdict with respect to the application (e.g., "safe," "suspicious," or "malicious"). As one example, the verdict can be "malicious" if even one "malicious" action is taken by the application (e.g., an attempt to contact a known malicious domain is made, or an attempt to exfiltrate sensitive information is observed). As another example, points can be assigned to actions taken (e.g., based on severity if found; based on how reliable the action is for predicting malice; etc.) and a verdict can be assigned by dynamic analysis engine 310 (or coordinator 304, if applicable) based on the number of points associated with the dynamic analysis results. In some embodiments, a final verdict associated with the sample is made based on a combination of report 308 and report 312 (e.g., by coordinator 304).

V. Inline Malware Detection

Returning to the environment of FIG. 1, millions of new malware samples may be generated each month (e.g., by nefarious individuals such as the operator of system 120, whether by making subtle changes to existing malware or by authoring new malware). Accordingly, there will exist many malware samples for which security platform 122 (at least initially) has no signature. Further, even where security platform 122 has generated signatures for newly created malware, resource constraints prevent data devices, such as data device 102, from having/using a list of all known signatures (e.g., as stored on platform 122) at any given time.

Sometimes malware, such as malware 130, will successfully penetrate network 140. One reason for this is where data appliance 102 operates on a "first-time allow" principle. Suppose that when data appliance 102 does not have a signature for a sample (e.g., sample 130) and submits it to security platform 122 for analysis, it takes security platform 122 approximately five minutes to return a verdict (e.g., "benign," "malicious," "unknown," etc.). Instead of blocking communications between system 120 and client device 104 during that five-minute period, under the first-time allow principle, the communication is allowed. When a verdict is returned (e.g., five minutes later), data appliance 102 can use the verdict (e.g., "malicious") to block subsequent transmissions of malware 130 to network 140, and can block communications between system 120 and network 140. In various embodiments, if a second copy of sample 130 arrives at data appliance 102 while data appliance 102 is awaiting a verdict from security platform 122, the second copy (and any subsequent copies) of sample 130 is held by system 120 pending a response from security platform 122.

Unfortunately, during the five minutes that data appliance 102 awaits a verdict from security platform 122, a user of client device 104 could execute malware 130, potentially compromising client device 104 or other nodes in network 140. As mentioned above, in various embodiments, data appliance 102 includes a malware analysis module 112. One task that malware analysis module 112 can perform is inline malware detection. In particular, and as will be described in more detail below, as a file (such as sample 130) passes through data appliance 102, machine learning techniques can be applied to perform efficient analysis of the file on data appliance 102 (e.g., in parallel with other processing performed on the file by data appliance 102), and an initial maliciousness verdict can be determined by data appliance 102 (e.g., while awaiting a verdict from security platform 122).

Various difficulties can arise in implementing such analysis on a resource-constrained appliance such as data appliance 102. One critical resource on appliance 102 is session memory. A session is a network transfer of information, including the files that appliance 102 is to analyze in accordance with the techniques described herein. A single appliance might have millions of concurrent sessions, and the memory available to persist during a given session is extremely limited. A first difficulty in performing inline analysis on a data appliance such as data appliance 102 is that, due to such memory constraints, data appliance 102 will typically not be able to process an entire file at once, but instead receives a sequence of packets that it needs to process, packet by packet. A machine learning approach used by data appliance 102 will accordingly need to accommodate packet streams in various embodiments. A second difficulty is that in some cases, data appliance 102 will be unable to determine where the end of a given file being processed occurs (e.g., the end of sample 130 in a stream). A machine learning approach used by data appliance 102 will accordingly need to be able to render a verdict for a given file potentially midstream (e.g., partway through the receipt/processing of sample 130, or otherwise before the actual end of the file) in various embodiments.

A. Machine Learning Models

As will be described in more detail below, in various embodiments, security platform 122 provides data appliance 102 with a set of machine learning models for data appliance 102 to use in conjunction with inline malware detection. The models incorporate features (e.g., n-grams or other features) determined by security platform 122 to correspond to malicious files. Two example types of such models are linear classification models and non-linear classification models. Examples of linear classification models that can be used by data appliance 102 include logistic regression and linear support vector machines. An example of a non-linear classification model that can be used by data appliance 102 is a gradient boosting tree (e.g., eXtreme Gradient Boosting (XGBoost)). The non-linear model is more accurate (and may be better at detecting obfuscated/disguised malware), but the linear model uses considerably fewer resources on appliance 102 (and is better suited to efficiently analyzing JavaScript or similar files).

As will be described in more detail below, which type of classification model is used for a given file being analyzed can be based on the filetype associated with the file (determined, e.g., by a magic number).

1. Additional Detail on the Threat Engine

In various embodiments, data appliance 102 includes a threat engine 244. The threat engine incorporates both protocol decoding and threat signature matching, during a respective decoder stage and pattern match stage. The results of the two stages are merged by a detector stage.

When data appliance 102 receives a packet, data appliance 102 performs a session match to determine which session the packet belongs to (allowing data appliance 102 to support concurrent sessions). Each session has a session state, which implicates a particular protocol decoder (e.g., a web browsing decoder, an FTP decoder, or an SMTP decoder). When a file is transmitted as part of a session, the applicable protocol decoder can make use of an appropriate file-specific decoder (e.g., a PE file decoder, a JavaScript decoder, or a PDF decoder).

Portions of an example embodiment of threat engine 244 are shown in FIG. 4. For a given session, decoder 402 walks the traffic bytestream, following the corresponding protocol and marking contexts. One example of a context is an end-of-file context (e.g., encountering </script> while processing a JavaScript file). Decoder 402 can mark the end-of-file context in the packet, which can then be used to trigger execution of the appropriate model using the observed features of the file. In some cases (e.g., FTP traffic), explicit protocol-level tags may not be present for decoder 402 to identify/mark the context. As will be described in more detail below, in various embodiments, decoder 402 can use other information (e.g., the file size as reported in a header) to determine when feature extraction for a file should end (e.g., where the overlay section begins) and execution using the appropriate model should begin.

Decoder 402 comprises two parts. The first part of decoder 402 is a virtual machine portion 404, which can be implemented as a state machine using a state machine language. The second part of decoder 402 is a set of tokens 406 (e.g., deterministic finite automaton (DFA) or regular expressions) for triggering state machine transitions and actions when matched in traffic. Threat engine 244 also includes a threat pattern matcher 408 that performs pattern matching for threat patterns (e.g., using regular expressions). As one example, threat pattern matcher 408 can be provided (e.g., by security platform 122) with a table of strings to match against (whether exact strings or wildcard strings) and corresponding actions to take if a string match is found. Detector 410 processes the outputs provided by decoder 402 and threat pattern matcher 408 to take various actions.

2. N-grams

The data in a session can be broken into a sequence of n-grams, i.e., a series of byte strings. As an example, suppose a portion of the hexadecimal data in a session is "1023ae42f6f28762aab". The 2-grams in the sequence are all of the pairs of adjacent bytes (each byte written as two hexadecimal digits), such as "1023", "23ae", "ae42", "42f6", etc. In various embodiments, threat engine 244 is configured to analyze files using 8-grams. Other n-grams, such as 7-grams or 4-grams, can also be used. In the example string above, "1023ae42f6f28762" is an 8-gram, and "23ae42f6f28762aa" is an 8-gram. The total number of different 8-grams possible in a byte sequence is 2⁶⁴ (18,446,744,073,709,551,616). Searching for all possible 8-grams in a byte sequence would easily exceed the resources of data appliance 102. Instead, and as will be described in more detail below, a significantly reduced set of 8-grams is provided by security platform 122 to data appliance 102 for use by threat engine 244.
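As a minimal sketch of the n-gram decomposition described above, the following treats each n-gram as a window of n adjacent bytes. The function name `byte_ngrams` is hypothetical. Because the example string has an odd number of hexadecimal digits, only its first 18 digits (nine whole bytes) are used here.

```python
def byte_ngrams(data: bytes, n: int):
    """Yield (offset, n-gram) for every window of n adjacent bytes."""
    for offset in range(len(data) - n + 1):
        yield offset, data[offset:offset + n]


# First nine whole bytes of the example string from the text.
sample = bytes.fromhex("1023ae42f6f28762aa")

two_grams = [gram.hex() for _, gram in byte_ngrams(sample, 2)]
eight_grams = [gram.hex() for _, gram in byte_ngrams(sample, 8)]

# Each byte takes one of 256 values, so there are 256**8 possible 8-grams.
possible_eight_grams = 256 ** 8
```

The size of `possible_eight_grams` is why an appliance would match against a small provided set of 8-grams instead of tracking every 8-gram it sees.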

As session packets corresponding to a file are received by threat engine 244, threat pattern matcher 408 parses the packets for matches against the strings in its table (e.g., by performing regular expression and/or exact string matches). A list of matches is generated (e.g., with each instance of a match identified by its corresponding pattern ID), along with the offset at which each match occurred. Actions for these matches are taken in order of offset (e.g., from low to high). For a given match (i.e., corresponding to a particular pattern ID), a set of one or more actions to take is specified (e.g., via an action table that maps actions to pattern IDs).
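The match list and action table described above might look like the following sketch. The pattern IDs, the patterns themselves, and the action names are all hypothetical. Python's `re` module stands in for whatever matching engine an appliance actually uses.

```python
import re

# Hypothetical pattern table: pattern ID -> compiled pattern.
# Exact strings are escaped so they match literally.
PATTERNS = {
    101: re.compile(re.escape(b"eval(")),
    102: re.compile(rb"pass(word|wd)"),
}

# Hypothetical action table: pattern ID -> actions to take on a match.
ACTIONS = {
    101: ["flag_heuristic"],
    102: ["flag_heuristic", "log"],
}


def find_matches(packet: bytes):
    """Return (offset, pattern ID) pairs sorted from low to high offset."""
    matches = [
        (match.start(), pattern_id)
        for pattern_id, pattern in PATTERNS.items()
        for match in pattern.finditer(packet)
    ]
    return sorted(matches)


def actions_in_offset_order(packet: bytes):
    """Resolve each match to its actions, preserving offset order."""
    return [(offset, pid, ACTIONS[pid]) for offset, pid in find_matches(packet)]
```

Sorting by offset before dispatching keeps the actions in the order the bytes appeared in the packet, regardless of which pattern produced each match.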

The set of 8-grams provided by security platform 122 can be added (e.g., as exact string matches) to the table of matches that threat pattern matcher 408 is already performing (e.g., heuristic matches looking for particular indicia of malware, such as where a JavaScript file accesses a password store, or a PE file calls a Local Security Authority Subsystem Service (LSASS) API). One advantage of this approach is that, instead of performing multiple passes through a packet (e.g., first evaluating for heuristic matches and then evaluating for 8-gram matches), the 8-grams can be searched for in parallel with the other searches performed by threat pattern matcher 408.

As will be described in more detail below, 8-gram matches are used by both linear and non-linear classification models in various embodiments. Example actions that can be specified for n-gram matches include incrementing a weighted counter (e.g., for a linear classifier) and storing the match in a feature vector (e.g., for a non-linear classifier). Which action is taken can be specified based on the filetype associated with the packet (which determines which type of model is used).

3. Selecting a Model

In some cases, a given filetype is specified within the header of the file (e.g., as a magic number appearing in the first seven bytes of the file itself). In such a scenario, threat engine 244 can select the appropriate model corresponding to the specified filetype (e.g., based on a table provided by security platform 122 that lists filetypes and corresponding models). In other cases, such as JavaScript, a magic number or other filetype identifier (if present in the header at all) may not indicate which classification model should be used. As one example, JavaScript would have a filetype of "text file." To identify filetypes such as JavaScript, decoder 402 can be used to perform deterministic finite automaton (DFA) pattern matching and apply heuristics (e.g., identifying <script> and other indicators that the file is JavaScript). The determined filetype and/or selected classification model are stored in the session state. The filetype associated with a session can be updated as the session progresses, as applicable. For example, in a stream of text, when a <script> tag is encountered, the JavaScript filetype can be assigned to the session. When the corresponding </script> is encountered, the filetype can be changed (e.g., back to plain text).
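The selection logic above can be sketched as follows. The "MZ" and "%PDF-" prefixes are the well-known magic numbers for PE and PDF files. The filetype-to-model table, the class name `SessionState`, and the tag-based heuristic are assumptions for illustration.

```python
# Well-known magic numbers (file prefixes) for two filetypes.
MAGIC_NUMBERS = [(b"MZ", "pe"), (b"%PDF-", "pdf")]

# Hypothetical filetype -> model table; which model serves which filetype
# is an assumption, standing in for a table provided by the platform.
MODEL_FOR_FILETYPE = {
    "pe": "gradient_boosting_tree",
    "pdf": "gradient_boosting_tree",
    "javascript": "linear",
}


def detect_filetype(first_bytes: bytes) -> str:
    """Return the filetype implied by the magic number, else plain text."""
    for magic, filetype in MAGIC_NUMBERS:
        if first_bytes.startswith(magic):
            return filetype
    return "text"


class SessionState:
    """Per-session record of the current filetype and selected model."""

    def __init__(self, first_bytes: bytes):
        self.filetype = detect_filetype(first_bytes)

    def observe(self, chunk: bytes) -> None:
        """Update the filetype from the last script tag seen in the chunk."""
        lowered = chunk.lower()
        open_at = lowered.rfind(b"<script")
        close_at = lowered.rfind(b"</script>")
        if open_at > close_at:
            self.filetype = "javascript"
        elif close_at > open_at:
            self.filetype = "text"

    @property
    def model(self):
        return MODEL_FOR_FILETYPE.get(self.filetype)
```

Storing the filetype on the session object means later packets in the same session do not need to repeat the detection unless a new tag changes it.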

4. Linear Classification Models

One way to express a linear model is by using the following linear equation:

$$\sum_{i=1}^{P} \beta_i x_i < C$$

where P is the total number of features, x_i is the i-th feature, β_i is the coefficient (weight) of feature x_i, and C is a threshold constant. In this example, C is the threshold for a verdict of maliciousness, meaning that if the summation for a given file is less than C, the file is assigned a benign verdict, and if the summation is greater than or equal to C, the file is assigned a malicious verdict.

One approach to using a linear classification model on data appliance 102 is as follows. A single float (d) is used to track the score of the incoming file, and a hash table is used to store the observed n-grams and their corresponding coefficients (i.e., x_i and β_i). For each incoming packet, each of the n-gram features (e.g., as provided by security platform 122) is checked. Whenever a match is found for a feature (x_i) in the hash table, the float (β_i) corresponding to that feature in the hash table is added (e.g., to d). When the end of the file is reached, a comparison of the single float (d) against the threshold value (C) is performed to determine a verdict for the file.
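The single-float approach can be sketched as below. The class name, the 8-byte window scan, and the example weights are hypothetical. For brevity the sketch ignores n-grams that straddle a packet boundary, which a real implementation would have to handle.

```python
N = 8  # n-gram length in bytes


class LinearSessionScorer:
    """Keeps one running float per session; the weight table is shared."""

    def __init__(self, weights, threshold):
        self.weights = weights      # hash table: n-gram bytes -> coefficient
        self.threshold = threshold  # C, the maliciousness threshold
        self.d = 0.0                # the only per-session state

    def scan_packet(self, packet: bytes) -> None:
        """Add the coefficient of every known n-gram found in the packet."""
        for offset in range(len(packet) - N + 1):
            beta = self.weights.get(packet[offset:offset + N])
            if beta is not None:
                self.d += beta

    def verdict(self) -> str:
        """Compare the running score against the threshold C."""
        return "malicious" if self.d >= self.threshold else "benign"


# Hypothetical weights, chosen to be exactly representable as floats.
WEIGHTS = {b"AAAAAAAA": 0.5, b"BBBBBBBB": 0.25}
```

A session would call `scan_packet` once per packet and `verdict` once at the end of the file (or at a checkpoint), so memory use does not grow with the number of features.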

For n-gram counting, feature x_i is equal to the number of times the i-th n-gram is observed. Suppose the i-th n-gram is observed four times for a particular file. 4*β_i can be rewritten as β_i + β_i + β_i + β_i. Instead of counting how many times the i-th n-gram is observed (i.e., four times) and then multiplying by β_i, an alternate approach is to add β_i each time the i-th n-gram is observed. Further, suppose the j-th n-gram is observed three times for the file. 3*β_j can similarly be written as β_j + β_j + β_j, with β_j added each time the j-th n-gram is observed, instead of counting how many times it is observed and then multiplying at the end.

To find Σ(β_i x_i), each of β_i x_i, β_j x_j, ... (where ... corresponds to all of the other features/weights) is added. With the counts from the example above, this can be rewritten as β_i + β_i + β_i + β_i + β_j + β_j + β_j + .... Because addition is commutative, the values can be added in any order (e.g., β_i + β_j + β_i + β_j + β_i + β_i + β_j + ...) and accumulated into a single float. Suppose float d starts at 0.0. Each time feature x_i is observed, β_i can be added to float d, and each time x_j is observed, β_j can be added to float d. This approach allows a single 4-byte float to serve as the entire per-session memory, in contrast to an approach in which per-session memory is proportional to the number of features, where the entire feature vector is stored in memory so that it can be multiplied by a weight vector. Using an example of 1,000 features, 4 bytes × 1,000 = 4 KB would be required for storage (compared to a single 4-byte float), which is 1,000 times more expensive.
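The equivalence argued above can be checked with a short sketch. The coefficient values are hypothetical and chosen to be exactly representable. In general, floating-point addition in a different order can differ by rounding error, so the comparison uses a tolerance.

```python
import math
import random
import struct


def dot_product_score(beta, counts):
    """Count-then-multiply: sum of beta_k * x_k over all features."""
    return sum(beta[name] * counts[name] for name in counts)


def streaming_score(beta, observations):
    """Add beta once per observation, in whatever order they arrive."""
    d = 0.0
    for name in observations:
        d += beta[name]
    return d


beta = {"i": 0.75, "j": -0.5}   # hypothetical coefficients
counts = {"i": 4, "j": 3}       # i seen four times, j seen three times
observations = ["i"] * 4 + ["j"] * 3
random.Random(0).shuffle(observations)  # arbitrary arrival order

# Per-session memory: one 4-byte float versus a 1,000-entry float vector.
FLOAT_BYTES = struct.calcsize("<f")
per_session_streaming = FLOAT_BYTES
per_session_vector = FLOAT_BYTES * 1000
```

Both functions produce the same score for the same observations, but only the streaming form can run with a single float of state.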

5. Non-Linear Classification Models

A variety of non-linear classification approaches can be used in conjunction with the techniques described herein. One example of a non-linear classification model is a gradient boosting tree. In this example, the feature vector is initialized to an all-zero vector. Unfortunately, for non-linear models (unlike linear models), the entire set of features whose presence is being detected (e.g., 1,000 features) persists for the entire duration of the session. While this is less efficient than the linear approach, some efficiency can still be gained by down-sampling the features to 1 byte each (0 to 255) rather than using a full 4-byte float (as might be used on a device that is not memory constrained).
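A one-byte-per-feature vector can be sketched with a `bytearray`. The text says only that features are down-sampled to the range 0 to 255; saturating at 255 instead of wrapping around is an assumption made here, and the class name is hypothetical.

```python
class ByteFeatureVector:
    """Per-session feature counts stored as one unsigned byte per feature."""

    def __init__(self, num_features: int):
        # bytearray(n) creates n zero bytes, i.e. an all-zero feature vector.
        self.counts = bytearray(num_features)

    def increment(self, index: int) -> None:
        """Count one more observation, saturating at 255 (an assumption)."""
        if self.counts[index] < 255:
            self.counts[index] += 1

    def memory_bytes(self) -> int:
        return len(self.counts)
```

With 1,000 features this uses 1,000 bytes per session, a quarter of what 4-byte floats would need.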

As data appliance 102 scans the file, each time a feature is observed, the value of that feature is incremented by one in the feature vector. Once the end of the file is reached (or feature observation otherwise ends), the constructed feature vector is fed into the gradient boosting tree model (e.g., as received from security platform 122). As will be described in more detail below, the non-linear classification model can be built using both n-gram (e.g., 8-gram) and non-n-gram features. One example of a non-n-gram feature is the purported size of the file (which can be read as a value out of the packet that contains the file's header). Any file data that appears after the purported end of the file (e.g., based on the file size specified in the header) is referred to as an overlay. In addition to serving as a feature, the purported file length can be used as a proxy for how long the file is expected to be. The non-linear classifier can be run against the packet stream of the file until the purported file length is reached, at which point a verdict can be formed for the file regardless of whether the actual end of the file has been reached. The fact that a given file includes an overlay is also an example of a feature that can be used as part of the non-linear classification model. In various embodiments, the overlay portion of the file is not analyzed; again, analysis can be completed before the actual end of the file. In other embodiments, feature extraction continues, and a maliciousness verdict is not formed until the actual end of the file is reached.
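Using the purported file length as a checkpoint might be sketched as follows, for the embodiments in which the overlay is not analyzed. The class name and its interface are hypothetical. The sketch only tracks how many bytes remain before the purported end and whether any overlay data has been seen.

```python
class PurportedLengthTracker:
    """Tracks progress toward the file size claimed in the file's header."""

    def __init__(self, purported_size: int):
        self.remaining = purported_size
        self.has_overlay = False  # can itself be used as a feature

    def feed(self, chunk: bytes) -> bytes:
        """Return only the part of the chunk before the purported end."""
        in_file = chunk[:self.remaining]
        if len(chunk) > self.remaining:
            self.has_overlay = True  # data past the purported end
        self.remaining -= len(in_file)
        return in_file

    @property
    def checkpoint_reached(self) -> bool:
        """True once the purported length has been consumed."""
        return self.remaining == 0
```

Once `checkpoint_reached` is true, the model can be run on the features collected so far, even if more packets for the file keep arriving.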

In an example embodiment, the tree model comprises 5,000 binary trees. Every node on each tree includes a feature and a corresponding threshold. An example of a portion of a tree is depicted in FIG. 5. In the example shown in FIG. 5, if the value for a feature (e.g., feature F4) is less than its threshold (e.g., 30), the left branch is taken (502). If the value for the feature is greater than or equal to the threshold, the right branch is taken (504). The tree is traversed until a leaf node is reached (e.g., node 506), which has an associated value (e.g., 0.7). The values of the leaves reached (one for each of the trees) are summed (rather than multiplied) to obtain a final score from which a verdict is computed. If the score is below a threshold, the file can be considered benign, and if it is at or above the threshold, the file can be considered malicious. The absence of multiplication in obtaining the final score helps make use of the model more efficient in the resource-constrained environment of data appliance 102.
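The traversal and summation described above can be sketched as follows. The `Node` layout, the two tiny trees, and all leaf values and thresholds other than "F4 < 30" are hypothetical. A real model would have thousands of much deeper trees.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    feature: int = 0                 # index into the feature vector
    threshold: float = 0.0           # go left if value < threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[float] = None    # set only on leaf nodes


def leaf_value(node: Node, features) -> float:
    """Walk one tree from the root until a leaf is reached."""
    while node.value is None:
        node = node.left if features[node.feature] < node.threshold else node.right
    return node.value


def score(trees, features) -> float:
    """Sum (not multiply) the leaf values reached in every tree."""
    return sum(leaf_value(tree, features) for tree in trees)


def verdict(trees, features, threshold: float) -> str:
    return "malicious" if score(trees, features) >= threshold else "benign"


# Two hypothetical trees; the first tests feature F4 against 30 as in FIG. 5.
TREES = [
    Node(feature=4, threshold=30, left=Node(value=0.5), right=Node(value=-0.25)),
    Node(feature=0, threshold=1, left=Node(value=-0.5), right=Node(value=0.25)),
]
```

Because the trees are read-only during scoring, one copy can be shared by every session while each session keeps only its own feature vector.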

In various embodiments, the trees themselves are fixed on data appliance 102 (until an updated model is received) and can be stored in shared memory that can be accessed by multiple sessions at the same time. The per-session cost is the cost of storing the session's feature vector, which can be zeroed out once analysis of the session is complete.

6. Example Process

FIG. 6 illustrates an example of a process for performing inline malware detection on a data appliance. In various embodiments, process 600 is performed by data appliance 102, and in particular by threat engine 244. Threat engine 244 can be implemented using a script (or set of scripts) authored in an appropriate scripting language (e.g., Python). Process 600 can also be performed on an endpoint, such as client device 110 (e.g., by an endpoint protection application executing on client device 110).

Process 600 begins at 602 when an indication that a file is being transmitted as part of a session is received by appliance 102. As one example of the processing performed at 602, for a given session, the associated protocol decoder can call or otherwise make use of an appropriate file-specific decoder when the start of a file is detected by the protocol decoder. As explained above, the filetype is determined (e.g., by decoder 402) and associated with the session (e.g., so that subsequent filetype analysis need not be performed until the filetype changes or file packets stop being transmitted).

At 604, n-gram analysis is performed on the sequence of received packets. As explained above, the n-gram analysis can be performed inline with other analysis being performed on the session by appliance 102. For example, while appliance 102 is performing analysis on a particular packet (e.g., to check for the presence of particular heuristics), it can also determine whether any 8-grams in the packet match the 8-grams provided by security platform 122. During the processing performed at 604, when an n-gram match is found, the corresponding pattern ID is used to look up an action based on the filetype. The action either increments a weighted counter (e.g., where the filetype is associated with a linear classifier) or updates a feature vector to account for the match (e.g., where the filetype is associated with a non-linear classifier).

The n-gram analysis continues packet by packet until either an end-of-file condition or a checkpoint is reached. At that point (606), the appropriate model is used to determine a verdict for the file (i.e., by comparing the final value obtained using the model against a maliciousness threshold). As mentioned above, the models incorporate n-gram features and can also incorporate other features (e.g., in the case of a non-linear classifier).

Finally, at 608, an action is taken in response to the determination made at 606. One example of a responsive action is terminating the session. Another example of a responsive action is allowing the session to continue, but preventing the file from being transmitted (and instead placing it in a quarantine area). In various embodiments, appliance 102 is configured to share its verdicts (whether benign verdicts, malicious verdicts, or both) with security platform 122. When security platform 122 completes its independent analysis of the file, it can use the verdict reported by appliance 102 for a variety of purposes, including assessing the performance of the model that formed the verdict.

An example threat signature for a sample is shown in FIG. 7B. In particular, for a sample with a SHA-256 hash of "4d73f42438fb5a857915219cdfa9cbb4ce3f771ffed93af81b0528931e4813f8", the first value in each pair corresponds to a feature and the second value corresponds to a count. In the example shown in FIG. 7B, features consisting of digits (e.g., feature "3905") correspond to n-gram features, and features consisting of "J" followed by a number (e.g., feature "J18") correspond to non-n-gram features.

In an example embodiment, security platform 122 is configured to target a specific false positive rate (e.g., 0.001) when generating models for use by devices such as data device 102. Accordingly, in some cases (e.g., one out of every 1,000 files), data device 102 may incorrectly determine that a benign file is malicious when performing inline analysis using a model in accordance with the techniques described herein. In such a scenario, if security platform 122 subsequently determines that the file is in fact benign, the file can be added to a whitelist so that it is not flagged as malicious again (e.g., by another device).
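
The text does not say how the target false positive rate is enforced. One common way, shown here purely as an assumption, is to score a held-out set of known-benign files and pick the maliciousness threshold so that at most the target fraction of them scores above it. The function name threshold_for_fpr and the verdict rule "malicious if score > threshold" are hypothetical.

```python
import math


def threshold_for_fpr(benign_scores, target_fpr):
    """Pick a threshold so that at most target_fpr of benign scores exceed it.

    Assumes the verdict rule is: malicious if score > threshold.
    """
    ranked = sorted(benign_scores, reverse=True)
    # Number of benign files allowed to score above the threshold.
    # The small epsilon guards against floating-point rounding.
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return -math.inf
    return ranked[allowed]
```

With 1,000 benign scores and a target of 0.001, exactly one benign file is allowed above the returned threshold, which matches the "one out of every 1,000 files" example.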

One approach to whitelisting is for security platform 122 to instruct device 102 to add the file to a whitelist stored on device 102. Another approach is for security platform 122 to inform whitelist system 154 of false positives, and for whitelist system 154 in turn to keep devices such as device 102 up to date with the latest false positive information. As previously mentioned, one issue with devices such as device 102 is that they are resource constrained. One approach to minimizing the resources used to maintain a whitelist on a device is to maintain the whitelist using a Least Recently Used (LRU) cache. The whitelist can comprise file hashes and can also be based on other elements, such as feature vectors or hashes of feature vectors.
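
A whitelist held in an LRU cache can be sketched with the standard library as follows. The class name LruWhitelist and its capacity parameter are hypothetical, and the keys could equally be file hashes or hashes of feature vectors, as the text notes.

```python
from collections import OrderedDict


class LruWhitelist:
    """Bounded whitelist that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._entries = OrderedDict()   # key order tracks recency of use

    def add(self, key: str) -> None:
        """Insert a key (or refresh it) and evict the oldest if over capacity."""
        self._entries[key] = True
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # drop least recently used

    def __contains__(self, key: str) -> bool:
        """A successful lookup counts as a use and refreshes the entry."""
        if key in self._entries:
            self._entries.move_to_end(key)
            return True
        return False
```

Because memory use is capped by the capacity, entries that stop being looked up age out on their own, which suits a resource-constrained device.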

VI. Building Models

Returning to the environment depicted in FIG. 1, as previously described, security platform 122 is configured to perform static and dynamic analysis on the samples it receives. Security platform 122 can receive samples for analysis from a variety of sources. As previously mentioned, one example type of sample source is a data device (e.g., data devices 102, 136, and 148). Other sources (e.g., one or more third-party providers of samples, such as other security device vendors, security researchers, etc.) can also be used as applicable. As will be described in more detail below, security platform 122 can use the corpus of samples it receives to build models (e.g., models that can then be used by device 102 in accordance with embodiments of the techniques described herein).

In various embodiments, static analysis engine 306 is configured to perform feature extraction on the samples it receives (e.g., while also performing other static analysis functions as described above). An example process for performing feature extraction (e.g., by security platform 122) is depicted in FIG. 8A. Process 800 begins at 802 when static analysis of a sample begins. During feature extraction (804), all 8-grams (or other applicable n-grams in embodiments where 8-grams are not used) are extracted from the sample being processed (e.g., sample 130 in FIG. 3). In particular, a histogram of the 8-grams in the sample being analyzed is extracted (e.g., into a hash table), indicating the number of times each 8-gram was observed in the sample. One benefit of extracting 8-grams during feature analysis by static analysis engine 306 is that the original file cannot be reconstructed from the resulting histogram, which can mitigate potential privacy and contractual problems when using samples obtained from third parties (e.g., when building models). The extracted histogram is stored at 806.
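
The histogram extraction at 804 can be sketched as a sliding window over the file's bytes. The function name ngram_histogram and the choice of hex strings as keys are illustrative assumptions.

```python
from collections import Counter


def ngram_histogram(data: bytes, n: int = 8) -> Counter:
    """Count every overlapping n-byte window in data, keyed by its hex string."""
    return Counter(data[i:i + n].hex() for i in range(len(data) - n + 1))
```

For example, `ngram_histogram(open(path, "rb").read())` yields a mapping from each 16-hex-digit 8-gram to its count. The histogram records which windows occurred and how often, not where they occurred.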

In various embodiments, static analysis engine 306 stores the extracted histogram for a given sample (e.g., represented using a hash table) in storage 142 (e.g., a Hadoop cluster), together with the histograms extracted from other samples. The data in Hadoop is compressed, and when operations are performed on the Hadoop data, the needed data is decompressed on the fly. An example hash table for a file (represented in JSON) is shown in FIG. 7A. Line 702 indicates the SHA-256 hash of the file. Line 704 indicates the UNIX time at which sample 130 arrived at security platform 122. Line 706 indicates the counts of n-grams in the overlay section (e.g., 'd00fbf4e088bc366':1 indicates that one instance of 'd00fbf4e088bc366' was found in the overlay section). Line 708 indicates the count of each of the 8-grams present in the file. Line 710 indicates that the file has an overlay. Line 712 indicates that the filetype of the file is ".exe". Line 714 indicates the UNIX time at which security platform 122 finished processing sample 130. Line 716 indicates the count of each of the non-8-gram features that the file hit. Finally, line 718 indicates that the file was determined to be malicious (e.g., by security platform 122).

In an example embodiment, the set of 8-gram histograms stored in the Hadoop cluster grows by approximately three terabytes of 8-gram histogram data each day. The histograms correspond to both malicious and benign samples, and are labeled as such (e.g., based on the results of the other static and dynamic analyses performed by security platform 122, as described above).

The histogram of 8-grams extracted from a sample being analyzed will be approximately 10% larger than the file itself, and a typical sample will have a histogram containing approximately one million distinct 8-grams. The total number of distinct possible 8-grams is 2⁶⁴. By contrast, as mentioned above, the classification models sent by security platform 122 to devices such as data device 102 (e.g., as part of a subscription) will, in various embodiments, include only a few thousand features (e.g., 1,000 features). One example way to reduce a set of potentially 2⁶⁴ features to the 1,000 most important features for use in a model is to use a mutual information technique. Other approaches can also be used as applicable (e.g., a chi-squared score). The four parameters needed are the number of malicious samples having a given feature, the number of benign samples having the given feature, the total number of malicious samples, and the total number of benign samples. One benefit of mutual information is that it can be used efficiently on very large data sets. In Hadoop, the mutual information approach can be performed in a single pass (i.e., over all of the 8-gram histograms stored in the Hadoop cluster dataset for a given filetype) by distributing the task across multiple mappers, each of which is responsible for handling particular features. The features with the highest mutual information can be selected as the set of features most indicative of maliciousness and/or most indicative of benignness, as applicable. The resulting 1,000 features can then be used to build models (e.g., linear classification models and non-linear classification models), as applicable. For example, to build a linear classification model, model builder 152 (implemented using a set of open source tools and/or scripts written in an appropriate language, such as Python) stores the top 1,000 features and applicable weights as the set of n-gram features to be checked by device 102 (e.g., as described in Section V.A.4 above).
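
The mutual information between a feature's presence and the malicious/benign label can be computed from exactly the four counts listed above. The sketch below uses the standard definition, summing p(x,y)·log2(p(x,y) / (p(x)·p(y))) over the four presence/label combinations. The function names mutual_information and top_features are hypothetical, and the sketch runs on one machine rather than across Hadoop mappers.

```python
import heapq
import math


def mutual_information(mal_with, ben_with, mal_total, ben_total):
    """Mutual information (in bits) between feature presence and label."""
    total = mal_total + ben_total
    with_feat = mal_with + ben_with
    without_feat = total - with_feat
    # (joint count, feature marginal count, label marginal count)
    cells = [
        (mal_with, with_feat, mal_total),
        (ben_with, with_feat, ben_total),
        (mal_total - mal_with, without_feat, mal_total),
        (ben_total - ben_with, without_feat, ben_total),
    ]
    mi = 0.0
    for joint, feat_margin, label_margin in cells:
        if joint > 0:   # an empty cell contributes nothing
            mi += (joint / total) * math.log2(
                joint * total / (feat_margin * label_margin))
    return mi


def top_features(counts, mal_total, ben_total, k):
    """counts maps feature -> (mal_with, ben_with); return the k best features."""
    return heapq.nlargest(
        k, counts,
        key=lambda f: mutual_information(*counts[f], mal_total, ben_total))
```

Because each feature's score depends only on its own two counts plus the two global totals, features can be scored independently, which is consistent with the single-pass, per-feature distribution described in the text.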

In some embodiments, a non-linear classification model is also built by model builder 152 using the top 1,000 (or another desired number of) features. In other embodiments, the non-linear classification model is built mostly from the top n-gram features (e.g., 950 of them) but also incorporates other, non-n-gram features (e.g., 50 such features) that can likewise be detected during packet-by-packet feature extraction and analysis. Some examples of non-n-gram features that can be incorporated into a non-linear classification model include: (1) the size of the header, (2) the presence or absence of a checksum in the file, (3) the number of sections in the file, (4) the intended length of the file (as indicated in the header of a PE file), (5) whether the file includes an overlay portion, and (6) whether the file requires the Windows EFI subsystem to execute the PE.

In some embodiments, rather than using mutual information to select the top 1,000 features directly, a larger set of features (an over-generated set of features) is determined. As an example, the top 5,000 features can initially be selected using mutual information. That set of 5,000 features can then be used as input to a conventional feature selection technique (e.g., bagging) that does not scale well to very large datasets (e.g., the entire Hadoop dataset) but is more effective on a reduced set (e.g., 5,000 features). The conventional feature selection technique can be used to select the final 1,000 features from the set of 5,000 features identified using mutual information.

Once the final 1,000 features have been selected, an example way to build a non-linear model is to use an open source tool such as scikit-learn or XGBoost. Parameter tuning, such as by using cross-validation, can be performed where applicable.

An example process for generating a model is depicted in FIG. 8B. In various embodiments, process 850 is performed by security platform 122. Process 850 begins at 852 when a set of extracted features (e.g., including n-gram features) is received. One example way the set of features can be received is by reading the features stored as a result of process 800. At 854, a reduced set of features is determined from the features received at 852. As described above, one example way to determine the reduced set of features is by using mutual information. Other approaches (e.g., a chi-squared score) can also be used. In addition, as also described above, a combination of techniques can be used at 852/854, such as selecting an initial set of features using mutual information and then refining the initial set using bagging or another appropriate technique. Finally, once the features have been selected (e.g., at 854), an appropriate model is built at 856 (e.g., using open source or other tools and, where applicable, performing parameter tuning), as also described above. The models (e.g., generated by model builder 152 using process 850) can be sent to data device 102 and other applicable recipients (e.g., data devices 136 and 148), for example as part of a subscription service.

In various embodiments, model builder 152 generates models (e.g., linear and non-linear classification models) on a daily (or other applicable) basis. By performing process 850 or otherwise generating models periodically, security platform 122 can help ensure that the models used by devices such as device 102 detect the most current types of malware threats (e.g., those most recently deployed by unscrupulous individuals).

Whenever a newly generated model is determined to be better than an existing model (e.g., as determined based on a set of quality evaluation metrics exceeding a threshold), the updated model can be sent to data devices such as data device 102. In some cases, such an update adjusts the weights assigned to features. Updates of this kind can easily be deployed to and adopted by devices (e.g., as real-time updates). In other cases, an update adjusts the features themselves. Updates of this kind can be more complicated to deploy because they may require patches to components of the device, such as the decoder. One benefit of using overtraining during model generation is that it can take into account whether the decoder is able to detect particular features.

In various embodiments, devices are required (e.g., by security platform 122) to deploy updates to models as they are received. In other embodiments, devices are allowed to deploy updates selectively (at least for a period of time). As one example, when a new model is received by device 102, both the existing model and the new model can be run on device 102 in parallel for a period of time (e.g., with the existing model used in production and the new model reporting the actions it would take without actually taking them). An administrator of the device can indicate whether the existing model or the new model should be used to process traffic on the device (e.g., based on which model performs better). In various embodiments, device 102 provides telemetry back to security platform 122 indicating information such as which model(s) are running on device 102 and how effective the model(s) are (e.g., false positive statistics).
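
Running an existing model and a new model side by side can be sketched as below. The function name process_file, the representation of a model as a callable that returns True for a malicious verdict, and the log format are all hypothetical.

```python
def process_file(features, production_model, candidate_model, log):
    """Enforce the production verdict; only record what the candidate would do.

    Each model is assumed to be a callable returning True for "malicious".
    """
    verdict = production_model(features)
    shadow_verdict = candidate_model(features)   # computed but never enforced
    if shadow_verdict != verdict:
        # Disagreements are what an administrator (or telemetry) would review.
        log.append({"production": verdict, "candidate": shadow_verdict})
    return verdict
```

Logging only disagreements keeps the record small while still showing where the new model would have acted differently.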

Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

Claims

A system, comprising:
a processor configured to:
store, on a networking device, a set comprising one or more sample classification models;
perform n-gram analysis on a sequence of received packets associated with a received file, wherein performing the n-gram analysis includes using at least one stored sample classification model;
determine whether the received file is malicious based at least in part on the n-gram analysis of the sequence of received packets; and
in response to determining that the file is malicious, prevent propagation of the received file; and
a memory coupled to the processor and configured to provide the processor with instructions.

The system of claim 1, wherein the processor is configured to perform the n-gram analysis at least in part by comparing n-grams of a received packet against a predetermined list of n-grams.

The system of claim 2, wherein the predetermined list of n-grams is generated using a plurality of previously collected malware samples.

The system of claim 1, wherein the processor is further configured to determine a filetype associated with the file.

The system of claim 4, wherein the processor is configured to select a linear classification model from the set of one or more sample classification models based on the determined filetype associated with the file.

The system of claim 5, wherein performing the n-gram analysis comprises accumulating a set of weights corresponding to observed n-grams.

The system of claim 6, wherein the weights are accumulated into a single float value.

The system of claim 4, wherein the processor is configured to select a non-linear classification model from the set of one or more sample classification models based on the determined filetype associated with the file.

The system of claim 8, wherein the non-linear classification model includes n-gram features and non-n-gram features.

The system of claim 9, wherein at least one non-n-gram feature is associated with a file size.

The system of claim 9, wherein at least one non-n-gram feature is associated with the presence of an overlay.

The system of claim 8, wherein performing the n-gram analysis comprises updating a value for a feature in a feature vector whenever the feature is matched.

The system of claim 1, wherein using the at least one stored sample classification model comprises running a non-linear classifier on the packet stream until an intended file length is reached.

The system of claim 13, wherein the intended file length is not an actual file length, and wherein a verdict is determined before the actual end of the file is reached.

The system of claim 1, wherein the processor is further configured to receive at least one updated classification model.

The system of claim 1, wherein the n-gram analysis is performed inline with other packet analyses as a single-pass analysis of a traffic stream.

The system of claim 1, wherein the processor is further configured to use a set of whitelisted n-grams when performing the n-gram analysis.

The system of claim 1, wherein the processor is further configured to send a copy of the received file to a security platform and to perform the n-gram analysis while awaiting a verdict from the security platform.

A method, comprising:
storing, on a networking device, a set comprising one or more sample classification models;
performing n-gram analysis on a sequence of received packets associated with a received file, wherein performing the n-gram analysis includes using at least one stored sample classification model;
determining that the received file is malicious based at least in part on the n-gram analysis of the sequence of received packets; and
in response to determining that the file is malicious, preventing propagation of the received file.

A computer program product embodied in a tangible computer-readable storage medium and comprising computer instructions for:
storing, on a networking device, a set comprising one or more sample classification models;
performing n-gram analysis on a sequence of received packets associated with a received file, wherein performing the n-gram analysis includes using at least one stored sample classification model;
determining that the received file is malicious based at least in part on the n-gram analysis of the sequence of received packets; and
in response to determining that the file is malicious, preventing propagation of the received file.

A system, comprising:
a processor configured to:
receive a set of features, including a plurality of n-grams, extracted from a set of files;
determine a reduced set of features that includes at least some of the plurality of n-grams; and
use the reduced set of features to generate a model usable by a data device to perform inline malware analysis; and
a memory coupled to the processor and configured to provide the processor with instructions.

The system of claim 1, wherein the set of features includes features extracted from a set of known malicious files.

The system of claim 1, wherein the set of features includes features extracted from a set of known benign files.

The system of claim 1, wherein the reduced set of features is determined using mutual information.

The system of claim 1, wherein the reduced set of features is determined using a chi-squared score.

The system of claim 1, wherein the generated model comprises n-gram features.

The system of claim 6, wherein the generated model further comprises non-n-gram features.

The system of claim 7, wherein at least one non-n-gram feature is associated with a file size.

The system of claim 7, wherein at least one non-n-gram feature is associated with a header size.

The system of claim 7, wherein at least one non-n-gram feature is associated with at least one of the presence or the absence of a checksum in the file.

The system of claim 7, wherein at least one non-n-gram feature is associated with a number of sections in the file.

The system of claim 7, wherein at least one non-n-gram feature is associated with an intended length of the file.

The system of claim 7, wherein at least one non-n-gram feature is associated with whether the file includes an overlay.

The system of claim 1, wherein the model is a linear model.

The system of claim 1, wherein the model is a non-linear model.

The system of claim 1, wherein the plurality of n-grams is extracted during static analysis of the set of files.

The system of claim 1, wherein the model is transmitted to a first data device.

The system of claim 17, wherein, in response to a false positive result reported by a second data device, the processor is further configured to generate an updated model and send the updated model to the first data device.

A method, comprising:
receiving a set of features, including a plurality of n-grams, extracted from a set of files;
determining a reduced set of features that includes at least some of the plurality of n-grams; and
using the reduced set of features to generate a model usable by a data device to perform inline malware analysis.

A computer program product embodied in a tangible computer-readable storage medium and comprising computer instructions for:
receiving a set of features, including a plurality of n-grams, extracted from a set of files;
determining a reduced set of features that includes at least some of the plurality of n-grams; and
using the reduced set of features to generate a model usable by a data device to perform inline malware analysis.