
WO2020094664A1 - Network interface device - Google Patents

Network interface device

Info

Publication number
WO2020094664A1
WO2020094664A1 (application PCT/EP2019/080281)
Authority
WO
WIPO (PCT)
Prior art keywords
processing
function
network interface
interface device
data
Prior art date
Application number
PCT/EP2019/080281
Other languages
English (en)
Inventor
Steven Pope
Neil Turton
David Riddoch
Dmitri Kitariev
Ripduman Sohan
Derek Roberts
Original Assignee
Xilinx, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/180,883 external-priority patent/US11012411B2/en
Priority claimed from US16/395,027 external-priority patent/US11082364B2/en
Application filed by Xilinx, Inc. filed Critical Xilinx, Inc.
Priority to JP2021523691A priority Critical patent/JP2022512879A/ja
Priority to EP19798619.3A priority patent/EP3877851A1/fr
Priority to KR1020217017269A priority patent/KR20210088652A/ko
Priority to CN201980087757.XA priority patent/CN113272793A/zh
Publication of WO2020094664A1 publication Critical patent/WO2020094664A1/fr
Priority to JP2024083450A priority patent/JP2024116163A/ja

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30076Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • G06F9/30079Pipeline control instructions, e.g. multicycle NOP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/74Address processing for routing
    • H04L45/742Route cache; Operation thereof
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00Packet switching elements
    • H04L49/15Interconnection of switching modules
    • H04L49/1515Non-blocking multistage, e.g. Clos
    • H04L49/1546Non-blocking multistage, e.g. Clos using pipelined operation

Definitions

  • This application relates to network interface devices for performing a function with respect to data packets.
  • Network interface devices are known and are typically used to provide an interface between a computing device and a network.
  • the network interface device can be configured to process data which is received from the network and/or process data which is to be put on the network.
  • a network interface device for interfacing a host device to a network
  • the network interface device comprising: a first interface, the first interface being configured to receive a plurality of data packets; a configurable hardware module comprising a plurality of processing units, each processing unit being associated with a predefined type of operation executable in a single step, wherein at least some of said plurality of processing units are associated with different predefined types of operation, wherein the hardware module is configurable to interconnect at least some of said plurality of said processing units to provide a first data processing pipeline for processing one or more of said plurality of data packets to perform a first function with respect to said one or more of said plurality of data packets.
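The configurable hardware module described above — many processing units, each bound to one predefined single-step operation, interconnected into an ordered pipeline — can be modelled in software. The following Python sketch is illustrative only; the unit names and the example rule are invented, not taken from the patent:

```python
# Minimal software model of the configurable hardware module: each
# processing unit is bound to one predefined type of operation
# executable in a single step, and units are interconnected into an
# ordered pipeline. Unit names and the example rule are hypothetical.

class ProcessingUnit:
    def __init__(self, op_type, op):
        self.op_type = op_type   # the unit's predefined operation type
        self.op = op             # callable: (packet, metadata) -> metadata

    def step(self, packet, metadata):
        # one predefined operation, performed in a single step
        return self.op(packet, metadata)

def build_pipeline(units):
    """Interconnect units so each feeds its result to the next."""
    def run(packet):
        metadata = {}
        for unit in units:
            metadata = unit.step(packet, metadata)
        return metadata
    return run

load = ProcessingUnit("load", lambda pkt, md: {**md, "dst": pkt["dst"]})
lookup = ProcessingUnit(
    "lookup",
    lambda pkt, md: {**md, "action": "drop" if md["dst"] == 80 else "pass"})
pipeline = build_pipeline([load, lookup])
print(pipeline({"dst": 80}))   # {'dst': 80, 'action': 'drop'}
```

Reconfiguring the module for a different function corresponds to calling `build_pipeline` with a different selection and ordering of units.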
  • the first function comprises a filtering function. In some embodiments, the first function comprises at least one of a tunnelling, encapsulation, and routing function. In some embodiments, the first function comprises an extended Berkeley packet filter function.
  • the first function comprises a distributed denial of service scrubbing operation.
  • the first function comprises a firewall operation.
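As a concrete illustration of the filtering-style first functions enumerated above (firewall, DDoS scrubbing), here is a hedged Python sketch; the rule format and the specific rules are invented for the example:

```python
# Illustrative packet-filtering function of the kind enumerated above
# (firewall / DDoS-scrubbing style). The rule format and the specific
# rules are invented for the example.

DENY_RULES = [
    {"proto": "udp", "dst_port": 53},   # e.g. scrub a DNS flood
    {"proto": "tcp", "dst_port": 23},   # e.g. block telnet
]

def filter_packet(packet):
    """Return 'drop' if the packet matches any deny rule, else 'pass'."""
    for rule in DENY_RULES:
        if all(packet.get(k) == v for k, v in rule.items()):
            return "drop"
    return "pass"

print(filter_packet({"proto": "udp", "dst_port": 53}))    # drop
print(filter_packet({"proto": "tcp", "dst_port": 443}))   # pass
```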
  • the first interface is configured to receive the first data packet from the network.
  • the first interface is configured to receive the first data packet from the host device.
  • two or more of the at least some of the plurality of processing units are configured to perform their associated at least one predefined operation in parallel. In some embodiments, two or more of the at least some of the plurality of processing units are configured to perform their associated predefined type of operation according to a common clock signal of the hardware module.
  • each of two or more of the at least some of the plurality of processing units is configured to perform its associated predefined type of operation within a predefined length of time defined by a clock signal.
  • two or more of the at least some of the plurality of processing units are configured to: access the first data packet within a time period of the predefined length of time; and in response to the end of the predefined length of time, transfer results of the respective at least one operation to a next processing unit.
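The clocked behaviour above — each unit completes its operation within the clock period, then transfers its result to the next unit at the cycle boundary — can be sketched as a cycle-driven simulation. The two-stage example operations are illustrative, not from the patent:

```python
# Cycle-driven sketch of the clocked pipeline described above: every
# unit completes its operation within one clock period, then hands its
# result to the next unit at the cycle boundary. The two example
# operations are illustrative.

def simulate(pipeline_ops, packets, cycles):
    """pipeline_ops: one callable per stage; returns the values that
    have left the last stage after `cycles` clock ticks."""
    stages = [None] * len(pipeline_ops)   # value held by each stage
    completed, pending = [], list(packets)
    for _ in range(cycles):
        if stages[-1] is not None:        # result leaves the pipeline
            completed.append(stages[-1])
        # transfer results one stage forward, last stage first
        for i in range(len(stages) - 1, 0, -1):
            prev = stages[i - 1]
            stages[i] = pipeline_ops[i](prev) if prev is not None else None
        stages[0] = pipeline_ops[0](pending.pop(0)) if pending else None
    return completed

# Two units: add-one, then double. Each packet spends one cycle per stage.
print(simulate([lambda x: x + 1, lambda x: x * 2], [1, 2], 4))  # [4, 6]
```

Because every stage works simultaneously on a different packet, a new packet can enter the pipeline every cycle even though each packet traverses all stages.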
  • the results comprise at least one of: at least one value from the one or more of the plurality of data packets; updates to map state; and metadata.
  • each of the plurality of processing units comprises an application specific integrated circuit configured to perform the at least one operation associated with the respective processing unit.
  • each of the processing units comprises a field programmable gate array. In some embodiments, each of the processing units comprises any other type of soft logic.
  • At least one of the plurality of processing units comprises a digital circuit and a memory storing state related to processing carried out by the digital circuit, wherein the digital circuit is configured to, in communication with the memory, perform the predefined type of operation associated with the respective processing unit.
  • the network interface device comprises a memory accessible to two or more of the plurality of processing units, wherein the memory is configured to store state associated with a first data packet, wherein during performance of the first function by the hardware module, two or more of the plurality of processing units are configured to access and modify the state.
  • a first of the at least some of the plurality of processing units is configured to stall during access of a value of the state by a second of the plurality of processing units.
  • one or more of the plurality of processing units are individually configurable to, based on their associated predefined type of operation, perform an operation specific to a respective pipeline.
  • the hardware module is configured to receive an instruction, and in response to said instruction, at least one of: interconnect at least some of said plurality of said processing units to provide a data processing pipeline for processing one or more of said plurality of data packets; cause one or more of said plurality of processing units to perform their associated predefined type of operation with respect to said one or more data packets; add one or more of said plurality of processing units into a data processing pipeline; and remove one or more of said plurality of processing units from a data processing pipeline.
  • the predefined operation comprises at least one of: loading at least one value of the first data packet from a memory; storing at least one value of a data packet in a memory; and performing a look up into a look up table to determine an action to be carried out with respect to a data packet.
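The three predefined operation types just listed — load a value from a packet, store a value into a packet, and look a key up in a table to select an action — can be sketched as follows. The table contents and field names are invented for the example:

```python
# Sketch of the three predefined operation types listed above: load a
# value from a packet, look it up in a table to select an action, and
# store a value back into the packet. Table contents are invented.

LOOKUP_TABLE = {80: "count", 443: "mirror"}   # key -> action

def load_field(packet, field):
    return packet[field]

def table_lookup(table, key, default="forward"):
    return table.get(key, default)

def store_field(packet, field, value):
    packet[field] = value
    return packet

pkt = {"dst_port": 443}
action = table_lookup(LOOKUP_TABLE, load_field(pkt, "dst_port"))
store_field(pkt, "action", action)
print(pkt)   # {'dst_port': 443, 'action': 'mirror'}
```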
  • the hardware module is configured to receive an instruction, wherein the hardware module is configurable to, in response to said instruction, interconnect at least some of said plurality of said processing units to provide a data processing pipeline for processing one or more of said plurality of data packets, wherein the instruction comprises a data packet sent through the third processing pipeline.
  • one or more of the at least some of the plurality of processing units are configurable to, in response to said instruction, perform a selected operation of their associated predefined type of operation with respect to said one or more of the plurality of data packets.
  • the plurality of components comprises a second of the plurality of components configured to provide the first function in circuitry different to the hardware module, wherein the network interface device comprises at least one controller configured to cause data packets passing through the processing pipeline to be processed by one of: the first of the plurality of components and the second of the plurality of components.
  • the network interface device comprises at least one controller configured to issue an instruction to cause the hardware module to begin performing the first function with respect to data packets, wherein the instruction is configured to cause the first of the plurality of components to be inserted into the processing pipeline.
  • the network interface device comprises at least one controller configured to issue an instruction to cause the hardware module to begin performing the first function with respect to data packets, wherein the instruction comprises a control message sent through the processing pipeline and configured to cause the first of the plurality of components to be activated.
  • the associated at least one operation comprises at least one of: loading at least one value of the first data packet from a memory of the network interface device; storing at least one value of the first data packet in a memory of the network interface device; and performing a look up into a look up table to determine an action to be carried out with respect to the first data packet.
  • one or more of the at least some of the plurality of processing units is configured to pass at least one result of its associated at least one predefined operation to a next processing unit in the first processing pipeline, the next processing unit being configured to perform a next predefined operation in dependence upon the at least one result.
  • each of the different predefined types of operation is defined by a different template.
  • the types of predefined operation comprise at least one of: accessing a data packet; accessing a lookup table stored in a memory of the hardware module; performing logic operations on data loaded from a data packet; and performing logic operations on data loaded from the lookup table.
  • the hardware module comprises routing hardware, wherein the hardware module is configurable to interconnect at least some of said plurality of said processing units to provide the first data processing pipeline by configuring the routing hardware to route data packets between the plurality of processing units in a particular order defined by the first data processing pipeline.
  • the hardware module is configurable to interconnect at least some of said plurality of said processing units to provide a second data processing pipeline for processing one or more of said plurality of data packets to perform a second function different to the first function.
  • the hardware module is configurable to interconnect at least some of said plurality of said processing units to provide a second data processing pipeline after interconnecting at least some of the plurality of said processing units to provide the first data processing pipeline.
  • the network interface device comprises further circuitry separate to the hardware module and configured to perform the first function for one or more of said plurality of data packets.
  • the further circuitry comprises at least one of: a field programmable gate array; and a plurality of central processing units.
  • the network interface device comprises at least one controller, wherein the further circuitry is configured to perform the first function with respect to data packets during a compilation process for the first function to be performed in the hardware module, wherein the at least one controller is configured to, in response to completion of the compilation process, control the hardware module to begin performing the first function with respect to data packets.
  • the further circuitry comprises a plurality of central processing units.
  • the at least one controller is configured to, in response to said determination that the compilation process for the first function to be performed in the hardware module is complete, control the further circuitry to cease performing the first function with respect to data packets.
  • the network interface device comprises at least one controller, wherein the hardware module is configured to perform the first function with respect to data packets during a compilation process for the first function to be performed in the further circuitry, wherein the at least one controller is configured to determine that the compilation process for the first function to be performed in the further circuitry is complete and, in response to said determination, control the further circuitry to begin performing the first function with respect to data packets.
  • the further circuitry comprises a field programmable gate array.
  • the at least one controller is configured to, in response to said determination that the compilation process for the first function to be performed in the further circuitry is complete, control the hardware module to cease performing the first function with respect to data packets.
  • the network interface device comprises at least one controller configured to perform a compilation process to provide the first function to be performed in the hardware module.
  • the compilation process comprises providing instructions to provide a control plane interface in the hardware module that responds to control messages.
  • a data processing system comprising the network interface device according to the first aspect and the host device, wherein the data processing system comprises at least one controller configured to perform a compilation process to provide the first function to be performed in the hardware module.
  • the at least one controller is provided by one or more of: the network interface device; and the host device.
  • the compilation process is performed in response to a determination by the at least one controller that a computer program expressing the first function is safe for execution in kernel mode of the host device.
  • the at least one controller is configured to perform the compilation process by assigning each of the at least some of the plurality of processing units to perform in a particular order of the first data processing pipeline, at least one operation from a plurality of operations expressed by a sequence of computer code instructions, wherein the plurality of operations provides the first function with respect to the one or more of the plurality of data packets.
  • the at least one controller is configured to: prior to completion of the compilation process, send a first instruction to cause a further circuitry of the network interface device to perform the first function with respect to data packets; and send a second instruction to cause the hardware module to, following completion of the compilation process, begin performing the first function with respect to data packets.
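The two-instruction scheme above — run the function in quickly-deployed further circuitry while the hardware module is still compiling, then hand over on completion — can be sketched as a small controller. All names here are illustrative:

```python
# Sketch of the compile-then-hand-off scheme above: fast-to-deploy
# fallback circuitry (e.g. CPUs) handles packets while the slower
# compilation for the hardware module runs; on completion, the
# controller switches processing over. All names are illustrative.

class Controller:
    def __init__(self, fallback_fn):
        self.active_fn = fallback_fn   # first instruction: use fallback
        self.compiled = False

    def on_compile_complete(self, hw_fn):
        # second instruction: hand processing to the hardware module
        self.active_fn = hw_fn
        self.compiled = True

    def process(self, packet):
        return self.active_fn(packet)

ctrl = Controller(fallback_fn=lambda p: ("cpu", p))
r0 = ctrl.process("pkt0")                            # ('cpu', 'pkt0')
ctrl.on_compile_complete(hw_fn=lambda p: ("hw", p))
r1 = ctrl.process("pkt1")                            # ('hw', 'pkt1')
```

The same shape covers the reverse hand-off described elsewhere in the document (hardware module first, FPGA after its longer compilation): only which function is installed first changes.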
  • a method for implementation in a network interface device comprising: receiving, at a first interface, a plurality of data packets; and configuring a hardware module to interconnect at least some of a plurality of processing units of the hardware module so as to provide a first data processing pipeline for processing one or more of said plurality of data packets to perform a first function with respect to said one or more of said plurality of data packets, wherein each processing unit is associated with a predefined type of operation executable in a single step, wherein at least some of said plurality of processing units are associated with different predefined types of operation.
  • a non-transitory computer readable medium comprising program instructions for causing a network interface device to perform a method comprising: receiving, at a first interface, a plurality of data packets; and configuring a hardware module to interconnect at least some of a plurality of processing units of the hardware module so as to provide a first data processing pipeline for processing one or more of said plurality of data packets to perform a first function with respect to said one or more of said plurality of data packets, wherein each processing unit is associated with a predefined type of operation executable in a single step, wherein at least some of said plurality of processing units are associated with different predefined types of operation.
  • a processing unit configured to: perform at least one predefined operation with respect to a first data packet received at a network interface device; be connected to a first further processing unit configured to perform a first further at least one predefined operation with respect to the first data packet; be connected to a second further processing unit configured to perform a second further at least one predefined operation with respect to the first data packet; receive from the first further processing unit, results of the first further at least one predefined operation; perform the at least one predefined operation in dependence upon the results of the first further at least one predefined operation; send results of the at least one predefined operation to the second further processing unit for processing in the second further at least one predefined operation.
  • the processing unit is configured to receive a clock signal for timing the at least one predefined operation, wherein the processing unit is configured to perform the at least one predefined operation in at least one cycle of the clock signal.
  • the processing unit is configured to perform the at least one predefined operation in a single cycle of the clock signal.
  • the at least one predefined operation, the first further at least one predefined operation, and the second further at least one predefined operation form part of a function performed with respect to a first data packet received at the network interface device.
  • the first data packet is received from a host device, wherein the network interface device is configured to interface the host device to a network.
  • the first data packet is received from a network, wherein the network interface device is configured to interface a host device to the network.
  • the function is a filtering function.
  • the filtering function is an extended Berkeley packet filter function.
  • the processing unit comprises an application specific integrated circuit configured to perform the at least one predefined operation.
  • the processing unit comprises: a digital circuit configured to perform the at least one predefined operation; and a memory storing state related to the at least one predefined operation carried out.
  • the processing unit is configured to access a memory accessible to the first further processing unit and the second further processing unit, wherein the memory is configured to store state associated with the first data packet, wherein the at least one predefined operation comprises modifying the state stored in the memory.
  • the processing unit is configured during a first clock cycle to read a value of said state from the memory and provide said value to the second further processing unit for modification by the second further processing unit, wherein the processing unit is configured during a second clock cycle following the first clock cycle to stall.
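The stall just described — read the shared value in one clock cycle, hand it to the next unit, then stall while that unit modifies the state — can be sketched as follows. The behaviour and names are illustrative:

```python
# Sketch of the stall described above: in clock cycle 1 the processing
# unit reads the shared state and hands the value to the next unit; in
# cycle 2 it stalls while that unit modifies the state, avoiding a
# read/write conflict. Behaviour and names are illustrative.

shared_state = {"counter": 0}
trace = []

def unit_a(cycle):
    if cycle == 1:
        trace.append(("A", "read", shared_state["counter"]))
        return shared_state["counter"]
    trace.append(("A", "stall", None))   # cycle 2: A stalls
    return None

def unit_b(cycle, value):
    if cycle == 2 and value is not None:
        shared_state["counter"] = value + 1
        trace.append(("B", "write", shared_state["counter"]))

val = unit_a(1)   # cycle 1: A reads the value
unit_b(1, None)   # B idle in cycle 1
unit_a(2)         # cycle 2: A stalls
unit_b(2, val)    # cycle 2: B modifies the shared state
print(shared_state, trace)
```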
  • the at least one predefined operation comprises at least one of: loading the first data packet from a memory of the network interface device; storing the first data packet in a memory of the network interface device; and performing a look up into a look up table to determine an action to be carried out with respect to the first data packet.
  • a method implemented in a processing unit comprising: performing at least one predefined operation with respect to a first data packet received at a network interface device; connecting to a first further processing unit configured to perform a first further at least one predefined operation with respect to the first data packet; connecting to a second further processing unit configured to perform a second further at least one predefined operation with respect to the first data packet; receiving from the first further processing unit, results of the first further at least one predefined operation; performing the at least one predefined operation in dependence upon the results of the first further at least one predefined operation; and sending results of the at least one predefined operation to the second further processing unit for processing in the second further at least one predefined operation.
  • a computer readable non-transitory storage device storing instructions that, when executed by a processing unit, cause the processing unit to perform a method comprising: performing at least one predefined operation with respect to a first data packet received at a network interface device; connecting to a first further processing unit configured to perform a first further at least one predefined operation with respect to the first data packet; connecting to a second further processing unit configured to perform a second further at least one predefined operation with respect to the first data packet; receiving from the first further processing unit, results of the first further at least one predefined operation; performing the at least one predefined operation in dependence upon the results of the first further at least one predefined operation; and sending results of the at least one predefined operation to the second further processing unit for processing in the second further at least one predefined operation.
  • a network interface device for interfacing a host device to a network
  • the network interface device comprising: at least one controller; a first interface, the first interface being configured to receive data packets; first circuitry configured to perform a first function with respect to data packets received at the first interface; and second circuitry, wherein the first circuitry is configured to perform the first function with respect to data packets received at the first interface during a compilation process for the first function to be performed in the second circuitry, wherein the at least one controller is configured to determine that the compilation process for the first function to be performed in the second circuitry is complete and, in response to said determination, control the second circuitry to begin performing the first function with respect to data packets received at the first interface.
  • the at least one controller is configured to, in response to said determination that the compilation process for the first function to be performed in the second circuitry is complete, control the first circuitry to cease performing the first function with respect to data packets received at the first interface.
  • the at least one controller is configured to, in response to said determination that the compilation process for the first function to be performed in the second circuitry is complete: control the second circuitry to begin performing the first function with respect to data packets of a first data flow received at the first interface; and control the first circuitry to cease performing the first function with respect to data packets of the first data flow.
  • the first circuitry comprises at least one central processing unit, wherein each of the at least one central processing unit is configured to perform the first function with respect to at least one data packet received at the first interface.
  • the second circuitry comprises a field programmable gate array configured to begin performing the first function with respect to data packets received at the first interface.
  • the second circuitry comprises a hardware module comprising a plurality of processing units, each processing unit being associated with at least one predefined operation, wherein the first interface is configured to receive a first data packet, wherein the hardware module is configured to, following the compilation process for the first function to be performed in the second circuitry, cause at least some of the plurality of processing units to perform their associated at least one predefined operation in a particular order so as to perform a first function with respect to the first data packet.
  • the first circuitry comprises a hardware module comprising a plurality of processing units, each processing unit being associated with at least one predefined operation, wherein the first interface is configured to receive a first data packet, wherein the hardware module is configured to, during the compilation process for the first function to be performed in the second circuitry, cause at least some of the plurality of processing units to perform their associated at least one predefined operation in a particular order so as to perform a first function with respect to the first data packet.
  • the at least one controller is configured to perform the compilation process for compiling the first function to be performed by the second circuitry.
  • the at least one controller is configured to: prior to completion of the compilation process, instruct the first circuitry to perform the first function with respect to data packets received at the first interface.
  • the compilation process for compiling the first function to be performed by the second circuitry is performed by the host device, wherein the at least one controller is configured to determine that the compilation process has been completed in response to receiving an indication of the completion of the compilation process from the host device.
  • a processing pipeline for processing data packets received at the first interface
  • the processing pipeline comprises a plurality of components each configured to perform one of a plurality of functions with respect to data packets received at the first interface, wherein a first of the plurality of components is configured to provide the first function when provided by the first circuitry, wherein a second of the plurality of components is configured to provide the first function when provided by the second circuitry.
  • the at least one controller is configured to control the second circuitry to begin performing the first function with respect to data packets received at the first interface by inserting the second of the plurality of components into the processing pipeline.
  • the at least one controller is configured to, in response to said determination that the compilation process for the first function to be performed in the second circuitry is complete, control the first circuitry to cease performing the first function with respect to data packets received at the first interface by removing the first of the plurality of components from the processing pipeline.
  • the at least one controller is configured to control the second circuitry to begin performing the first function with respect to data packets received at the first interface by sending a control message through the processing pipeline to activate the second of the plurality of components.
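Activating a component by sending a control message through the pipeline itself, as described above, keeps the switch-over ordered relative to the packet stream: packets ahead of the message see the old configuration, packets behind it see the new one. A hedged sketch, with an invented message format:

```python
# Sketch of in-band control: a control message sent through the
# processing pipeline activates (or deactivates) a component, so the
# switch-over is ordered relative to the packet stream. The message
# format and component behaviour are invented for the example.

class Component:
    def __init__(self, name, fn, active=False):
        self.name, self.fn, self.active = name, fn, active

    def handle(self, item):
        if isinstance(item, dict) and item.get("ctrl") == self.name:
            self.active = item["activate"]   # consume the control message
            return None
        return self.fn(item) if self.active else item

def run_pipeline(components, stream):
    out = []
    for item in stream:
        for comp in components:
            item = comp.handle(item)
            if item is None:
                break                        # control message consumed
        if item is not None:
            out.append(item)
    return out

flt = Component("filter", fn=lambda p: p.upper())
out = run_pipeline([flt], ["a", {"ctrl": "filter", "activate": True}, "b"])
print(out)   # ['a', 'B'] -- 'a' precedes activation, 'b' follows it
```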
  • the at least one controller is configured to, in response to said determination that the compilation process for the first function to be performed in the second circuitry is complete, control the first circuitry to cease performing the first function with respect to data packets received at the first interface by sending a control message through the processing pipeline to deactivate the first of the plurality of components.
  • the first of the plurality of components is configured to provide the first function with respect to data packets of a first data flow passing through the processing pipeline, wherein the second of the plurality of components is configured to provide the first function with respect to data packets of a second data flow passing through the processing pipeline.
  • the first function comprises filtering data packets.
  • the first interface is configured to receive the data packets from the network.
  • the first interface is configured to receive the data packets from the host device.
  • a compilation time of the first function for the second circuitry is greater than a compilation time of the first function for the first circuitry.
  • a method comprising: receiving data packets at a first interface of the network interface device; and performing, in first circuitry of the network interface device, a first function with respect to data packets received at the first interface, wherein the first circuitry is configured to perform the first function with respect to data packets received at the first interface during a compilation process for the first function to be performed in the second circuitry, the method comprising: determining that the compilation process for the first function to be performed in the second circuitry is complete; and in response to said determination, controlling the second circuitry of the network interface device to begin performing the first function with respect to data packets received at the first interface.
  • a non-transitory computer readable medium comprising program instructions for causing a data processing system to perform a method comprising: receiving data packets at a first interface of the network interface device; and performing, in first circuitry of the network interface device, a first function with respect to data packets received at the first interface, wherein the first circuitry is configured to perform the first function with respect to data packets received at the first interface during a compilation process for the first function to be performed in the second circuitry, the method comprising: determining that the compilation process for the first function to be performed in the second circuitry is complete; and in response to said determination, controlling the second circuitry of the network interface device to begin performing the first function with respect to data packets received at the first interface.
  • a non-transitory computer readable medium comprising program instructions for causing a data processing system to perform the following: performing a compilation process to compile a first function to be performed by a second circuitry of a network interface device; prior to completion of the compilation process, sending a first instruction to cause a first circuitry of the network interface device to perform the first function with respect to data packets received at a first interface of the network interface device; and sending a second instruction to cause the second circuitry to, following completion of the compilation process, begin performing the first function with respect to data packets received at the first interface.
  • the non-transitory computer readable medium comprises program instructions for causing a data processing system to perform a further compilation process to compile the first function to be performed by the first circuitry, wherein the time taken for the compilation process is longer than the time taken for the further compilation process.
  • the data processing system comprises a host device, wherein the network interface device is configured to interface the host device with a network.
  • the data processing system comprises the network interface device, wherein the network interface device is configured to interface a host device with a network.
  • the data processing system comprises a host device and the network interface device, wherein the network interface device is configured to interface the host device with a network.
  • the first function comprises filtering data packets received at the first interface from a network.
  • the non-transitory computer readable medium comprises program instructions for causing the data processing system to perform the following: sending a third instruction to cause the first circuitry to, following completion of the compilation process, cease performing the first function with respect to data packets received at the first interface.
  • the non-transitory computer readable medium comprises program instructions for causing the data processing system to perform the following: sending an instruction to cause the second circuitry to perform the first function with respect to data packets of a first data flow; and sending an instruction to cause the first circuitry to cease performing the first function with respect to data packets of the first data flow.
  • the first circuitry comprises at least one central processing unit, wherein prior to completion of the second compilation process, each of the at least one central processing units is configured to perform the first function with respect to at least one data packet received at the first interface.
  • the second circuitry comprises a field programmable gate array configured to begin performing the first function with respect to data packets received at the first interface.
  • the second circuitry comprises a hardware module comprising a plurality of processing units, each processing unit being associated with at least one predefined operation, wherein the data packets received at the first interface comprise a first data packet, wherein the hardware module is configured to, following completion of the second compilation process, perform the first function with respect to the first data packet by each processing unit of at least some of the plurality of processing units performing its respective at least one operation with respect to the first data packet.
  • the first circuitry comprises a hardware module comprising a plurality of processing units configured to provide the first function with respect to a data packet, each processing unit being associated with at least one predefined operation wherein the data packets received at the first interface comprise a first data packet, wherein the hardware module is configured to, prior to completion of the second compilation process, perform the first function with respect to the first data packet by each processing unit of at least some of the plurality of processing units performing its respective at least one operation with respect to the first data packet.
  • the compilation process comprises assigning each of a plurality of processing units of the second circuitry to perform, in a particular order, at least one operation associated with one of a plurality of processing stages in a sequence of computer code instructions.
  • the first function provided by the first circuitry is provided as a component of a processing pipeline for processing data packets received at the first interface, wherein the first function provided by the second circuitry is provided as a component of the processing pipeline.
  • the first instruction comprises an instruction configured to cause the first of the plurality of components to be inserted into the processing pipeline.
  • the second instruction comprises an instruction configured to cause the second of the plurality of components to be inserted into the processing pipeline.
  • the non-transitory computer readable medium comprises program instructions for causing the data processing system to perform the following: sending a third instruction to cause the first circuitry to, following completion of the compilation process, cease performing the first function with respect to data packets received at the first interface, wherein the third instruction comprises an instruction configured to cause the first of the plurality of components to be removed from the processing pipeline.
  • the first instruction comprises a control message to be sent through the processing pipeline to activate the first of the plurality of components.
  • the second instruction comprises a control message to be sent through the processing pipeline to activate the second of the plurality of components.
  • the non-transitory computer readable medium comprises program instructions for causing the data processing system to perform the following: sending a third instruction to cause the first circuitry to, following completion of the compilation process, cease performing the first function with respect to data packets received at the first interface, wherein the third instruction comprises a control message to be sent through the processing pipeline to deactivate the first of the plurality of components.
  • a data processing system comprising at least one processor and at least one memory including computer program code, wherein the at least one memory and the computer program code are configured, with the at least one processor, to cause the data processing system to: perform a compilation process to compile a function to be performed by a second circuitry of a network interface device; prior to completion of the compilation process, instruct a first circuitry of the network interface device to perform the function with respect to data packets received at a first interface of the network interface device; and instruct the second circuitry to, following completion of the compilation process, begin performing the function with respect to data packets received at the first interface.
  • a method for implementation in a data processing system comprising: performing a compilation process to compile a function to be performed by a second circuitry of a network interface device; prior to completion of the compilation process, sending a first instruction to cause a first circuitry of the network interface device to perform the function with respect to data packets received at a first interface of the network interface device; and sending a second instruction to cause the second circuitry to, following completion of the compilation process, begin performing the function with respect to data packets received at the first interface.
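The hand-over described in the embodiments above can be sketched in software. The following is a simulation only, not an implementation of the device: the filter rule, class names, and timings are all hypothetical. A quickly available function (standing in for the first circuitry, e.g. CPUs) processes packets while a slower compilation (standing in for the second circuitry, e.g. an FPGA) completes in the background, after which processing is handed over.

```python
import threading
import time

class FallbackProcessor:
    """Hypothetical model: serve packets with a quick function until a
    slow compilation finishes, then swap in the compiled function."""

    def __init__(self, quick_fn, compile_slow_fn):
        self._fn = quick_fn                  # first circuitry: available at once
        self._lock = threading.Lock()
        self.compiled = threading.Event()
        threading.Thread(target=self._compile, args=(compile_slow_fn,),
                         daemon=True).start()

    def _compile(self, compile_slow_fn):
        slow_fn = compile_slow_fn()          # stands in for the long compilation
        with self._lock:
            self._fn = slow_fn               # second circuitry takes over
        self.compiled.set()

    def process(self, packet):
        with self._lock:
            return self._fn(packet)

def interpret_filter(packet):                # hypothetical filter rule
    return "drop" if packet["port"] == 0 else "accept"

def compile_fast_filter():
    time.sleep(0.05)                         # stands in for place-and-route time
    return lambda packet: "drop" if packet["port"] == 0 else "accept"

proc = FallbackProcessor(interpret_filter, compile_fast_filter)
print(proc.process({"port": 80}))            # served without waiting for compilation
proc.compiled.wait(timeout=2)
print(proc.process({"port": 0}))             # now served by the compiled path
```

Note that packets are processed from the moment the processor is created; the compilation latency is hidden rather than blocking the data path.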
  • a non-transitory computer readable medium comprising program instructions for causing a data processing system to assign each of a plurality of processing units to perform, in a particular order, at least one operation associated with one of a plurality of processing stages in a sequence of computer code instructions, wherein the plurality of processing stages provides a first function with respect to a first data packet received at a first interface of a network interface device, wherein each of the plurality of processing units is configured to perform one of a plurality of types of processing, wherein at least some of the plurality of processing units are configured to perform different types of processing, wherein for each of the plurality of processing units, the assigning is performed in dependence upon determining that the processing unit is configured to perform a type of processing suitable for performing the respective at least one operation.
  • each of the types of processing is defined by one of a plurality of templates.
  • the types of processing include at least one of: accessing a data packet received at the network interface device; accessing a lookup table stored in a memory of the hardware module; performing logic operations on data loaded from the data packet; and performing logic operations on data loaded from the lookup table.
  • two or more of the at least some of the plurality of processing units are configured to perform their associated at least one operation according to a common clock signal of the hardware module.
  • the assigning comprises assigning each of two or more of the at least some of the plurality of processing units to perform its associated at least one operation within a predefined length of time defined by a clock signal.
  • the assigning comprises assigning two or more of the at least some of the plurality of processing units to access the first data packet within a time period of the predefined length of time.
  • the assigning comprises assigning each of the two or more of the at least some of the plurality of processing units to, in response to the end of a time period of the predefined length of time, transfer results of the respective at least one operation to a next processing unit.
  • the non-transitory computer readable medium comprises program instructions for causing the data processing system to perform the following: assigning at least some of the plurality of stages to occupy a single clock cycle.
  • the non-transitory computer readable medium comprises program instructions for causing the data processing system to assign two or more of the plurality of processing units to execute their assigned at least one operation in parallel.
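The clocked behaviour described above — each processing unit performing its operation within one clock period and transferring its result to the next unit at the end of the period — can be illustrated with a toy cycle-driven model. This is purely illustrative; the stage functions and register model are invented for the sketch.

```python
def run_pipeline(stages, packets, cycles):
    """stages: one function per processing unit, applied in order.
    packets: inputs fed in one per clock period.
    Returns the packets that have exited the pipeline."""
    regs = [None] * len(stages)          # one pipeline register per unit
    done = []
    inputs = list(packets)
    for _ in range(cycles):
        out = regs[-1]                   # result leaving the last unit
        # end of clock period: transfer each result to the next unit
        for i in range(len(stages) - 1, 0, -1):
            regs[i] = stages[i](regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = stages[0](inputs.pop(0)) if inputs else None
        if out is not None:
            done.append(out)
    return done

stages = [lambda p: p + "|parsed",
          lambda p: p + "|looked-up",
          lambda p: p + "|acted"]
result = run_pipeline(stages, ["pkt0", "pkt1"], cycles=6)
print(result)
```

Because every stage occupies one clock period, a new packet can enter the pipeline each period while earlier packets are still in flight — the parallelism the embodiments above describe.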
  • the network interface device comprises a hardware module comprising the plurality of processing units.
  • the non-transitory computer readable medium comprises computer program instructions for causing the data processing system to perform the following: performing a compilation process comprising the assigning; prior to completion of the compilation process, sending a first instruction to cause a circuitry of the network interface device to perform the first function with respect to data packets received at the first interface; and sending a second instruction to cause the plurality of processing units to, following completion of the compilation process, begin performing the first function with respect to data packets received at the first interface.
  • for one or more of the at least some of the plurality of processing units, the assigned at least one operation comprises at least one of: loading at least one value of the first data packet from a memory of the network interface device; storing at least one value of the first data packet in a memory of the network interface device; and performing a look up into a look up table to determine an action to be carried out with respect to the first data packet.
  • the non-transitory computer readable medium comprises computer program instructions for causing the data processing system to issue an instruction to configure routing hardware of the network interface device to route the first data packet between the plurality of processing units in the particular order so as to perform the first function with respect to the first data packet.
  • the first function provided by the plurality of processing units is provided as a component of a processing pipeline for processing data packets received at the first interface.
  • the non-transitory computer readable medium comprises computer program instructions for causing the plurality of processing units to begin performing the first function with respect to data packets received at the first interface by causing the data processing system to issue an instruction to cause the component to be inserted into the processing pipeline.
  • the non-transitory computer readable medium comprises computer program instructions for causing the plurality of processing units to begin performing the first function with respect to data packets received at the first interface by causing the data processing system to issue an instruction to cause the component to be activated in the processing pipeline.
  • the data processing system comprises a host device, wherein the network interface device is configured to interface the host device with a network.
  • the data processing system comprises the network interface device.
  • the data processing system comprises: the network interface device; and a host device, wherein the network interface device is configured to interface the host device with a network.
  • a data processing system comprising at least one processor and at least one memory comprising computer program code, wherein the at least one memory and the computer program code are configured, with the at least one processor, to cause the data processing system to assign each of a plurality of processing units to perform, in a particular order, at least one operation associated with one of a plurality of processing stages in a sequence of computer code instructions, wherein the plurality of processing stages provides a first function with respect to a first data packet received at a first interface of a network interface device, wherein each of the plurality of processing units is configured to perform one of a plurality of types of processing, wherein at least some of the plurality of processing units are configured to perform different types of processing, wherein for each of the plurality of processing units, the assigning is performed in dependence upon determining that the processing unit is configured to perform a type of processing suitable for performing the respective at least one operation.
  • a method comprising assigning each of a plurality of processing units to perform, in a particular order, at least one operation associated with one of a plurality of processing stages in a sequence of computer code instructions, wherein the plurality of processing stages provides a first function with respect to a first data packet received at a first interface of a network interface device, wherein each of the plurality of processing units is configured to perform one of a plurality of types of processing, wherein at least some of the plurality of processing units are configured to perform different types of processing, wherein for each of the plurality of processing units, the assigning is performed in dependence upon determining that the processing unit is configured to perform a type of processing suitable for performing the respective at least one operation.
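The assignment step recited above — assigning each processing unit to a processing stage only when the unit supports the type of processing that stage requires — can be sketched as a small matching routine. The stage and unit names and the data layout are hypothetical, invented for this illustration.

```python
def assign_stages(stages, units):
    """stages: ordered list of (stage_name, required_type).
    units: list of (unit_name, set of supported types).
    Returns an ordered list of (stage, unit) pairs, consuming each
    unit at most once; raises if no capable unit remains."""
    free = list(units)
    plan = []
    for stage, required in stages:
        for i, (unit, types) in enumerate(free):
            if required in types:        # unit supports this type of processing
                plan.append((stage, unit))
                del free[i]              # each unit performs one stage here
                break
        else:
            raise ValueError(f"no unit supports {required!r} for {stage!r}")
    return plan

stages = [("parse_eth", "packet_access"),
          ("lookup_rule", "table_lookup"),
          ("apply_action", "logic")]
units = [("u0", {"packet_access", "logic"}),
         ("u1", {"table_lookup"}),
         ("u2", {"logic"})]
plan = assign_stages(stages, units)
print(plan)
```

The ordering of `plan` reflects the "particular order" of the processing stages, so downstream routing hardware could be configured directly from it.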
  • processing units of the hardware module have been described as executing their type of operation in a single step. However, the skilled person would recognise that this feature is a preferred feature only and is not essential or indispensable for the function of the invention.
  • a method comprising: receiving at a compiler a bit file description and a program, said bit file description comprising a description of routing of a part of a circuit; and compiling said program using said bit file description to output a bit file for said program.
  • the method may comprise using said bit file to configure at least a part of said part of said circuit to perform a function associated with said program.
  • the bit file description may comprise information about the routing between a plurality of processing units of said part of the circuit.
  • the bit file description may comprise for at least one of said plurality of processing units routing information indicating at least one of: to which one or more other processing units data can be output; and from which one or more other processing units data can be received.
  • the bit file description may comprise routing information indicating one or more routes between two or more respective processing units.
  • the bit file description may comprise information indicating only routes which are usable by the compiler when compiling the program to provide the bit file for the program.
  • the bit file may comprise information indicating for a respective processing unit, at least one of: from which one or more of said one or more other processing units in the bit file description for the respective processing unit an input is to be provided; to which one or more of said one or more other processing units in the bit file description for the respective processing unit an output is to be provided.
  • the part of the circuit may comprise at least a part of a configurable hardware module comprising a plurality of processing units, each processing unit being associated with a predefined type of operation executable in a single step, at least some of said plurality of processing units being associated with different predefined types of operation, said bit file description comprising information about the routing between at least some of the plurality of processing units wherein said method may comprise using said bit file to cause the hardware to interconnect at least some of said plurality of said processing units to provide a first data processing pipeline for processing one or more of said plurality of data packets to perform a first function with respect to said one or more of said plurality of data packets.
  • the bit file description may be of at least a portion of an FPGA.
  • the bit file description may be of a portion of an FPGA which is dynamically programmable.
  • the program may comprise one of an eBPF program and a P4 program.
  • the compiler and the FPGA may be provided in a network interface device.
  • an apparatus comprising at least one processor and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured, with the at least one processor, to cause the apparatus at least to: receive a bit file description and a program, said bit file description comprising a description of routing of a part of a circuit; and compile said program using said bit file description to output a bit file for said program.
  • the at least one memory and the computer code may be configured, with the at least one processor, to cause the apparatus to use said bit file to configure at least a part of said part of said circuit to perform a function associated with said program.
  • the bit file description may comprise information about the routing between a plurality of processing units of said part of the circuit.
  • the bit file description may comprise for at least one of said plurality of processing units routing information indicating at least one of: to which one or more other processing units data can be output; and from which one or more other processing units data can be received.
  • the bit file description may comprise routing information indicating one or more routes between two or more respective processing units.
  • the bit file description may comprise information indicating only routes which are usable by the compiler when compiling the program to provide the bit file for the program.
  • the bit file may comprise information indicating for a respective processing unit, at least one of: from which one or more of said one or more other processing units in the bit file description for the respective processing unit an input is to be provided; to which one or more of said one or more other processing units in the bit file description for the respective processing unit an output is to be provided.
  • the part of the circuit may comprise at least a part of a configurable hardware module comprising a plurality of processing units, each processing unit being associated with a predefined type of operation executable in a single step, at least some of said plurality of processing units being associated with different predefined types of operation, said bit file description comprising information about the routing between at least some of the plurality of processing units, wherein the at least one memory and the computer code are configured, with the at least one processor, to cause the apparatus to use said bit file to cause the hardware to interconnect at least some of said plurality of said processing units to provide a first data processing pipeline for processing one or more of said plurality of data packets to perform a first function with respect to said one or more of said plurality of data packets.
  • the bit file description may be of at least a portion of the FPGA.
  • the bit file description may be of a portion of the FPGA which is dynamically programmable.
  • the program may comprise one of an eBPF program and a P4 program.
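The compile step described in the embodiments above — taking a bit file description of the available routing between processing units and a program, and emitting a bit file that realises the program over those routes — can be sketched as follows. This is a hedged illustration only: the data layout, the depth-first search, and all names are invented, and the returned chain of units merely stands in for the output bit file.

```python
def compile_program(description, program):
    """description: {'units': {name: type}, 'routes': {name: [next names]}}.
    program: ordered list of operation types.
    Returns a chain of units realising the program using only described
    routes, or None if no such routing exists."""
    units, routes = description["units"], description["routes"]

    def search(prev, remaining, used):
        if not remaining:
            return []
        # first unit may be any unit; later units must be routable from prev
        candidates = list(units) if prev is None else routes.get(prev, [])
        for unit in candidates:
            if unit not in used and units[unit] == remaining[0]:
                rest = search(unit, remaining[1:], used | {unit})
                if rest is not None:
                    return [unit] + rest
        return None

    return search(None, program, frozenset())

desc = {"units": {"a": "load", "b": "lookup", "c": "logic", "d": "logic"},
        "routes": {"a": ["b"], "b": ["c", "d"], "c": ["d"]}}
chain = compile_program(desc, ["load", "lookup", "logic"])
print(chain)
```

Restricting the compiler to the routes named in the description mirrors the embodiment in which the bit file description indicates only routes usable by the compiler.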
  • a network interface device comprising: a first interface, the first interface being configured to receive a plurality of data packets; a configurable hardware module comprising a plurality of processing units, each processing unit being associated with a predefined type of operation executable in a single step; a compiler, said compiler configured to receive a bit file description and a program, said bit file description comprising a description of routing of at least a part of said configurable hardware module, and to compile said program using said bit file description to output a bit file for said program, wherein said hardware module is configurable using said bit file to perform a first function associated with the program.
  • the network interface device may be for interfacing a host device to a network.
  • At least some of said plurality of processing units may be associated with different predefined types of operation.
  • the hardware module may be configurable to interconnect at least some of said plurality of said processing units to provide a first data processing pipeline for processing one or more of said plurality of data packets to perform the first function with respect to said one or more of said plurality of data packets.
  • the first function comprises a filtering function.
  • the first function comprises at least one of a tunnelling, encapsulation, and routing function.
  • the first function comprises an extended Berkeley packet filter function.
  • the first function comprises a distributed denial of service scrubbing operation.
  • the first function comprises a firewall operation.
  • the first interface is configured to receive the first data packet from the network.
  • the first interface is configured to receive the first data packet from the host device.
  • two or more of the at least some of the plurality of processing units are configured to perform their associated at least one predefined operation in parallel.
  • two or more of the at least some of the plurality of processing units are configured to perform their associated predefined type of operation according to a common clock signal of the hardware module.
  • each of two or more of the at least some of the plurality of processing units is configured to perform its associated predefined type of operation within a predefined length of time defined by a clock signal.
  • two or more of the at least some of the plurality of processing units are configured to: access the first data packet within a time period of the predefined length of time; and in response to the end of the predefined length of time, transfer results of the respective at least one operation to a next processing unit.
  • the results comprise at least one of: at least one value from the one or more of the plurality of data packets; updates to map state; and metadata.
  • each of the plurality of processing units comprises an application specific integrated circuit configured to perform the at least one operation associated with the respective processing unit.
  • each of the processing units comprises a field programmable gate array.
  • each of the processing units comprises any other type of soft logic.
  • At least one of the plurality of processing units comprises a digital circuit and a memory storing state related to processing carried out by the digital circuit, wherein the digital circuit is configured to, in communication with the memory, perform the predefined type of operation associated with the respective processing unit.
  • the network interface device comprises a memory accessible to two or more of the plurality of processing units, wherein the memory is configured to store state associated with a first data packet, wherein during performance of the first function by the hardware module, two or more of the plurality of processing units are configured to access and modify the state.
  • a first of the at least some of the plurality of processing units is configured to stall during access of a value of the state by a second of the plurality of processing units.
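The shared-state behaviour in the two embodiments above — two or more processing units accessing and modifying state for the same flow, with one unit stalling while another holds a value — can be modelled in software with a lock, where blocking on the lock stands in for the stall. The class and counter names are hypothetical.

```python
import threading

class SharedState:
    """Toy stand-in for a memory accessible to several processing units."""

    def __init__(self):
        self._lock = threading.Lock()
        self.counters = {}

    def bump(self, key):
        # A unit "stalls" here while another unit is mid-access.
        with self._lock:
            self.counters[key] = self.counters.get(key, 0) + 1
            return self.counters[key]

state = SharedState()
threads = [threading.Thread(target=state.bump, args=("flow-1",))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(state.counters["flow-1"])
```

Without the lock, concurrent read-modify-write of the counter could lose updates; the stall serialises access exactly as the embodiment describes.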
  • one or more of the plurality of processing units are individually configurable to, based on their associated predefined type of operation, perform an operation specific to a respective pipeline.
  • the hardware module is configured to receive an instruction, and in response to said instruction, at least one of: interconnect at least some of said plurality of said processing units to provide a data processing pipeline for processing one or more of said plurality of data packets; cause one or more of said plurality of processing units to perform their associated predefined type of operation with respect to said one or more data packets; add one or more of said plurality of processing units into a data processing pipeline; and remove one or more of said plurality of processing units from a data processing pipeline.
  • the predefined operation comprises at least one of: loading at least one value of the first data packet from a memory; storing at least one value of a data packet in a memory; and performing a look up into a look up table to determine an action to be carried out with respect to a data packet.
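The three predefined operations listed above — loading a value from a packet, storing a value in a packet, and performing a look up into a look up table to determine an action — combine into the classic match-action step. The sketch below is illustrative only; the table contents and field names are hypothetical.

```python
# Hypothetical action table keyed on destination port.
LOOKUP_TABLE = {80: "accept", 22: "accept", 0: "drop"}

def match_action(packet):
    port = packet["dst_port"]                  # load a value from the packet
    action = LOOKUP_TABLE.get(port, "drop")    # look up the action to carry out
    packet["verdict"] = action                 # store a value in the packet
    return packet

print(match_action({"dst_port": 80})["verdict"])
print(match_action({"dst_port": 4444})["verdict"])
```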
  • the hardware module is configured to receive an instruction, wherein the hardware module is configurable to, in response to said instruction, interconnect at least some of said plurality of said processing units to provide a data processing pipeline for processing one or more of said plurality of data packets, wherein the instruction comprises a data packet sent through the third processing pipeline.
  • one or more of the at least some of the plurality of processing units are configurable to, in response to said instruction, perform a selected operation of their associated predefined type of operation with respect to said one or more of the plurality of data packets.
  • the plurality of components comprises a second of the plurality of components configured to provide the first function in circuitry different to the hardware module, wherein the network interface device comprises at least one controller configured to cause data packets passing through the processing pipeline to be processed by one of: the first of the plurality of components and the second of the plurality of components.
  • the network interface device comprises at least one controller configured to issue an instruction to cause the hardware module to begin performing the first function with respect to data packets, wherein the instruction is configured to cause the first of the plurality of components to be inserted into the processing pipeline.
  • the network interface device comprises at least one controller configured to issue an instruction to cause the hardware module to begin performing the first function with respect to data packets, wherein the instruction comprises a control message sent through the processing pipeline and configured to cause the first of the plurality of components to be activated.
  • the associated at least one operation comprises at least one of: loading at least one value of the first data packet from a memory of the network interface device; storing at least one value of the first data packet in a memory of the network interface device; and performing a look up into a look up table to determine an action to be carried out with respect to the first data packet.
  • one or more of the at least some of the plurality of processing units is configured to pass at least one result of its associated at least one predefined operation to a next processing unit in the first processing pipeline, the next processing unit being configured to perform a next predefined operation in dependence upon the at least one result.
  • each of the different predefined types of operation is defined by a different template.
  • the types of predefined operation comprise at least one of: accessing a data packet; accessing a lookup table stored in a memory of the hardware module; performing logic operations on data loaded from a data packet; and performing logic operations on data loaded from the lookup table.
  • the hardware module comprises routing hardware, wherein the hardware module is configurable to interconnect at least some of said plurality of said processing units to provide the first data processing pipeline by configuring the routing hardware to route data packets between the plurality of processing units in a particular order defined by the first data processing pipeline.
  • the hardware module is configurable to interconnect at least some of said plurality of said processing units to provide a second data processing pipeline for processing one or more of said plurality of data packets to perform a second function different to the first function.
  • the hardware module is configurable to interconnect at least some of said plurality of said processing units to provide a second data processing pipeline after interconnecting at least some of the plurality of said processing units to provide the first data processing pipeline.
  • the network interface device comprises further circuitry separate to the hardware module and configured to perform the first function for one or more of said plurality of data packets.
  • the further circuitry comprises at least one of: a field programmable gate array; and a plurality of central processing units.
  • the network interface device comprises at least one controller, wherein the further circuitry is configured to perform the first function with respect to data packets during a compilation process for the first function to be performed in the hardware module, wherein the at least one controller is configured to, in response to completion of the compilation process, control the hardware module to begin performing the first function with respect to data packets.
  • the further circuitry comprises a plurality of central processing units.
  • the at least one controller is configured to, in response to said determination that the compilation process for the first function to be performed in the hardware module is complete, control the further circuitry to cease performing the first function with respect to data packets.
  • the network interface device comprises at least one controller, wherein the hardware module is configured to perform the first function with respect to data packets during a compilation process for the first function to be performed in the further circuitry, wherein the at least one controller is configured to determine that the compilation process for the first function to be performed in the further circuitry is complete and, in response to said determination, control the further circuitry to begin performing the first function with respect to data packets.
  • the further circuitry comprises a field programmable gate array.
• the at least one controller is configured to, in response to said determination that the compilation process for the first function to be performed in the further circuitry is complete, control the hardware module to cease performing the first function with respect to data packets.
  • the network interface device comprises at least one controller configured to perform a compilation process to provide the first function to be performed in the hardware module.
  • the compilation process comprises providing instructions to provide a control plane interface in the hardware module that responds to control messages.
  • a computer implemented method comprising: determining routing information for at least a part of a configurable hardware module comprising a plurality of processing units, each processing unit being associated with a predefined type of operation executable in a single step, at least some of said plurality of processing units are associated with different predefined types of operation, said routing information providing information as to available routes between at least a plurality of processing units.
  • the configurable hardware module may comprise a substantially static part and a substantially dynamic part, said determining comprising determining routing information for said substantially dynamic part.
  • the determining routing information for said substantially dynamic part may comprise determining routing in said substantially dynamic part which is used by one or more of the processing units in said substantially static part.
• the determining may comprise analysing a bit file description of at least a part of said configurable hardware module to determine said routing information.
  • a non-transitory computer readable medium comprising program instructions for: determining routing information for at least a part of a configurable hardware module comprising a plurality of processing units, each processing unit being associated with a predefined type of operation executable in a single step, at least some of said plurality of processing units are associated with different predefined types of operation, said routing information providing information as to available routes between at least a plurality of processing units.
  • a computer program comprising program code means adapted to perform the method(s) may also be provided.
  • the computer program may be stored and/or otherwise embodied by means of a carrier medium.
  • Figure 1 shows a schematic view of a data processing system coupled to a network
  • Figure 2 shows a schematic view of a data processing system comprising a filtering operation application configured to run in user mode on a host computing device
  • Figure 3 shows a schematic view of a data processing system comprising a filtering operation configured to run in kernel mode on a host computing device;
  • Figure 4 shows a schematic view of a network interface device comprising a plurality of CPUs for performing a function with respect to data packets;
  • Figure 5 shows a schematic view of a network interface device comprising a field programmable gate array running an application for performing a function with respect to data packets;
  • Figure 6 shows a schematic view of a network interface device comprising a hardware module for performing a function with respect to data packets
  • Figure 7 shows a schematic view of a network interface device comprising a field programmable gate array and at least one processing unit for performing a function with respect to data packets;
  • Figure 8 illustrates a method implemented in a network interface device according to some embodiments
  • Figure 9 illustrates a method implemented in a network interface device according to some embodiments.
  • Figure 10 illustrates an example of processing a data packet by a series of programs
• Figure 11 illustrates an example of processing a data packet by a plurality of processing units
  • Figure 12 illustrates an example of processing a data packet by a plurality of processing units
  • Figure 13 illustrates an example of a pipeline of processing stages for processing a data packet
  • Figure 14 illustrates an example of a slice architecture having a plurality of pluggable components
• Figure 15 illustrates an example representation of the arrangement and order of processing of a plurality of processing units.
  • Figure 16 illustrates an example method of compiling the function
  • Figure 17 illustrates an example of a stateful processing unit
  • Figure 18 illustrates an example of a stateless processing unit
  • Figure 19 shows a method of some embodiments
• Figures 20a and 20b illustrate routing between slices in an FPGA; and Figure 21 illustrates schematically a partition on an FPGA.
  • each of the data processing systems has a suitable network interface to allow it to communicate across the channel.
  • the network is based on Ethernet technology.
  • Data processing systems that are to communicate over a network are equipped with network interfaces that are capable of supporting the physical and logical requirements of the network protocol.
• the physical hardware components of network interfaces are referred to as network interface devices or network interface cards (NICs).
• the operating system (OS) kernel includes a protocol stack for translating commands and data between the applications and a device driver specific to the network interface device.
  • the device driver may directly control the network interface device.
  • the data processing system 100 comprises a host computing device 101 coupled to a network interface device 102 that is arranged to interface the host to network 103.
  • the host computing device 101 includes an operating system 104 supporting one or more user level applications 105.
  • the host computing device 101 may also include a network protocol stack (not shown).
  • the protocol stack may be a component of the application, a library with which the application is linked, or be provided by the operating system. In some embodiments, more than one protocol stack may be provided.
  • the network protocol stack may be a Transmission Control Protocol (TCP) stack.
  • the application 105 can send and receive TCP/IP messages by opening a socket and reading and writing data to and from the socket, and the operating system 104 causes the messages to be transported across the network. For example, the application can invoke a system call (syscall) for transmission of data through the socket and then via the operating system 104 to the network 103.
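The socket read/write flow described above can be sketched with the standard sockets API; the following is a minimal, self-contained Python illustration (the loopback address and echo-style round trip are chosen purely for illustration), in which the OS kernel's TCP/IP stack transports the bytes written to the socket:

```python
import socket

def tcp_roundtrip(payload: bytes) -> bytes:
    # The application opens a socket and writes data to it; the operating
    # system causes the message to be transported to the peer.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))              # any free local port
    server.listen(1)
    client = socket.create_connection(server.getsockname())
    conn, _ = server.accept()
    client.sendall(payload)                    # "write data to the socket"
    received = b""
    while len(received) < len(payload):        # peer reads the data back out
        chunk = conn.recv(4096)
        if not chunk:
            break
        received += chunk
    for s in (client, conn, server):
        s.close()
    return received
```

In a real application the `sendall` call is the point at which a syscall passes the data to the operating system for transmission.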
  • This interface for transmitting messages may be known as the message passing interface.
  • the network interface device 102 may comprise a TCP Offload Engine (TOE) for performing the TCP protocol processing.
• the demand on the processor(s) of the host system 101 may be reduced.
• Data to be transmitted over the network may be sent by an application 105 via a TOE-enabled virtual interface driver, by-passing the kernel TCP/IP stack in part or entirely. Data sent along this fast path therefore need only be formatted to meet the requirements of the TOE driver.
  • the host computing device 101 may comprise one or more processors and one or more memories.
  • the host computing device 101 and the network interface device 102 may communicate via a bus, for example a peripheral component interconnect express (PCIe bus).
  • data to be transmitted onto the network may be transferred from the host computing device 101 to the network interface device 102 for transmission.
  • data packets may be transferred from the host to the network interface device directly by the host processor.
  • the host may provide data to one or more buffers 106 located on the network interface device 102.
  • the network interface device 102 may then prepare the data packets and transmit them over the network 103.
  • the data may be written to a buffer 107 in the host system 101.
  • the data may then be retrieved from the buffer 107 by the network interface device and transmitted over the network 103.
  • data is temporarily stored in one or more buffers prior to transmission over the network.
• Data sent over the network could be returned to the host (in a loopback).
• filtering processes may be carried out on received data packets so as to protect the host system 101 from distributed denial of service (DDOS) attacks.
• Such filtering processes may be carried out by a simple packet examination or an extended Berkeley packet filter (eBPF).
  • encapsulation and forwarding may be carried out for data packets to be transmitted over the network 103.
  • FIG. 2 illustrates one way in which a filtering operation or other packet processing operation may be implemented in a host system 220.
  • the processes performed by the host system 220 are shown as being performed either in user space or kernel space.
  • a receive path for delivering data packets received from a network at the network interface device 210 to a terminating application 250 is present in kernel space.
  • This receive path comprises a driver 235, a protocol stack 240, and a socket 245.
  • the filtering operation 230 is implemented in user space.
  • the incoming packets that are provided by the network interface device 210 to the host system 220 bypass the kernel (where protocol processing takes place) and are provided directly to the filtering operation 230.
  • the filtering operation 230 is provided with a virtual interface (which may be an ether fabric virtual interface (EFVI) or data plane development kit (DPDK) or any other suitable interface) for exchanging the data packets with other elements in the host system 220.
  • the filtering operation 230 may perform DDOS scrubbing and/or other forms of filtering.
  • a DDOS scrubbing process may execute on all packets which are easily recognized as DDOS candidates - for example, a sample packet, a copy of a packet, and packets which have not yet been categorized.
  • the packets not delivered to the filtering operation 230 may be passed from the network interface to the driver 235 directly.
  • the operation 230 may provide an extended Berkeley packet filter (eBPF) for performing the filtering.
  • the operation 230 is configured to re-inject the packets into the receive path in the kernel for processing received packets. Specifically, the packets are provided to the driver 235 or stack 240. The packets are then protocol processed by the protocol stack 240. The packets are then passed to the socket 245 associated with the terminating application 250. The terminating application 250 issues a recv() call to retrieve the data packets from a buffer of the associated socket.
  • the filtering operation 230 runs on the host CPU.
• In order to run the filtering operation 230, the host CPU must process the data packets at the rate at which they are received from the network. In cases where the rate at which data is sent and received from the network is high, this can constitute a large drain on the processing resources of the host CPU.
  • a high data flow rate to the filtering operation 230 may result in heavy consumption of other limited resources - such as I/O bandwidth and internal memory/cache bandwidth.
• In order to perform the re-injection of the data packets into the kernel, it is necessary to provide the filtering operation 230 with a privileged API for performing the re-injection.
• the re-injection process may be cumbersome, requiring attention to packet ordering.
• In order to perform the re-injection, the operation 230 may in many cases require a dedicated CPU core.
  • the steps of providing the data to the operation and re-injecting require the data to be copied into and out of memory. This copying is a resource burden on the system.
  • Some operations may require the forwarding of processed packets back onto the network.
  • an additional layer known as the express data path (XDP) 310 is inserted into the transmit and receive path in the kernel.
  • An extension to XDP 310 allows insertion into the transmit path.
  • XDP helpers allow packets to be transmitted (as a result of a receive operation).
  • the XDP 310 is inserted at the driver level of the operating system and allows for programs to be executed at this level so as to perform operations on the data packets received from the network prior to them being protocol processed by stack 240.
  • the XDP 310 also allows for programs to be executed at this level so as to perform operations on data packets to be sent over the network. eBPF programs and other programs can, therefore, operate in the transmit and receive paths.
  • the filtering operation 320 may be inserted from user space into the XDP to form a program 330 that is part of the XDP 310.
  • the operation 320 is inserted using the XDP control plane that is to be executed on the data receive path to provide a program 330 which performs the filtering operations (e.g. DDOS scrubbing) for packets on the receive path.
  • a program 330 may be an eBPF program.
  • the program 330 is shown inserted into the kernel between the driver 235 and the protocol stack 240. However, in other examples, the program 330 may be inserted at other points in the receive path in the kernel.
  • the program 330 may be part of a separate control path that receives data packets.
  • the program 330 may be provided by an application by providing extensions to an application programming interface (API) of the socket 245 for that application. This program 330 may additionally or alternatively perform one or more operations on data being sent over the transmit path.
  • the XDP 310 then invokes the driver’s 235 transmit function to send data over the network via the network interface device 210.
  • the program 330 in this case may provide a load balancing or routing operation with respect to data packets to be sent over the network.
  • the program 330 may provide a segment re-encapsulation and forwarding operation with respect to data packets to be sent over the network
  • the program 330 may be used for firewalling and virtual switching or other operations not requiring protocol termination or application processing.
  • One advantage of the use of the XDP 310 in this way, is that the program 330 can directly access the memory buffers handled by the driver without intermediate copies.
  • a verifier may run on the host system 220 to verify the safety of the program 330.
• the verifier may be configured to ensure that no loops exist. Backward jump operations may be permitted provided they do not cause loops.
  • the verifier may be configured to ensure that the program 330 has no more than a predefined number (e.g. 4000) instructions.
• the verifier may perform checks on the validity of register usage by traversing through data paths of the program 330. If there are too many possible paths, the program 330 will be rejected as being unsafe to run in kernel mode. For example, if there are more than 1000 branches, the program 330 may be rejected.
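The checks described above (loop avoidance, an instruction-count limit, a branch limit) can be modelled by a toy static verifier; the following Python sketch uses an invented (op, target) instruction encoding for illustration and is not the real in-kernel eBPF verifier:

```python
MAX_INSNS = 4000      # example instruction limit from the description
MAX_BRANCHES = 1000   # example branch limit from the description

def verify(program):
    """Toy static verifier over a list of (op, target) tuples.

    Backward jumps are conservatively rejected here as potential loops,
    although a real verifier may permit those it can prove loop-free."""
    if len(program) > MAX_INSNS:
        return False, "too many instructions"
    branches = 0
    for pc, (op, target) in enumerate(program):
        if op == "jmp":
            branches += 1
            if target <= pc:        # backward jump: could form a loop
                return False, "possible loop"
    if branches > MAX_BRANCHES:
        return False, "too many branches"
    return True, "ok"
```

A program rejected by any of these checks would not be installed in the kernel.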
  • XDP is one example by which a safe program 330 may be installed in the kernel, and that there are other ways in which this could be accomplished.
  • the approach discussed above with respect to Figure 3 may be as efficient as the approach discussed above with respect to Figure 2 if, for example, the operation can be expressed in a safe (or sandboxed) language required for executing code in the kernel.
  • the eBPF language can be executed efficiently on an x86 processor and JIT (Just in Time) compilation techniques enable eBPF programs to be compiled to native machine code.
• the language is designed to be safe, e.g. state is limited to map-only constructs, which are shared data structures (such as a hash table). Only limited looping is allowed; instead, one eBPF program is allowed to tail-call another.
• the state space is constrained. However, in some implementations, this approach may place a large drain on the resources (e.g. the host CPU) of the host system 220.
  • the operations on the data packets are still being performed by the Host CPU, which is required to perform such operations at the rate at which the data is being sent/received.
  • Another proposal is to perform the above discussed operations in the network interface device instead of in the host system. Doing so may free up the CPU cycles used by the host CPU when executing the operations in addition to the I/O bandwidth, memory and cache bandwidth consumed. Moving execution of the processing operation from the host to hardware of the network interface device may present some challenges.
  • FIG 4 illustrates an example of a network interface device 400 comprising an array 410 of central processing units (CPUs), e.g. CPU 420.
  • the CPUs are configured to perform functions, such as filtering data packets sent and received from the network.
• Each CPU of the array 410 of CPUs may be a network processing unit (NPU).
  • the CPUs may additionally or alternatively be configured to perform operations, such as load-balancing on data packets received from the host for transmission over the network.
  • These CPUs are specialized for such packet processing/manipulation operations.
  • the CPUs execute an instruction set which is optimized for such packet processing/manipulation operations.
  • the network interface device 400 additionally comprises memory (not shown) that is shared amongst and accessible to the array 410 of CPUs.
  • the network interface device 400 comprises a network medium access control (MAC) layer 430 for interfacing the network interface device 400 with the network.
  • the MAC layer 430 is configured to receive data packets from over the network and send data packets over the network.
  • the operations on packets received at the network interface device 400 are parallelized over the CPUs.
• When a data flow is received at the MAC layer 430, it is passed to a spread function 440, which is configured to extract data packets from the flow and distribute them over a plurality of CPUs in the array 410 for the CPUs to perform processing, e.g. filtering, of these data packets.
  • the spread function 440 may parse the received data packets so as to identify the data flows to which they belong.
  • the spread function 440 generates for each packet, an indication of the respective packet’s position in the data flow to which it belongs.
  • the indications may, for example, be tags.
  • the spread function 440 adds the respective indication to each packet’s associated metadata.
  • the associated metadata for each data packet may be appended to the data packet.
  • the associated metadata could be passed to the spread function 440 as side-band control information.
  • the indication is added in dependence upon the flow to which the data packet belongs, such that the order of data packets for any particular flow may be reconstructed.
  • the data packets are then passed to a re-order function 450, which re-orders the packets of the data flow into their correct order before passing them to the host interface layer 460.
  • the re-order function 450 may re-order the data packets within a flow by comparing the indications (e.g. tags) within the data packets of the flow to reconstruct the order of the data packets.
  • the re-ordered data packets then traverse the host interface 460 and are delivered to the host system 220.
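The spread, tagging and re-ordering steps described above can be modelled in a few lines; the following Python sketch uses round-robin distribution and a dictionary-based packet representation as illustrative assumptions, not the device's actual scheme:

```python
def spread(packets, n_workers):
    """Tag each packet with its position in its flow, then distribute the
    packets round-robin across workers (a model of spread function 440)."""
    next_tag = {}
    workers = [[] for _ in range(n_workers)]
    for i, (flow, payload) in enumerate(packets):
        tag = next_tag.get(flow, 0)            # position within the flow
        next_tag[flow] = tag + 1
        workers[i % n_workers].append({"flow": flow, "tag": tag, "payload": payload})
    return workers

def reorder(processed_workers):
    """Merge per-worker output and restore per-flow order by comparing the
    tags (a model of re-order function 450)."""
    merged = [p for worker in processed_workers for p in worker]
    merged.sort(key=lambda p: (p["flow"], p["tag"]))
    return merged
```

Because the tag records each packet's position within its own flow, per-flow ordering is recovered regardless of which worker processed each packet.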
• Although Figure 4 illustrates the array 410 of CPUs operating only on data packets received from the network, similar principles (including spreading and re-ordering) may also be applied to data packets received from the host for transmission over the network.
  • the program that is executed by the CPUs may be a compiled or transcoded version of the program that would execute on the host CPU in the example described above with respect to Figure 3.
• the instruction set that would execute on a host CPU to perform the operations is translated for execution on each CPU of the array of specialized CPUs in the network interface device 400.
  • each instance of the program may be responsible for processing a different set of data packets received at the network interface device. However, each individual data packet is processed by a single CPU when providing the function of the program with respect to that data packet.
  • the overall effect of the execution of the parallel programs may be the same as the execution of a single program (e.g. program 330) on the host CPU.
  • One of the specialized CPUs may process data packets at an order of 50 million packets per second. This operating speed may be lower than the operating speed of the host CPU. Therefore, parallelization may be used to achieve the same performance as would be achieved by executing an equivalent program on the host CPU.
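As a rough illustration of why parallelization is needed, the number of such CPUs required for a target aggregate rate follows directly from the per-CPU figure quoted above (order-of-magnitude arithmetic only):

```python
import math

PER_CPU_PPS = 50_000_000   # ~50 million packets per second per specialized CPU

def cpus_needed(target_pps: int) -> int:
    """How many specialized CPUs packets must be spread over to sustain
    the given aggregate packet rate."""
    return math.ceil(target_pps / PER_CPU_PPS)
```

For example, sustaining an aggregate rate of 200 million packets per second would require spreading the traffic over four such CPUs.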
  • the data packets are spread over the CPUs and then re-ordered after processing by the CPUs.
  • the requirement to process data packets of each flow in order along with the re-ordering step 450 may introduce bottlenecks, increase memory resource overheads and may limit the available throughput of the device. This requirement and the re-ordering step 450 may increase the jitter of the device, since the processing throughput may fluctuate depending on the contents of the network traffic and the degree to which the parallelism can be applied.
  • One advantage of the use of such specialized CPUs may be the short compile time. For example, it may be possible to compile a filtering application to run on such a CPU in less than 1 second.
  • Another proposal is to include in the network interface device, a field programmable gate array (FPGA) and to use the FPGA to perform the operations on data packets received from the network.
  • Figure 5 illustrates an example of the use, in a network interface device 500, of an FPGA 510 having an FPGA application 515 for performing operations on data packets received at the network interface device 500.
  • Like elements as those in Figure 4 are referred to with like reference numerals.
• Although FIG. 5 illustrates the FPGA application 515 operating only on data packets received from the network, an FPGA application 515 may be used to perform functions (e.g. load balancing and/or a firewall function) on data packets received from the host for transmission over the network, or back to the host or another network interface on the system.
• the FPGA application 515 may be provided by compiling a program written in a common system-level language, such as C, C++, or Scala, to run on an FPGA 510.
  • That FPGA 510 may have network interface functionality and FPGA functionality.
  • the FPGA functionality may provide an FPGA application 515, which may be programmed into the FPGA 510 according to the needs of the network interface device user.
  • the FPGA application 515 may, for example, provide filtering of the messages on the receive path from the network 230 to the host.
  • the FPGA application 515 may provide a firewall.
  • the FPGA 510 may be programmable to provide the FPGA application 515.
• Some of the network interface device functionality may be implemented as "hard" logic within the FPGA 510.
  • the hard logic may be application specific integrated circuit (ASIC) gates.
• the FPGA application 515 may be implemented as "soft" logic.
  • the soft logic may be provided by programming the FPGA LUTs (look up tables).
  • the hard logic may be capable of being clocked at a higher rate as compared to the soft logic.
  • the network interface device 500 comprises a host interface 505 configured to send and receive data with the host.
• the network interface device 500 comprises a network medium access control (MAC) interface 520 configured to send and receive data with the network.
• When a data packet is received from the network at the MAC interface 520, the data packet is passed to the FPGA application 515, which is configured to perform a function, such as filtering, with respect to the data packet. The data packet (if it passes any filtering) is then passed to the host interface 505, from where it is passed to the host.
  • the FPGA application 515 may determine to drop or re-transmit the data packet.
  • the FPGA is composed of many logic elements (e.g. logic cells) which individually represent a primitive logic operation, such as AND, OR, NOT, etc. These logic elements are arranged into a matrix with a programmable interconnect. In order to provide a function, these logic cells may need to operate together to implement the circuit definition and synchronous clock timing constraints. Placing each logic cell and routing between cells may algorithmically be a difficult challenge. When compiling on an FPGA having lower levels of utilisation, the compile time may be less than ten minutes.
  • One approach is to design hardware using specific processing primitives, such as parse, match and action primitives. These may be used to construct a processing pipeline where all packets undergo each of the three processes. Firstly, a packet is parsed to construct a metadata representation of the protocol headers. Secondly, the packet is flexibly matched against rules held in tables. Finally, when a match is found the packet is actioned in dependence upon the entry from the table selected in the match operation.
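A minimal sketch of such a parse/match/action pipeline is shown below; this Python model uses an invented two-byte header layout and illustrative table entries, not the primitives of any particular hardware design:

```python
def parse(packet: bytes) -> dict:
    # Parse: construct a metadata representation of the (toy) protocol headers.
    return {"proto": packet[0], "port": packet[1], "payload": packet[2:]}

# Match rules held in a table: (proto, port) -> action.  Entries are illustrative.
MATCH_TABLE = {
    (17, 53): "drop",
    (6, 80): "forward",
}

def match_action(meta: dict):
    # Match the parsed headers against the table rules, then action the packet
    # in dependence upon the entry selected; unmatched packets are forwarded.
    action = MATCH_TABLE.get((meta["proto"], meta["port"]), "forward")
    return None if action == "drop" else meta["payload"]

def pipeline(packet: bytes):
    # Every packet undergoes each of the three processes in turn.
    return match_action(parse(packet))
```

In the hardware described, each of the three stages would be a dedicated processing primitive rather than a Python function.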
  • the P4 programming language (or a similar language) may be used.
  • the P4 programming language is target independent, meaning that a program written in P4 can be compiled to run in different types of hardware such as CPUs, FPGAs, ASICs, NPUs, etc. Each different type of target is provided with its own compiler that maps the P4 source code into the appropriate target switch model.
• P4 may be used to provide a programming model which allows a high-level program to express packet processing operations for a packet processing pipeline. This approach works well for operations which naturally express themselves in a declarative style. In the P4 language, the programmer expresses the parsing, matching, and action stages as operations to be performed for the received data packets. These operations are gathered together for dedicated hardware to perform efficiently. However, this declarative style may not be appropriate for expressing programs of an imperative nature, such as eBPF programs.
  • a sequence of eBPF programs may be required to execute serially.
• a chain of eBPF programs is generated, one calling another.
  • Each program can modify state and the output is as if the entire chain of programs has executed serially. It may be challenging for a compiler to gather all the parsing, matching and actioning steps. However, even in the case that the chain of eBPF programs has already been installed, it might be necessary to install, remove, or modify the chain, which may present further challenges.
• FIG. 10 illustrates an example of a sequence of programs e1, e2, e3 that are configured to process a data packet.
  • Each of the programs may be an eBPF program, for example.
• Each of the programs is configured to parse the received data packet, perform a lookup into table 1010 to determine an action in a matching entry in the table 1010, and then perform the action with respect to the data packet. The action may comprise modifying the packet.
• Each of the eBPF programs may also perform an action in dependence upon local and shared state.
• the data packet P0 is initially processed by eBPF program e1, before being passed, modified, to the next program e2 in the pipeline.
• the output of the sequence of programs is the output of the final program in the pipeline, i.e. e3.
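The serial chain of table-driven programs can be modelled as follows; in this Python sketch the dictionary packet representation, the `proto` field, and the lambda actions stand in for table 1010 and its entries, purely for illustration:

```python
def make_program(table, key_field):
    """One stage of the chain: look up a rule for the packet, apply the
    matching action, and pass the (possibly modified) packet on."""
    def program(packet):
        action = table.get(packet[key_field], lambda p: p)   # default: no-op
        return action(packet)
    return program

def run_chain(programs, packet):
    # The packet flows through e1, e2, e3 in turn; the chain's output is
    # the output of the final program, as if the chain executed serially.
    for program in programs:
        packet = program(packet)
        if packet is None:     # a stage may drop the packet
            break
    return packet

# Example table (playing the role of table 1010): increment a hop count for
# packets whose proto field matches the rule key.
EXAMPLE_TABLE = {"x": (lambda p: dict(p, hops=p["hops"] + 1))}
EXAMPLE_CHAIN = [make_program(EXAMPLE_TABLE, "proto")] * 3
```

Each stage performs its own lookup and modification, and the overall effect is that of the whole chain having executed in order.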
  • a network interface device comprising a plurality of processing units.
  • Each processing unit is configured to perform at least one predefined operation in hardware.
  • Each processing unit comprises a memory storing its own local state.
  • Each processing unit comprises a digital circuit modifying this state.
  • the digital circuit may be an application specific integrated circuit.
  • Each processing unit is configured to run a program comprising configurable parameters so as to perform the respective plurality of operations.
  • Each processing unit may be an atom.
• An atom is defined by the specific programming and routing of a pre-defined template. This defines its specific operational behaviour and logical place in the flow provided by the connected plurality of processing units. Where the term 'atom' is used in the specification, this may be understood to refer to a data processing unit that is configured to execute its operations in a single step. In other words, the atom executes its operations as an atomic operation.
  • An atom may be regarded as a collection of hardware structures which can be configured to repeatedly perform one of a range of computations, taking one or more inputs and producing one or more outputs.
  • An atom is provided by hardware.
  • An atom may be configured by a compiler.
  • An atom may be configured to perform computations.
  • At least some of the plurality of processing units are arranged to perform operations such that a function is performed with respect to a data packet received at the network interface device by the at least some of the plurality of processing units.
  • Each of the at least some of the plurality of processing units is configured to perform its respective at least one predefined operation so as to perform the function with respect to a data packet.
  • the operations which the connected processing units are configured to perform are performed with respect to a received data packet.
  • the operations are performed sequentially by the at least some of the plurality of processing units.
  • each of the atoms By arranging each of the atoms to execute their respective at least one predefined operation so as to perform the function, the compilation time may be reduced as compared to the FPGA application example described above with respect to Figure 5. Furthermore, by performing the function using processing units specifically dedicated to performing particular operations in hardware, the speed at which the function can be performed may be improved with respect to using a CPU executing software in the network interface device to perform the function for each data packet as discussed above with respect to Figure 4.
  • FIG. 6 illustrates an example of a network interface device 600 according to embodiments of the application.
  • the network interface device comprises a hardware module 610 configured to perform the processing of data packets received at an interface of the network interface device 600.
  • Figure 6 illustrates the hardware module 610 performing a function (e.g. filtering) for data packets on the receive path
  • the hardware module 610 may also be used for performing a function (e.g. load balancing or a firewall) for data packets on the transmit path that are received from the host.
  • the network interface device 600 comprises a host interface 620 for sending and receiving data packets with the host and a network MAC interface 630 for sending and receiving data packets with the network.
  • the network interface device 600 comprises a hardware module 610 comprising a plurality of processing units 640a, 640b, 640c, 640d.
  • Each of the processing units may be an atom processing unit.
  • the term atom is used in the description to refer to processing units.
  • Each of the processing units is configured to perform at least one operation in hardware.
  • Each of the processing units comprises a digital circuit 645 configured to perform the at least one operation.
  • the digital circuit 645 may be an application specific integrated circuit.
  • Each of the processing units additionally comprises a memory 650 storing state information. The digital circuit 645 updates the state information when executing the respective plurality of operations.
  • each of the processing units has access to a shared memory 660, which may also store state information accessible to each of the plurality of processing units.
  • the state information in the shared memory 660 and/or the state information in the memory 650 of the processing units may include at least one of: metadata which is passed between processing units, temporary variables, the contents of the data packets, the contents of one or more shared maps.
  • the plurality of processing units are capable of providing a function to be performed with respect to data packets received at the network interface device 600.
  • the compiler outputs instructions to configure the hardware module 610 to perform a function with respect to incoming data packets by arranging at least some of the plurality of processing units to perform their respective at least one predefined operation with respect to each incoming data packet. This may be achieved by chaining (i.e. connecting) together the at least some of the processing units 640a, 640b, 640c, 640d so that each of the connected processing units will perform their respective at least one operation with respect to each incoming data packet. Each of the processing units performs their respective at least one operation in a particular order so as to perform the function.
  • the order may be such that two or more of the processing units execute in parallel with each other, i.e. at the same time.
  • one processing unit may read from a data packet during a time period (defined by a periodic signal (e.g. clock signal) of the hardware module 610) in which a second processing unit also reads from a different location in the same data packet.
  • the data packet passes through each stage represented by the processing units in a sequence. In this case, each processing unit completes its processing before passing the data packet to the next processing unit for performing its processing.
  • processing units 640a, 640b, and 640d are connected together at compile time, such that each of them performs their respective at least one operation so as to perform a function, e.g. filtering, with respect to the received data packet.
  • the processing units 640a, 640b, 640d form a pipeline for processing the data packet.
  • the data packet may move along this pipeline in stages, each having an equal time period.
  • the time period may be defined according to a periodic signal or beat.
  • the time period may be defined by a clock signal. Several periods of the clock may define one time period for each stage of the pipeline.
  • the data packet moves along one stage in the pipeline at the end of each occurrence of the repeating time period.
  • the time period may be a fixed interval.
  • each time period for a stage in the pipeline may take a variable amount of time.
  • a signal indicating the next stage in the pipeline may be generated when the previous processing stage has finished an operation, which may take a variable amount of time.
  • a stall may be introduced at any stage in the pipeline by delaying the signal for some pre-determined amount of time.
  • Each of the processing units 640a, 640b, 640d may be configured to access shared memory 660 as part of their respective at least one operation.
  • Each of the processing units 640a, 640b, 640d may be configured to pass metadata between one another as part of their respective at least one operation.
  • Each of the processing units 640a, 640b, 640d may be configured to access the data packet received from the network as part of their respective at least one operation.
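As a hedged software sketch (not the claimed hardware), the chaining of processing units 640a, 640b and 640d into a function pipeline might look like the following; the function bodies and the blocked-address policy are illustrative assumptions:

```python
# Sketch: three atoms chained at compile time so that each performs its
# operation in order, together implementing a filtering function.
def atom_640a(packet, metadata):
    # Read from the data packet: extract a source identifier.
    metadata["src"] = packet["src"]
    return metadata

def atom_640b(packet, metadata):
    # Look up an action for the source in shared state (a table).
    blocked = {"10.0.0.1"}
    metadata["drop"] = metadata["src"] in blocked
    return metadata

def atom_640d(packet, metadata):
    # Act on the decision produced by the upstream units.
    metadata["verdict"] = "drop" if metadata["drop"] else "pass"
    return metadata

pipeline = [atom_640a, atom_640b, atom_640d]  # 640c omitted, as in Figure 6

def run(packet):
    metadata = {}
    for atom in pipeline:
        metadata = atom(packet, metadata)
    return metadata["verdict"]

print(run({"src": "10.0.0.1"}))  # drop
print(run({"src": "10.0.0.2"}))  # pass
```

The metadata dictionary stands in for the results passed between processing units; in the hardware module the same role is played by wires of the interconnection fabric and the shared memory 660.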
  • the processing unit 640c is not used to perform processing of received data packets so as to provide the function, but is omitted from the pipeline.
  • a data packet received at the network MAC layer 630 may be passed to the hardware module 610 for processing.
  • the processing performed by the hardware module 610 may be part of a larger processing pipeline providing additional functions with respect to the data packet other than the function provided by the hardware module 610. This is illustrated with respect to Figure 14, and will be explained in more detail below.
  • the first processing unit 640a is configured to perform a first at least one operation with respect to the data packet. This first at least one operation may comprise at least one of: reading from the data packet, reading and writing to shared state in memory 660, and/or performing a look up into a table to determine an action.
  • the first processing unit 640a is then configured to produce results from its at least one operation.
  • the results may be in the form of metadata.
  • the results may comprise a modification to the data packet.
  • the results may comprise a modification to shared state in memory 660.
  • the second processing unit 640b is configured to perform its at least one operation with respect to the first data packet in dependence upon the results from the operation carried out by the first processing unit 640a.
  • the second processing unit 640b produces results from its at least one operation and passes the results to a third processing unit 640d that is configured to perform its at least one operation with respect to the first data packet.
  • the data packet may then be passed to the host interface 620, from where it is passed to the host system.
  • the connected processing units form a pipeline for processing a data packet received at the network interface device.
  • This pipeline may provide the processing of an eBPF program.
  • the pipeline may provide the processing of a plurality of eBPF programs.
  • the pipeline may provide the processing of a plurality of modules which execute in a sequence.
  • the connecting together of processing units in the hardware module 610 may be performed by programming a routing function of a pre-synthesised interconnection fabric of the hardware module 610.
  • This interconnection fabric provides connections between the various processing units of the hardware module 610.
  • the interconnection fabric is programmed according to the topology supported by the fabric. A possible example topology is discussed below with reference to Figure 15.
  • the hardware module 610 supports at least one bus interface.
  • the at least one bus interface receives data packets at the hardware module 610 (e.g. from the host or network).
  • the at least one bus interface outputs data packets from the hardware module 610 (e.g. to the host or network).
  • the at least one bus interface receives control messages at the hardware module 610.
  • the control messages may be for configuring the hardware module 610.
  • the example shown in Figure 6 has the advantage of a reduced compile time with respect to the FPGA application 515 shown in Figure 5.
  • the hardware module 610 of Figure 6 may require less than 10 seconds to compile a filtering function, for example.
  • the example shown in Figure 6 has the advantage of improved processing speed with respect to the example of an array of CPUs shown in Figure 4.
  • An application may be compiled for execution in such a hardware module 610 by mapping a generic program (or multiple programs) to a pre-synthesised data path.
  • the compiler builds the data-path by linking an arbitrary number of processing stage instances, where each instance is built from one of the pre-synthesised processing stage atoms.
  • Each of the atoms is built from a circuit.
  • Each circuit may be defined using an RTL (register transfer language) or high level language.
  • Each circuit is synthesised using a compiler or tool chain.
  • the atoms may be synthesised into hard-logic and so be available as a hard (ASIC) resource in a hardware module of the network interface device.
  • the atoms may be synthesised into soft-logic.
  • the atoms in soft-logic may be provided with constraints which allocate and maintain the place and route information of the synthesised logic on the physical device.
  • An atom may be designed with configurable parameters that specify an atom's behaviour. Each parameter may be a variable, or even a sequence of operations (a micro program), which may specify at least one operation to be performed by a processing unit during a clock cycle of the processing pipeline.
  • the logic implementing the atoms may be synchronously or asynchronously clocked.
  • the processing pipeline of atoms itself may be configured to operate according to a periodic signal.
  • the data packet and metadata each move one stage along the pipeline in response to each occurrence of the signal.
  • the processing pipeline may operate in an asynchronous manner. In this case, back pressure at higher levels in the pipeline will cause each downstream stage to start processing only when data from an upstream stage has been presented to it.
  • a sequence of computer code instructions is separated into a plurality of operations, each of which is mapped to a single atom.
  • Each operation may represent a single line of disassembled instruction in the computer code instruction.
  • Each operation is assigned to one of the atoms to be carried out by one of the atoms.
  • There may be one atom per expression in the computer code instructions.
  • Each atom is associated with a type of operation, and is selected to carry out at least one operation in the computer code instructions based on its associated type of operation. For example, an atom may be preconfigured to perform a load operation from a data packet. Therefore, such an atom is assigned to carry out an instruction representing a load operation from a data packet in the computer code.
  • One atom may be selected per line in the computer code instructions. Therefore, when implementing a function in a hardware module containing such atoms, there may be 100s of such atoms, each performing their respective operations so as to perform the function with respect to that data packet.
  • Each atom may be constructed according to one of a set of processing stage templates that determine its associated type of operation/s.
  • the compilation process is configured to generate instructions to control each atom to perform a specific at least one operation based on its associated type. For example, if an atom is preconfigured to perform packet access operations, the compilation process may assign to that atom, an operation to load certain information (e.g. the packet’s source ID) from the header of the packet.
  • the compilation process is configured to send instructions to the hardware module, in which the atoms are configured to perform the operations assigned to them by the compilation process.
  • the processing stage templates that specify an atom's behaviour are logic stage templates (e.g. providing operations over registers, scratch pad memory, and stack, as well as branches), packet access stage templates (e.g. providing packet data loads and/or packet data stores), and map access stage templates (e.g. map lookup algorithms, map table sizes).
  • a packet access stage can comprise at least one of: reading a sequence of bytes from the data packet; replacing one sequence of bytes with a different sequence of bytes in the data packet; inserting bytes into a data packet; and deleting bytes in the data packet.
  • a map access stage can be used to access different types of map (e.g. a lookup table), including direct indexed array and associative array.
  • a map access stage may comprise at least one of: reading a value from a location; writing a value to a location; replacing a value at a location in the map with a different value.
  • a map access stage may comprise a compare operation in which a value is read from a location in the map and compared with a different value. If the value read from the location is less than the different value, then a first action (e.g. do nothing, exchange the value at the location for the different value, or add the values together) may be performed. Otherwise, a second action (e.g. do nothing, exchange or add a value) may be performed. In either case, the value read from the location may be provided to the next processing stage.
  • Each map access stage may be implemented in a stateful processing unit.
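The compare operation of a map access stage described above might be sketched in software as follows. This is an illustrative model only; the function name and the choice of "exchange" and "do nothing" actions are assumptions:

```python
# Sketch of a map access stage's compare operation: read a value from a
# location, compare it with a different value, and perform one of two
# configured actions; the value read is provided to the next processing
# stage in either case.
def map_compare(table, index, other, on_less, on_gte):
    value = table[index]
    if value < other:
        table[index] = on_less(value, other)   # first action
    else:
        table[index] = on_gte(value, other)    # second action
    return value  # passed on to the next processing stage

table = [5, 9]
# First action: exchange the stored value for the supplied value.
# Second action: do nothing (keep the stored value).
read = map_compare(table, 0, other=7,
                   on_less=lambda v, o: o,   # exchange
                   on_gte=lambda v, o: v)    # do nothing
print(read, table)  # 5 [7, 9]
```

Other configured actions (such as adding the values together) would be supplied in the same way, matching the range of first and second actions listed above.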
  • the circuitry 1700 may include a hash function 1710 configured to perform a hash of input values that are used as an input to a lookup table.
  • the circuitry 1700 includes a memory 1720 configured to store state associated with the atom’s operations.
  • the circuitry 1700 includes an arithmetic logic unit 1730 configured to perform an operation.
  • a logic stage may perform computations on the values provided by the preceding stages.
  • the processing units configured to implement a logic stage may be stateless processing units. Each stateless processing unit can perform a simple arithmetic operation. Each processing unit may perform, for example, an 8-bit operation.
  • Each logic stage may be implemented in a stateless processing unit.
  • Figure 18 illustrates an example of circuitry 1800 that may be included in an atom configured to perform processing of a logic stage.
  • the circuitry 1800 comprises an array of arithmetic logic units (ALUs) and multiplexers.
  • the ALUs and multiplexers are arranged in layers, with the outputs of one layer of processing by the ALUs being used by the multiplexers to provide the inputs to the next layer of ALUs.
  • a pipeline of stages implemented in the hardware module may comprise a first packet access stage (pkt0), followed by a first logic stage (logic0), followed by a first map access stage (map0), followed by a second logic stage (logic1), followed by a second packet access stage (pkt1), and so on. It may, therefore, take the following form:
  • Stage pkt0 extracts the required information from the packet. Stage pkt0 passes this information to stage logic0.
  • Stage logic0 determines whether the packet is a valid IP packet. In some cases logic0 forms the map request and sends the map request to map0, which carries out the map operation.
  • Stage map0 may perform an update to the look up table.
  • Stage logic1 then collects the result from the map operation and decides whether to drop the packet as a result.
  • the map request is disabled to cover the case where a map operation should not be performed for this packet.
  • logic0 indicates to logic1 whether or not the packet should be dropped in dependence upon whether or not the packet is a valid IP packet.
  • the look up table contains 256 entries where each entry is an 8-bit value.
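A software sketch of this five-stage example, using the 256-entry table of 8-bit values, might look as follows. The packet layout (IPv4 version nibble, a source byte used as the table index) and the drop policy are illustrative assumptions, not taken from the specification:

```python
# Sketch of pkt0 -> logic0 -> map0 -> logic1 for a packet-counting
# filter over a 256-entry table of 8-bit values.
table = [0] * 256  # map: 256 entries, each an 8-bit counter

def pkt0(packet):
    # Extract the required information from the packet.
    return {"version": packet[0] >> 4, "src_byte": packet[12]}

def logic0(md):
    # Determine validity and form the map request
    # (the request is disabled if the packet is not valid IP).
    md["valid_ip"] = md["version"] == 4
    md["map_request"] = md["src_byte"] if md["valid_ip"] else None
    return md

def map0(md):
    # Perform the map update for this packet, if requested,
    # wrapping at 8 bits as each entry is an 8-bit value.
    if md["map_request"] is not None:
        table[md["map_request"]] = (table[md["map_request"]] + 1) & 0xFF
    return md

def logic1(md):
    # Collect the result and decide whether to drop the packet.
    md["drop"] = not md["valid_ip"]
    return md

def process(packet):
    md = logic1(map0(logic0(pkt0(packet))))
    return "drop" if md["drop"] else "pass"

pkt = bytes([0x45] + [0] * 11 + [0x0A] + [0] * 7)  # IPv4, src byte 0x0A
print(process(pkt), table[0x0A])  # pass 1
```

In the hardware module each of these functions would instead be an atom built from the corresponding processing stage template, with the metadata dictionary carried between stages over the interconnection fabric.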
  • The example described includes only five stages. However, as noted, many more may be used. Furthermore, operations need not all be carried out in sequence, but some operations with respect to the same data packet may be carried out simultaneously by different processing units.
  • the hardware module 610 shown in Figure 6 illustrates a single pipeline of atoms for performing a function with respect to data packets.
  • a hardware module 610 may comprise a plurality of pipelines for processing data packets. Each of the plurality of pipelines may perform a different function with respect to data packets.
  • the hardware module 610 is configurable to interconnect a first set of atoms of the hardware module 610 to form a first data processing pipeline.
  • the hardware module 610 is also configurable to interconnect a second set of atoms of the hardware module 610 to form a second data processing pipeline.
  • a series of steps starting from a sequence of computer code may be carried out.
  • the compiler, which may run on a processor on the host device or on the network interface device, has access to the disassembled sequence of computer code.
  • the compiler is configured to split the sequence of computer code instructions into separate stages.
  • Each stage may comprise operations according to one of the processing stage templates described above. For example, one stage may provide a read from the data packet. One stage may provide an update of map data. Another stage may make a pass drop decision.
  • the compiler assigns each of the plurality of operations expressed by the code to one of the plurality of stages.
  • the compiler is configured to assign each of the processing stages determined from the code to be performed by a different processing unit. This means that each respective at least one operation of a processing stage is carried out by a different processing unit.
  • the output of the compiler can then be used to cause the processing units to perform the operations of each stage in a particular order so as to perform the function.
  • the output of the compiler comprises generated instructions which are used to cause the processing units of the hardware module to carry out the operations associated with each processing stage.
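The splitting of a disassembled instruction sequence into stages, each assigned to an atom of a matching template type, might be sketched as follows. The instruction names, the template mapping, and the one-instruction-per-atom policy are illustrative assumptions:

```python
# Sketch: split a sequence of disassembled instructions into stages and
# assign each stage to an atom of the matching template type.
instructions = [
    ("load", "pkt[14]"),       # read from the data packet
    ("cmp", "r0, 0x0800"),     # logic over registers
    ("map_update", "counts"),  # update of map data
    ("branch", "drop"),        # pass/drop decision
]

TEMPLATE_FOR_OP = {
    "load": "packet_access",
    "store": "packet_access",
    "cmp": "logic",
    "branch": "logic",
    "map_update": "map_access",
    "map_lookup": "map_access",
}

def assign_stages(instrs):
    # One atom is selected per instruction, based on the type of
    # operation that atom is preconfigured to perform.
    stages = []
    for i, (op, arg) in enumerate(instrs):
        stages.append({"atom": i,
                       "template": TEMPLATE_FOR_OP[op],
                       "operation": (op, arg)})
    return stages

for s in assign_stages(instructions):
    print(s["atom"], s["template"], s["operation"][0])
```

The output of such an assignment corresponds to the compiler output described above: a list of per-atom configurations that, applied in order, perform the function.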
  • the output of the compiler may also be used to generate logic in the hardware module that responds to control messages for configuring the hardware module 610. Such control messages are described in more detail below with respect to Figure 14.
  • the compilation process for compiling a function to be executed on the network interface device 600 may be performed in response to determining that the process for providing the function is safe for execution in the kernel of the host device.
  • the determination of the safety of the program may be carried out by a suitable verifier as described above with respect to Figure 3. Once the process has been determined to be safe for execution in the kernel, the process may be compiled for execution in the network interface device.
  • Figure 15 illustrates a representation of at least some of the plurality of processing units that perform their respective at least one operation in order to perform the function with respect to a data packet.
  • a representation may be generated by the compiler and used to configure the hardware module to perform the function.
  • the representation indicates the order in which the operations may be carried out and how some of the processing units perform their operations in parallel.
  • the representation 1500 is in the form of a table having rows and columns. Some of the entries of the table show atoms, e.g. atom 1510a, configured to perform their respective operation.
  • the row to which a processing unit belongs indicates the timing of the operation performed by that processing unit with respect to a particular data packet. Each row may correspond to a single time period represented by one or more cycles of a clock signal. Processing units belonging to the same row, perform their operations in parallel.
  • Inputs to the logic stage are provided in row 0 and computation flows forward into the later rows.
  • an atom receives the result from the processing by the atom in the same columns as itself but in the previous row.
  • atom 1510b receives results from the processing by atom 1510a, and performs its own processing in dependence upon these results.
  • atoms may also access outputs from atoms in the previous row for which the column number differs by no more than two.
  • the atom 1510d may receive the results from the processing performed by atom 1510c.
  • atoms may also access outputs from atoms in the previous two rows and in any column. This may be performed using global routing resources. For example, the atom 1510f may receive the results from the processing performed by atom 1510e.
  • the particular constraints are determined by the topology supported by the interconnection fabric supported by the hardware module.
  • the interconnection fabric is programmed to cause the atoms of the hardware module to execute their operations in a particular order and provide data between each other within the constraints.
  • Figure 15 shows one particular example of how the interconnection fabric may be so programmed.
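The routing constraints just described for Figure 15 can be expressed as a small predicate over (row, column) placements. This is an illustrative reading of the constraints, assuming the local rule (previous row, column differing by at most two) and the global rule (any column in the previous two rows) both apply:

```python
# Sketch of the Figure 15 routing constraints: an atom may take input
# from the previous row if the column number differs by no more than
# two, or, via global routing resources, from any column in the
# previous two rows.
def can_route(src, dst):
    src_row, src_col = src
    dst_row, dst_col = dst
    if dst_row - src_row == 1 and abs(dst_col - src_col) <= 2:
        return True  # local routing: previous row, nearby column
    if 1 <= dst_row - src_row <= 2:
        return True  # global routing: previous two rows, any column
    return False

print(can_route((0, 3), (1, 3)))  # True: same column, next row
print(can_route((0, 0), (1, 2)))  # True: column differs by two
print(can_route((0, 0), (2, 7)))  # True: global routing, two rows down
print(can_route((0, 0), (3, 0)))  # False: more than two rows away
```

A place and route step constrained in this way has a bounded solution space, which is consistent with the short bounded execution time noted below.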
  • a place and route algorithm is used during synthesis of an FPGA application 515 onto an FPGA (as illustrated in Figure 5).
  • the solution space is constrained and so the algorithm has a short bounded execution time.
  • a second at least one processing unit (which may be an FPGA application or a template type of processing unit as described above with respect to Figure 6) may be configured to perform the function with respect to data packets.
  • the function can then be migrated from the first at least one processing unit to the second at least one processing unit, such that the second at least one processing unit then performs the function for subsequently received data packets at the network interface device.
  • the slower compilation time of the second at least one processing unit therefore, does not prevent the network interface device from performing the function with respect to data packets before the function has been compiled for the second at least one processing unit, since the first at least one processing unit can be compiled faster and can be used for performing the function with respect to data packets whilst the function is compiled for the second at least one processing unit. Since the second at least one processing unit typically has a faster processing time, migrating to the second at least one processing unit when it is compiled allows faster processing of the data packets received at the network interface device.
  • the application compilation processes may be configured to run on at least one processor of the data processing system, wherein the at least one processor is configured to send instructions for the first at least one processing unit and the second at least one processing unit to perform the at least one function with respect to a data packet at appropriate times.
  • the at least one processor may comprise a host CPU.
  • the at least one processor may comprise a control processor on the network interface device.
  • the at least one processor may comprise a combination of one or more processors on the host system and one or more processors on the network interface device.
  • the at least one processor is configured to perform a first compilation process to compile a function to be performed by a first at least one processing unit of a network interface device.
  • the at least one processor is also configured to perform a second compilation process to compile the function to be performed by a second at least one processing unit of the network interface device.
  • the at least one processor instructs the first at least one processing unit to perform the function with respect to data packets received from a network.
  • the at least one processor instructs the second at least one processing unit to begin performing the function with respect to data packets received from the network.
  • Performing these steps enables the network interface device to perform the function using the first at least one processing unit (which may have a shorter compile time but slower and/or less efficient processing) whilst waiting for the second compilation process to complete.
  • the network interface device may then perform the function using the second at least one processing unit (which may have a longer compile time but faster and/or more efficient processing) in addition to or instead of the first at least one processing unit.
  • FIG 7 illustrates an example network interface device 700 in accordance with embodiments of the application. Elements like those shown in the previous Figures are indicated with like reference numerals.
  • the network interface device comprises a first at least one processing unit 710.
  • the first at least one processing unit 710 may comprise the hardware module 610 shown in Figure 6, which comprises a plurality of processing units.
  • the first at least one processing unit 710 may comprise one or more CPUs, such as shown in Figure 4.
  • the function is compiled to run on the first at least one processing unit 710 such that, during a first time period, the function is performed by the first at least one processing unit 710 with respect to data packets received from the network.
  • the first at least one processing unit 710 is, prior to completion of the second compilation process for the second at least one processing unit, instructed by the at least one processor to perform the function with respect to data packets received from the network.
  • the network interface device comprises a second at least one processing unit 720.
  • the second at least one processing unit 720 may comprise an FPGA having an FPGA application (such as is illustrated in Figure 5) or may comprise the hardware module 610 shown in Figure 6, which comprises a plurality of processing units.
  • the second compilation process is carried out to compile the function for running on the second at least one processing unit. That is, the network interface device is configured to compile the FPGA application 515 on the fly.
  • the second at least one processing unit 720 is configured to begin performing the function with respect to the data packets received from the network.
  • the first at least one processing unit 710 may cease performing the function with respect to the data packets received from the network. In some embodiments, the first at least one processing unit 710 may, in part, cease performing the function with respect to the data packets. For example, if the first at least one processing unit comprises a plurality of CPUs, subsequent to the first time period, one or more of the CPUs may cease performing the processing with respect to the data packets received from the network, with the remaining CPUs of the plurality of CPUs continuing to perform the processing.
  • the first at least one processing unit 710 may be configured to perform the function with respect to data packets of a first data flow.
  • the second at least one processing unit 720 may begin to perform the function with respect to the data packets of the first data flow.
  • the first at least one processing unit may cease performing the function with respect to the data packets of the first data flow.
  • the first at least one processing unit 710 comprises a plurality of CPUs (as illustrated in Figure 4) whilst the second at least one processing unit 720 comprises a hardware module having a plurality of processing units (as illustrated in Figure 6).
  • the first at least one processing unit 710 comprises a plurality of CPUs (as illustrated in Figure 4) whilst the second at least one processing unit 720 comprises an FPGA (as illustrated in Figure 5).
  • the first at least one processing unit 710 comprises a hardware module having a plurality of processing units (as illustrated in Figure 6) whilst the second at least one processing unit 720 comprises an FPGA (as illustrated in Figure 5).
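The compile-and-migrate sequence described above might be sketched as simple control logic: the quickly compiled first processing unit handles packets while the slower compilation for the second processing unit completes, after which the function migrates. The event names and dispatch model are illustrative assumptions:

```python
# Sketch of migrating a function from a first processing unit (fast to
# compile, slower to process) to a second processing unit (slow to
# compile, faster to process) once its compilation completes.
def run_sequence(events):
    active = None   # which unit currently performs the function
    log = []
    for event in events:
        if event == "first_compiled":
            active = "first"    # e.g. hardware module of atoms, or CPUs
        elif event == "second_compiled":
            active = "second"   # e.g. FPGA application
        elif event == "packet":
            log.append(active)  # unit that performs the function
    return log

events = ["first_compiled", "packet", "packet",
          "second_compiled", "packet"]
print(run_sequence(events))  # ['first', 'first', 'second']
```

A per-flow variant would keep an `active` entry per data flow, matching the embodiment in which the second processing unit takes over the first data flow while the first unit ceases performing the function for it.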
  • FIG 11 illustrates how the connected plurality of processing units 640a, 640b, 640d may perform their respective at least one operation with respect to a data packet.
  • Each of the processing units is configured to perform its respective at least one operation with respect to a received data packet.
  • the at least one operation of each processing unit may represent a logic stage in the function (e.g. a function of an eBPF program).
  • the at least one operation of each processing unit may be expressible by an instruction that is executed by the processing unit.
  • the instruction may determine the behaviour of an atom.
  • Figure 11 illustrates how the packet (P0) progresses along the processing stages implemented by each processing unit.
  • Each processing unit performs processing with respect to the packet in a particular order specified by the compiler.
  • the order may be such that some of the processing units are configured to perform their processing in parallel.
  • This processing may comprise accessing at least part of the packet held in a memory. Additionally or alternatively, this processing may comprise performing a look up into a look up table to determine an action to be carried out for the packet. Additionally or alternatively, this processing may comprise modifying state 1110.
  • the processing units exchange metadata M0, M1, M2, M3 with one another.
  • the first processing unit 640a is configured to perform its respective at least one predefined operation and generate metadata M1 in response.
  • the first processing unit 640a is configured to pass the metadata M1 to the second processing unit 640b.
  • At least some of the processing units perform their respective at least one operation in dependence upon at least one of: the content of the data packet, its own stored state, the global shared state, and metadata (e.g. M0, M1, M2, M3) associated with the data packet.
  • Some of the processing units may be stateless.
  • Each of the processing units may perform its associated type of operation for the data packet (P0) during at least one clock cycle. In some embodiments, each of the processing units may perform its associated type of operation during a single clock cycle. Each of the processing units may be individually clocked for performing their operations. This clocking may be in addition to the clocking of the processing pipeline of processing units.
  • the second processing unit 640b is configured to be connected to the first processing unit 640a configured to perform a first at least one predefined operation with respect to the first data packet.
  • the second processing unit 640b is configured to receive, from the first processing unit 640a, results of the first at least one predefined operation.
  • the second processing unit 640b is configured to perform a second at least one predefined operation in dependence upon the results of the first at least one predefined operation.
  • the second processing unit 640b is configured to be connected to the third processing unit 640d configured to perform a third at least one predefined operation with respect to the first data packet.
  • the second processing unit 640b is configured to send results of the second at least one predefined operation to the third processing unit 640d for processing in the third at least one predefined operation.
  • the processing units may similarly operate in order so as to provide the function with respect to each of a plurality of data packets.
  • Embodiments of the application are such that multiple packets may simultaneously be pipelined if the function permits.
  • a first processing unit 640a is executing its respective at least one operation at a first time (t0) with respect to a third data packet (P2).
  • a second processing unit 640b is executing its respective at least one operation at the first time (t0) with respect to a second data packet (P1).
  • a third processing unit 640d is executing its respective at least one operation at the first time (t0) with respect to a first data packet (P0).
  • each of the packets moves along one stage in the sequence.
  • the first processing unit 640a is executing its respective at least one operation at a second time (t1) with respect to a fourth data packet (P3).
  • the second processing unit 640b is executing its respective at least one operation at the second time (t1) with respect to the third data packet (P2).
  • the third processing unit 640d is executing its respective at least one operation at the second time (t1) with respect to the second data packet (P1).
  • packets may move from one stage to the next, not necessarily in lock step.
  • each of the processing units may be configured to execute a no-operation instruction (i.e. the processing unit stalls) when necessary.
  • operations require one clock cycle to be executed by a processing unit. This can mean that values in shared state that are required by one processing unit have not yet been updated by another processing unit. Out of date values in the shared state 1110 may therefore be read by the processing unit requiring them. Hazards may therefore occur when reading and writing values to shared state. On the other hand, operations on intermediate values may be passed along as metadata without hazards occurring.
  • an increment operation may be an operation to increment a packet counter in shared state 1110.
  • the second processing unit 640b is configured to read the value of a counter from shared state 1110, and provide the output of this read operation (e.g. as metadata M2) to the third processing unit 640d.
  • the third processing unit 640d is configured to receive the value of the counter from the second processing unit 640b.
  • the third processing unit 640d increments this value and writes the new incremented value to the shared state 1110.
  • a problem may occur when executing such an increment operation, which is that if, during the second time slot, the second processing unit 640b attempts to access the counter stored in shared state 1110, the second processing unit 640b may read the previous value of the counter before the counter value in shared state 1110 is updated by the third processing unit 640d.
  • the second processing unit 640b may be stalled during the second time slot (through the execution by the second processing unit 640b of a "no operation" instruction or a pipeline bubble).
  • a stall may be understood to be a delay in the execution of the next instruction. This delay may be implemented by execution of a "no operation" instruction instead of the next instruction.
  • the second processing unit 640b then reads the counter value from shared state 1110 during a following third time slot. During the third time slot, the counter in shared state 1110 has been updated, and so it is ensured that the second processing unit 640b reads the updated value.
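The read-after-write hazard on the shared counter, and the stall that resolves it, can be sketched as a small simulation. This is a hypothetical Python model; the names (`shared_state`, `unit_read`, `unit_increment`) and the slot timing are illustrative only and not taken from the document:

```python
# Illustrative model of the shared-state hazard described above.
# Names and timing are hypothetical.

shared_state = {"counter": 0}

def unit_read(state):
    # second processing unit: reads the counter from shared state
    return state["counter"]

def unit_increment(state, value):
    # third processing unit: increments and writes back in the next slot
    state["counter"] = value + 1

# Time slot 1: unit 640b reads the counter for packet P0 and passes it on.
v = unit_read(shared_state)           # v is 0

# Without a stall, unit 640b would read again for packet P1 in slot 2,
# before unit 640d has written the incremented value back:
stale = unit_read(shared_state)       # still the old value -> hazard

# With a one-slot stall (a "no operation" bubble), the write happens first:
unit_increment(shared_state, v)       # slot 2: 640d writes counter = 1
fresh = unit_read(shared_state)       # slot 3: 640b now reads the new value
```

In this model, `stale` shows the out-of-date read the stall avoids, while `fresh` shows the value read once the write has completed.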
  • the respective atoms are configured to read from the state, update the state and write the updated state during a single pipeline time slot.
  • the stalling of the processing units described above may not be used. However, stalling the processing units may reduce the cost of the memory interface required.
  • the processing units in the pipeline may wait until other processing units in the pipeline have finished their processing before performing their own operations.
  • the compiler builds the data-path by linking an arbitrary number of processing stage instances, where each instance is built from one of a predefined number (three in the example given) of pre-synthesised processing stage templates.
  • the processing stage templates are logic stage templates (e.g. providing arithmetic operations over registers, scratch pad memory, and metadata), packet access state templates (e.g. providing packet data loads and/or packet data stores), and map access stage templates (e.g. map lookup algorithms, map table sizes).
  • Each processing stage instance may be implemented by a single one of the processing units. That is, each processing stage comprises the respective at least one operation carried out by a processing unit.
  • Figure 13 illustrates an example of how the processing stages may be connected together in a pipeline 1300 to process a received data packet.
  • a first data packet is received at and stored in a FIFO 1305.
  • One or more calling arguments are received at a first logic stage 1310.
  • the calling arguments may comprise a program selector which identifies the function to be executed for a received data packet.
  • the calling arguments may comprise an indication of a packet length of the received data packet.
  • the first logic stage 1310 is configured to process the calling arguments and provide an output to the first packet access stage 1315.
  • the first packet access stage 1315 loads data from the first packet at the network tap 1320.
  • the first packet access stage 1315 may also write data to the first packet in dependence upon the output of the first logic stage 1310.
  • the first packet access stage 1315 may write data to the front of the first data packet.
  • the first packet access stage 1315 may overwrite data in the data packet.
  • the loaded data and any other metadata and/or arguments are then provided to the second logic stage 1325, which performs processing with respect to the first data packet and provides output arguments to the first map access stage 1330.
  • the first map access stage 1330 uses the output from the second logic stage 1325 to perform a look up into a lookup table to determine an action to be performed with respect to the first data packet.
  • the output is then passed to a third logic stage 1335, which processes this output and passes the result to a second packet access stage 1340.
  • the second packet access stage 1340 may read data from the first data packet and/or write data to the first data packet in dependence upon the output of the third logic stage 1335. The results of the second packet access stage 1340 are then passed to a fourth logic stage 1345 that is configured to perform processing with respect to the inputs it receives.
  • the pipeline may comprise a plurality of packet access stages, logic stages, and map access stages.
  • a final logic stage 1350 configured to output the return arguments.
  • the return arguments may comprise a pointer identifying the start of a data packet.
  • the return arguments may comprise an indication of an action to be performed with respect to a data packet.
  • the indication of the action may indicate whether or not the packet is to be dropped.
  • the indication of the action may indicate whether or not the packet is to be forwarded to the host system.
  • the network interface device may comprise at least one processing unit configured to drop the respective data packet in response to an indication that the packet is to be dropped.
  • the pipeline 1300 may additionally include one or more bypass FIFOs 1355a, 1355b, 1355c.
  • the bypass FIFOs may be used to pass processing data (e.g. data from the first data packet) around the map access stages and/or packet access stages.
  • the map access stages and/or packet access stages do not require data from the first data packet in order to perform their respective at least one operation.
  • the map access stages and/or packet access stages may perform their respective at least one operation in dependence upon the input arguments.
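The stage sequence of Figure 13 can be sketched as a chain of small transformations. This is a hypothetical Python model; the stage names mirror the figure but every function body, field name and lookup value here is illustrative:

```python
from collections import deque

# Hypothetical model of pipeline 1300: the packet sits in a FIFO while
# arguments flow through logic, packet access and map access stages.
packet_fifo = deque()                 # stands in for FIFO 1305

def logic_stage_1(args):
    # process the calling arguments (program selector, packet length)
    return {"offset": 0, "length": args["packet_len"]}

def packet_access_stage(packet, args):
    # load data from the packet at the network tap
    return packet[args["offset"]:args["offset"] + 4]

def map_access_stage(key, lookup_table):
    # look up an action for this packet, defaulting to "forward"
    return lookup_table.get(key, "forward")

packet = b"\x0a\x0b\x0c\x0d-payload"
packet_fifo.append(packet)

args = logic_stage_1({"program": 1, "packet_len": len(packet)})
loaded = packet_access_stage(packet_fifo[0], args)
action = map_access_stage(loaded, {b"\x0a\x0b\x0c\x0d": "drop"})
```

The returned `action` plays the role of the return arguments: an indication of whether the packet is to be dropped or forwarded.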
  • Figure 8 illustrates a method 800 performed by a network interface device 600, 700 according to embodiments of the application.
  • at S810, a hardware module of the network interface device is arranged to perform a function.
  • the hardware module comprises a plurality of processing units, each configured to perform a type of operation in hardware with respect to a data packet.
  • S810 comprises arranging at least some of the plurality of processing units to perform their respective predefined type of operation in a particular order so as to provide a function with respect to each received data packet.
  • Arranging the hardware module as such comprises connecting at least some of the plurality of processing units such that received data packets undergo processing by each of the respective operations of the at least some of the plurality of processing units. The connecting may be achieved by configuring routing hardware of the hardware module to route the data packets and associated metadata between the processing units.
  • a first data packet is received from the network at a first interface of the network interface device.
  • the first data packet is processed by each of the at least some processing units that were connected during the compilation process in S810.
  • Each of the at least some processing units performs with respect to the at least one data packet the type of operation that it is preconfigured to perform. Hence, the function is performed with respect to the first data packet.
  • the processed first data packet is transferred onwards to its destination. This may comprise sending the data packet to the host. This may comprise sending the data packet over the network.
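The steps of method 800 can be sketched as follows. This is a hypothetical Python sketch in which "connecting" the processing units is modelled as composing an ordered list of operations; the operation names (`checksum_op`, `counter_op`) are invented for illustration:

```python
# Hypothetical sketch of method 800: arrange units in order, then run
# each packet through them so that the function is performed.

def checksum_op(pkt, meta):
    # one processing unit's preconfigured type of operation
    meta["checksum"] = sum(pkt) & 0xFF
    return pkt, meta

def counter_op(pkt, meta):
    # another unit's operation, using its own stored state via metadata
    meta["count"] = meta.get("count", 0) + 1
    return pkt, meta

# S810: arrange units in a particular order so as to provide the function
pipeline = [checksum_op, counter_op]

def process(pkt):
    meta = {}
    for unit in pipeline:      # S830: each unit performs its operation
        pkt, meta = unit(pkt, meta)
    return pkt, meta           # S840: transfer onwards with metadata

pkt, meta = process(b"\x01\x02\x03")
```

Reordering or swapping entries in `pipeline` corresponds to reconfiguring the routing between units to provide a different function.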
  • Figure 9 illustrates a method 900 that may be performed in a network interface device 700 according to embodiments of the application.
  • the first at least one processing unit (i.e. the first circuitry) of the network interface device is configured to receive and process data packets received from over the network. This processing comprises performing the function with respect to the data packets. The processing is performed during a first time period.
  • a second compilation process is performed during the first time period so as to compile the function for performance on a second at least one processing unit (i.e. the second circuitry).
  • the first at least one processing unit ceases performing the function with respect to the received data packets.
  • the first at least one processing unit may cease to perform the function only with regard to certain data flows.
  • the second at least one processing unit may then perform the function (at S950) with regard to those certain data flows instead.
  • the second at least one processing unit is configured to begin performing the function with respect to data packets received from the network.
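The migration in method 900 can be sketched as follows. This is a hypothetical Python model: `first_unit` stands for the circuitry already performing the function, `compile_for_second_unit` stands for the second compilation process, and the switch-over is modelled as rebinding the active handler:

```python
# Hypothetical sketch of method 900: the function runs on the first
# circuitry while a version for the second circuitry is compiled, then
# traffic is switched over.

def first_unit(pkt):
    # first at least one processing unit performing the function
    return pkt.upper()

def compile_for_second_unit():
    # stands in for the second compilation process performed during the
    # first time period; returns the compiled function
    return lambda pkt: pkt.upper()

active = first_unit
out1 = active(b"abc")                 # first time period: first unit serves

second_unit = compile_for_second_unit()
active = second_unit                  # switch-over: first unit ceases
out2 = active(b"abc")                 # second unit now performs the function
```

Because the compiled second version implements the same function, packets processed before and after the switch-over receive identical treatment.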
  • FIG 16 illustrates a method 1600 according to embodiments of the application.
  • the method 1600 could be performed in a network interface device or a host device.
  • a compilation process is performed so as to compile a function to be performed by the first at least one processing unit.
  • a compilation process is performed so as to compile the function to be performed by the second at least one processing unit.
  • This process comprises assigning each of a plurality of processing units of the second at least one processing unit to perform at least one operation associated with a stage of a plurality of stages for processing a data packet so as to provide the first function.
  • Each of the plurality of processing units is configured to perform a type of processing, and the assigning is performed in dependence upon determining that the processing unit is configured to perform a type of processing suitable for performing the respective at least one operation. In other words, the processing units are selected according to their template.
  • an instruction is sent to cause the first at least one processing unit to perform the function. This instruction may be sent before the compilation process in S1620 begins.
  • an instruction is sent to the second circuitry to cause the second circuitry to perform the function with respect to data packets.
  • This instruction may include compiled instructions produced at S1620.
  • the function according to embodiments of the application may be provided as a pluggable component of a processing slice in the network interface. Reference is made to Figure 14, which illustrates an example of how a slice 1425 may be used in the network interface device 600. The slice 1425 may be referred to as a processing pipeline.
  • the network interface device 600 includes a transmit queue 1405 for receiving and storing data packets from the host that are to be processed by the slice 1425 and then transmitted over the network.
  • the network interface device 600 includes a receive queue 1410 for storing data packets received from the network that are to be processed by the slice 1425 and then delivered to the host.
  • the network interface device 600 includes a receive queue 1415 for storing data packets received from the network that have been processed by the slice 1425 and are for delivery to the host.
  • the network interface device 600 includes a transmit queue for storing data packets received from the host that have been processed by the slice 1425 and are for delivery to the network.
  • the slice 1425 of the network interface device 600 comprises a plurality of processing functions for processing data packets on the receive path and the transmit path.
  • the slice 1425 may comprise a protocol stack configured to perform protocol processing of data packets on the receive path and the transmit path.
  • there may be a plurality of slices in the network interface device 600. At least one of the plurality of slices may be configured to process receive data packets received from the network. At least one of the plurality of slices may be configured to process transmit data packets for transmission over the network.
  • the slices may be implemented by hardware processing apparatus, such as at least one FPGA and/or at least one ASIC.
  • Accelerator components 1430a, 1430b, 1430c, 1430d may be inserted at different stages in the slice as shown.
  • the accelerator components each provide a function with respect to a data packet traversing the slice.
  • the accelerator components may be inserted or removed on the fly, i.e. during operation of the network interface device.
  • the accelerator components are, therefore, pluggable components.
  • the accelerator components are logic regions, which are allocated for the slice 1425. Each of them supports a streaming packet interface allowing packets traversing the slice to be streamed in and out of the component.
  • one type of accelerator component may be configured to provide encryption of data packets on the receive or transmit path.
  • Another type of accelerator component may be configured to provide decryption of data packet on the receive or transmit path.
  • the function discussed above that is provided by executing operations performed by a plurality of connected processing units may be provided by an accelerator component.
  • the function provided by an array of network processing CPUs (as discussed above with reference to Figure 4) and/or an FPGA application (as discussed above with reference to Figure 5) may be provided by an accelerator component.
  • the processing performed by a first at least one processing unit may be migrated from a second at least one processing unit.
  • a component providing processing by the first at least one processing unit may be replaced in the slice 1425 by a component providing processing by the second at least one processing unit.
  • the network interface device may comprise a control processor configured to insert and remove the components from the slice 1425.
  • a component for performing the function by a first at least one processing unit may be present in the slice 1425.
  • the control processor may be configured to, subsequent to the first time period: remove the pluggable component providing the function by the first at least one processing unit from the slice 1425 and insert the pluggable component providing the function by the second at least one processing unit into the slice 1425.
  • the control processor may load programs into the component and issue control-plane commands to control the flow of frames into the components. In this case, the components may be caused to operate or not operate without being inserted into or removed from the pipeline.
  • control plane or configuration information is carried over the data path, rather than requiring separate control buses.
  • requests to update the configuration of data path components are encoded as messages which are carried over the same buses as network packets.
  • the data path may carry two types of packets: network packets and control packets.
  • Control packets are formed by the control processor, and injected into the slice 1425 using the same mechanism that is used to send or receive data packets using a slice 1425. This same mechanism may be a transmit queue or receive queue. Control packets may be distinguished from network packets in any suitable way. In some embodiments, the different types of packets may be distinguished by a bit or bits in a metadata word.
  • control packets contain a routing field in the metadata word that determines the path that the control packet takes through the slice 1425.
  • a control packet may carry a sequence of control commands. Each control command may target one or more components of the slice 1425. The respective data path component is identified by a component ID field. Each control command encodes a request for the respective identified component. The request may be to make changes to the configuration of that component. The request may control whether or not the component is activated, i.e. whether or not the component performs its function with respect to data packets traversing the slice.
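A possible encoding of such a control packet is sketched below. This is a hypothetical Python sketch: the field widths, the flag value and the request codes are all invented for illustration; the document only specifies that a metadata word distinguishes control from network packets, carries a routing field, and that commands carry a component ID and a request:

```python
import struct

# Hypothetical field layout: one flag byte, one routing byte, then a
# sequence of (component ID, request) command pairs.
CTRL_FLAG = 0x1

def make_control_packet(route, commands):
    # metadata word: control/network flag bit plus routing field
    meta = struct.pack("<BB", CTRL_FLAG, route)
    body = b"".join(struct.pack("<BB", comp_id, request)
                    for comp_id, request in commands)
    return meta + body

def is_control(packet):
    # distinguish control packets from network packets by the flag bit
    return packet[0] & CTRL_FLAG == CTRL_FLAG

# e.g. activate component 7 (request 1) and deactivate component 9 (request 0)
pkt = make_control_packet(route=3, commands=[(7, 1), (9, 0)])
```

Such a packet would be injected into the slice through the same transmit or receive queue mechanism as ordinary data packets.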
  • the control processor of the network interface device 600 is configured to send a message to cause one of the components of the slice to start performing the function with respect to data packets received at the network interface device.
  • This message is a control plane message that is sent through the pluggable components and which causes the atomic switch over of frames into the component for performing the function. This component then executes on all received data packets traversing the slice until it is switched out.
  • the control processor is configured to send a message to cause another of the components of the slice to cause this component to cease performing the function with respect to data packets received at the network interface device 600.
  • sockets may be present at various points in the ingress and egress data path.
  • the control processor may plumb additional logic into and out of the slice 1425. This additional logic may take the form of FIFOs placed between the components.
  • the control processor may send control plane messages through the slice 1425 to configure components of the slice 1425.
  • the configuration may determine the function performed by components of the slice 1425.
  • a control message sent through the slice 1425 may cause the hardware module to be configured to perform a function with respect to data packets.
  • Such a control message may cause the atoms of the hardware module to be interconnected into a pipeline of the hardware module so as to provide a certain function.
  • Such a control message may cause the individual atoms of the hardware module to be configured so as to select an operation to be performed by the individually selected atoms. Since each atom is pre-configured to perform a type of operation, the selecting of the operation for each atom is made in dependence upon the type of operation that each atom is pre-configured to perform.
  • a packet processing program or a feed forward pipeline is run in an FPGA.
  • a method for causing subunits of the FPGA to implement the packet processing program or a feedforward pipeline will be described.
  • the packet processing program or feed forward pipeline may be an eBPF program or a P4 program or any other suitable program.
  • This FPGA may be provided in a network interface device.
  • the packet processing program is deployed or run only after the network interface device is installed with respect to its host.
  • the packet processing program or feedforward pipeline may implement a logic flow with no loops.
  • the program may be written in an unprivileged domain or a lower privileged domain such as in the user level.
  • the program may be run on privileged or a higher privileged domain such as a kernel.
  • the hardware running the program may require that there are no arbitrary loops.
  • Some embodiments may be provided in the context of an FPGA, an ASIC or any other suitable hardware device. Some embodiments use sub-units of the FPGA or ASIC or the like. The following example is described with reference to an FPGA. It should be appreciated that a similar process may be performed with an ASIC or any other suitable hardware device.
  • the sub-units may be atoms. Some examples of atoms have been previously described. It should be appreciated that any of those previously described examples of atoms may alternatively or additionally be used as sub-units. Alternatively or additionally, these sub-units may be referred to as "slices" or configurable logic blocks.
  • Each of these sub-units may be configured to perform a single instruction or a plurality of related instructions.
  • the related instructions may provide a single output (which may be defined by one or more bits).
  • a sub-unit can be considered to be a compute unit.
  • the sub-units may be arranged in a pipeline where the packets are processed in order.
  • the sub-units can be dynamically assigned to execute a respective instruction (or instructions) in a program.
  • the sub-unit may be all or part of a unit which is used to define the blocks of, for example, an FPGA.
  • the blocks of the FPGA are referred to as slices.
  • a sub-unit or atom equates to a slice.
  • the compiling may be to the atom level. This may have the advantage that processing is pipelined. The packets may be processed in order. The compilation process may be performed relatively quickly.
  • an arithmetic operation may require one slice per byte.
  • a logic operation may require half a slice per byte.
  • a shift operation may require a collection of slices depending on the width of the shift operation.
  • a compare operation may require one slice per byte.
  • a select operation may require half a slice per byte.
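The per-byte figures above can be used for a rough resource estimate. The sketch below is a hypothetical Python helper built only from the ratios listed (one slice per byte for arithmetic and compare, half a slice per byte for logic and select); the shift case, which depends on the shift width, is deliberately omitted:

```python
# Rough slice-count estimator using the per-byte figures above.
# The table and function names are illustrative.
SLICES_PER_BYTE = {
    "arith": 1.0,     # one slice per byte
    "logic": 0.5,     # half a slice per byte
    "compare": 1.0,   # one slice per byte
    "select": 0.5,    # half a slice per byte
}

def estimate_slices(ops):
    """ops: iterable of (operation, operand_width_in_bytes) pairs."""
    return sum(SLICES_PER_BYTE[op] * width for op, width in ops)

# e.g. a 4-byte add, a 4-byte AND and a 4-byte compare:
total = estimate_slices([("arith", 4), ("logic", 4), ("compare", 4)])
```

Such an estimate could feed the placing step, indicating how many physical sub-units a program is likely to consume.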
  • placing and routing is performed. Placing is the allocating of a particular physical sub-unit to perform a particular instruction or instructions. Routing ensures that the output or outputs of a particular sub-unit are routed to the correct destination, which may for example be another sub-unit or sub-units.
  • the placing and routing may use a process where operations are assigned to particular subunits starting from one end of the pipeline. In some embodiments, the most critical operations may be placed before less critical operations. In some embodiments, the routing may be assigned at the same time that particular operations are being placed. In some embodiments, the routes may be selected from a limited set of pre-computed routes. This will be described in more detail later.
  • where a suitable sub-unit or route is not available, the operation will be held for later placement.
  • the pre-computed routes may be byte wide routes. However, this is by way of example only and in other embodiments, different widths of routes may be defined. In some embodiments, there may be a plurality of different sized routes provided.
  • the routing may be limited to routing between nearby sub units.
  • the sub units may be physically arranged in a regular structure on the FPGA.
  • rules may be made as to how the sub-units may communicate. For example, a sub-unit may only provide an output to a sub-unit which is next to it, above it or below it.
  • limits may be placed on how far away the next sub-unit is, for the purposes of routing.
  • a sub unit may output data only to an adjacent sub unit or a sub unit which is within a defined distance (e.g. there is no more than one intervening sub unit).
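The adjacency rule above can be expressed as a simple predicate. This is a hypothetical Python sketch using a one-dimensional index per sub-unit; the distance limit of two (at most one intervening sub-unit) follows the example given:

```python
# Hypothetical routing-rule check: a sub-unit may only feed a sub-unit
# within a defined distance (here: at most one intervening sub-unit).
MAX_DISTANCE = 2

def may_route(src_index, dst_index, max_distance=MAX_DISTANCE):
    # a sub-unit cannot feed itself, and the destination must be close by
    return 0 < abs(dst_index - src_index) <= max_distance
```

A placer obeying this rule would reject any assignment whose output would need to travel further than the permitted distance.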
  • the FPGA may have one or more "static" regions and one or more "dynamic" regions.
  • the static region provides a standard configuration and the dynamic region may provide functions in accordance with the requirements of the end user.
  • the static part may for example be defined before an end-user receives the network interface device, for example before the network interface device is installed with respect to the host.
  • the static region may be configured to cause the network interface device to provide certain functions.
  • the static region will be provided with precomputed routes between the atoms. As will be discussed in more detail later, there may be routing between one or more static regions which passes through one or more dynamic regions.
  • the dynamic regions may be configured by the end user in dependence on their requirements, when the network interface device is deployed with respect to the host.
  • the dynamic regions may be configured to perform different functions for the end user over the course of time.
  • a first compilation process is performed to provide a first bit file which is referred to as the main bit file 50 and a tool checkpoint 52. This is the bit file for at least a part of the static region in some embodiments.
  • a bit file, when downloaded to the FPGA, causes the FPGA to function as specified in the program from which the bit file has been compiled.
  • the program which is used in the first compilation process may be any one or more programs or may be a test program which is specifically designed to assist in the determining of the routing within a part of the FPGA.
  • a series of simple programs may alternatively or additionally be used.
  • a program may be modified or have a reconfigurable partition which can be used by the compiler.
  • the program might be modified to make the job of the compiler easier by moving nets out of the reconfigurable partition.
  • Step S1 may be performed in a design tool.
  • the Vivado tool may be used with Xilinx FPGAs.
  • the checkpoint file may be provided by the design tool.
  • the checkpoint file represents a snapshot of a design at the point at which the bit file is generated.
  • the checkpoint file may comprise one or more of a synthesized netlist, design constraints, placement information and routing information.
  • the bit file is analysed taking into account the checkpoint file to provide a bit file description 54.
  • the analysis may be to one or more of: detect resources, generate routes, check timing, generate one or more partial bit files and generate a bit file description.
  • the analysis may be configured to extract routing information from the bit file.
  • the analysis may be configured to determine which wires or routes the signals have propagated.
  • the analysis phase may be performed at least partially in a synthesizing or design tool.
  • a scripting tool of Vivado may be used.
  • the scripting tool may be TCL (tool command language).
  • TCL can be used to add or modify the capabilities of Vivado.
  • the functions of Vivado may be invoked and controlled by TCL scripts.
  • the bit file description 54 defines how a given part of the FPGA can be used. For example, the bit file description will indicate which atom can be routed to which other atoms and one or more routes by which it is possible to route between those atoms. For example for each atom, the bit file description will indicate where the inputs to that atom can come from and where the outputs from that atom can be routed to along with one or more routes for the output of data.
  • the bit file description is independent of any program.
  • a bit file description may contain one or more of route information, an indication of which pairs of routes conflict and a description of how to generate a bit file from the required configuration of atoms.
  • the bit file description may provide a set of routes available between a set of atoms but before any specific instruction has been performed by a given atom.
  • the bit file description may be for a portion of the FPGA.
  • the bit file description may be for a portion of the FPGA which is dynamic.
  • the bit file description will include which routes are available and/or which routes are unavailable.
  • the bit file may indicate, for the dynamic part of the FPGA, which routes are available taking into account any routing across the dynamic part of the FPGA required by, for example, the static part(s) of the FPGA.
  • the bit file description may be obtained in any suitable way.
  • a bit file description may be provided by the provider of the FPGA or ASIC.
  • the bit file description may be provided by the design tool.
  • the analysis step may be omitted.
  • the design tool may output a bit file description.
  • the bit file description may be for the static part of the FPGA including any required routing across the dynamic part of the FPGA.
  • any other suitable technique may be used to generate a bit file description.
  • the tool which is used to design the FPGA is used to provide the analysis which is used to generate the bit file.
  • the tools may be specific to the product or a range of products in some embodiments.
  • a provider of an FPGA may provide an associated tool for managing that FPGA.
  • a generic scripting tool may be used.
  • a different tool or different technique may be used to determine a partial bit file.
  • the main bit file may be analysed in order to determine which bits correspond to which features. This may require a plurality of partial bit files to be generated.
  • step S3 is performed when the network interface device is installed with respect to a host and is carried out on the physical FPGA device.
  • Steps S1 and S2 may be performed as part of the design synthesis process to produce the bit file image which implements the network interface device.
  • steps S1 and/or S2 are used to characterise the behaviour of the FPGA. Once the FPGA has been characterised, the bit file description is stored in memory for all physical network interface devices which are to operate in a given defined manner.
  • at step S3, a compilation is performed using the bit file description and the eBPF program.
  • the output of the compilation is a partial bit file for the eBPF program.
  • the compiling will add the routes to the partial bit file and the programming to be performed by individual ones of the slices.
  • the bit file description may be provided in the system which is deployed.
  • the bit file description may be stored in memory.
  • the bit file description may be stored on the FPGA, on a network interface device or on the host device.
  • the bit file description is stored in flash memory or the like, connected to the FPGA on the network interface device.
  • the flash memory may also contain the main bit file.
  • the eBPF program may be stored with the bit file description or separately.
  • the eBPF program may be stored on the FPGA, on a network interface device or on the host.
  • the program may be transferred from a user-mode program to a kernel, both running on the host.
  • the kernel would transfer the program to the device driver which would then transfer it to the compiler, either running on the host or the network interface device.
  • an eBPF program may be stored on the network interface device so that it can be run before the host OS has booted.
  • the compiler may be provided at any suitable location on the network interface device, FPGA or host.
  • the compiler may be run on a CPU on the network interface device.
  • the compiler flow will now be described.
  • the front end of the compiler receives an eBPF program.
  • the eBPF program may be written in any suitable language.
  • the eBPF program may be written in a C type language.
  • the compiler is configured at the front end to convert the program to an intermediate representation (IR).
  • the IR may be a LLVM-IR or any other suitable IR.
  • pointer analysis may be performed to create packet/map access primitives.
  • an optimization of the IR may be performed by the compiler. This may be optional in some embodiments.
  • the high level synthesis backend of the compiler is configured to split a program pipeline into stages, generate packet access taps and emit C code.
  • the HLS part of the design tool and/or the design tool being used may be invoked to synthesise the output of the HLS phase.
  • the compiler backend for the FPGA atoms splits the pipeline into stages and generates packet access taps. If-conversion may be performed to convert control dependencies to data dependencies. The design is placed and routed. The partial bit file for the eBPF program is emitted.
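The stage-splitting step described above can be illustrated with a toy sketch. This is not the patent's compiler; the function `split_into_stages` and its instruction representation are invented for illustration. Each instruction is assigned to the earliest pipeline stage in which all of its operands are available:

```python
# Hypothetical sketch: split a linear instruction list into pipeline stages.
# An instruction can only execute once all of its operands are available,
# so its stage is one more than the latest stage among its producers.

def split_into_stages(instructions):
    """instructions: list of (dest, [sources]) tuples in program order."""
    ready_at = {}  # register name -> stage in which it is produced
    stages = {}    # instruction index -> assigned stage
    for i, (dest, sources) in enumerate(instructions):
        stage = max((ready_at.get(s, 0) for s in sources), default=0) + 1
        stages[i] = stage
        ready_at[dest] = stage
    return stages

# r1 = r1 + r2 depends only on initial inputs (stage 1); r1 = r1 + r3
# depends on the new r1 (stage 2), mirroring the two-add example later on.
program = [("r1", ["r1", "r2"]), ("r1", ["r1", "r3"])]
```

Under this model, the second instruction lands one stage later than the first, which is why delay registers are needed for any other values it consumes.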
  • Routing issues could arise, such as shown in Figure 20a, where there is a routing conflict.
  • slice A may communicate with slice C and slice B may communicate with slice D.
  • a common routing part 60 has been allocated to the communication between slice A and slice C as well as to the communication between slice B and D. In some embodiments, this routing conflict may be avoided.
  • In Figure 20b, as can be seen, a separate route 62 is provided between slice A and slice C, as compared to the route 64 between slice B and slice D.
  • the bit file description may include a plurality of different routes for at least some pairs of sub-units.
  • the compiling process will check for routing conflicts such as shown in Figure 20a. In the case of routing conflicts, the compiler can resolve or avoid such conflicts by choosing an appropriate alternative one of the routes.
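As an illustration of this conflict-avoidance idea (a hypothetical sketch, not the actual compiler; `allocate_routes` and the resource labels are invented), a route can be chosen greedily from each pair's list of alternatives, skipping any route whose resources are already claimed:

```python
# Hypothetical sketch: resolve routing conflicts by choosing an alternative
# route from the bit file description. Each pair of slices has a list of
# candidate routes; a route is usable only if none of its routing resources
# has already been claimed by an earlier allocation.

def allocate_routes(requests, route_options):
    """requests: list of (src, dst); route_options: {(src, dst): [set_of_resources, ...]}"""
    used = set()
    chosen = {}
    for pair in requests:
        for route in route_options[pair]:
            if not (route & used):   # no shared resource: no conflict
                used |= route
                chosen[pair] = route
                break
        else:
            raise ValueError(f"no conflict-free route for {pair}")
    return chosen

# A->C and B->D both list the shared part 60 first; the allocator falls
# back to the dedicated route 64 for the second request.
options = {
    ("A", "C"): [{"part60"}, {"route62"}],
    ("B", "D"): [{"part60"}, {"route64"}],
}
chosen = allocate_routes([("A", "C"), ("B", "D")], options)
```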
  • Figure 21 shows a partition 66 in the FPGA for performing the eBPF program.
  • the partition will, for example, interface with the static part of the FPGA via a series of input flip-flops 68 and a series of output flip-flops. In some embodiments, there may be routing 70 across the design as previously discussed.
  • the compiler may need to deal with routing across the area of the FPGA which is being configured by the compiler.
  • the compiler needs to generate a partial bit file which fits into a reconfigurable partition within a main bit file.
  • the design tool will avoid using logic resources within the reconfigurable partition so that those resources can be used by the partial bit file.
  • the design tool may not be able to avoid using routing resources within the reconfigurable partition.
  • the analysis tool will need to avoid using the routing resources which have been used by the design tool in the main bit file.
  • the analysis tool may need to make sure its list of available routes in the bit file description does not include any which use resources being used by the main bit file.
  • the available routes may be defined in terms of route templates which can be used at a large number of places within the FPGA since the FPGA is highly regular.
  • the routing resources used by the main bit file break the regularity and mean that the analysis tool avoids using those templates in the places where they would conflict with the main bit file.
  • the analysis tool may need to generate new route templates which can be used in those places and/or prevent certain route templates from being used in particular locations.
  • Some embodiments may use any suitable synthesis tool for generating the bit file description.
  • some embodiments may make use of Bluespec tools, which are based on a model which uses atomic transactions for hardware.
  • the eBPF program fragment has two instructions:
  • the first instruction adds the number in register 1 (r1) to the number in register 2 (r2) and places the result in r1.
  • the second instruction adds r1 to r3 and places the result in r1. Both instructions in this example use 64-bit registers but only use the lowest 32 bits. The upper 32 bits of the results are filled with zeros.
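The semantics of this two-instruction fragment can be checked with a short sketch (assuming the 32-bit wrap-around and zero-extension behaviour described above; `add32` is an invented name):

```python
# Sketch of the fragment's semantics: 32-bit adds held in 64-bit
# registers, with the upper 32 bits of each result zeroed.
MASK32 = 0xFFFFFFFF

def add32(a, b):
    return (a + b) & MASK32   # result zero-extended to 64 bits

r1, r2, r3 = 0xFFFFFFFF, 1, 5
r1 = add32(r1, r2)   # instruction 1: r1 = r1 + r2 (wraps to 0)
r1 = add32(r1, r3)   # instruction 2: r1 = r1 + r3
```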
  • a 32-bit add instruction requires 32 pairs of lookup tables (LUTs), a 32-bit carry chain and 32 flip-flops.
  • Each pair of lookup tables will add two bits to produce a 2-bit result.
  • the carry chain is the structure which allows a bit to be carried from one digit column to the next during an addition and allows a bit to be borrowed from the next column during a subtraction.
  • the 32 flip-flops are storage elements which accept a value on one clock cycle and reproduce it on the next clock cycle. These may be used to limit the amount of work done per clock cycle and to simplify the timing analysis.
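A bit-level model of such an adder (a sketch of the LUT-pair-plus-carry-chain structure described above, not the FPGA netlist itself) can be written as a ripple-carry loop:

```python
# Hypothetical bit-level model of the 32-bit adder: each "pair of LUTs"
# adds one bit from each operand plus the carry arriving on the carry
# chain, producing a sum bit and the carry passed to the next bit.

def ripple_add32(a, b):
    carry = 0
    result = 0
    for bit in range(32):                    # 32 LUT pairs, one per bit
        x = (a >> bit) & 1
        y = (b >> bit) & 1
        s = x ^ y ^ carry                    # sum bit from the LUT pair
        carry = (x & y) | (carry & (x ^ y))  # carry chain to the next bit
        result |= s << bit
    return result                            # carry out of bit 31 is dropped
```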
  • the FPGA may comprise a number of slices.
  • the carry chain propagates from the bottom of the slice (CIN) to the top of the slice (COUT) which then connects to the CIN input of the next slice up.
  • each slice has a 4-bit carry chain
  • eight slices are used to perform a 32-bit addition.
  • an atom may be considered to be provided by a pair of slices. This is because it may be convenient in some embodiments for an atom to operate on 8- bit values.
  • each slice has an 8-bit carry chain
  • four slices are used to perform a 32-bit addition.
  • an atom may be considered to be provided by a slice.
  • XnYm indicates the position of the atom in the arrangement.
  • Xn indicates the column and Ym indicates the row.
  • X6Y0 indicates that the slice is in column 6 and in row 0. It should be appreciated that any other suitable numbering scheme may be used in other embodiments.
  • the result of the first instruction needs to be calculated by four adjacent slices in the same column so that the carry chain connects up correctly.
  • the compiler might choose to calculate that result in slices X7Y0, X7Y1, X7Y2 and X7Y3.
  • the inputs need to be connected up. There would be a connection from X6Y0 to X7Y0, another from X6Y1 to X7Y1, one from X6Y2 to X7Y2 and one from X6Y3 to X7Y3. There also need to be corresponding connections from X6Y4-X6Y7 to X7Y0-X7Y3.
  • The output from slice X6Y0 flip-flop 0 is connected to input 0 of slice X7Y0 LUT 0.
  • The output from slice X6Y0 flip-flop 1 is connected to input 0 of slice X7Y0 LUT 1.
  • The output from slice X6Y0 flip-flop 7 is connected to input 0 of slice X7Y0 LUT 7.
  • the r1 and r2 values from slices X6Y0-X6Y7 will be transferred to the inputs of slices X7Y0-X7Y3, will be processed by the LUTs and the carry chain and the result will be stored in the flip-flops of those slices (X7Y0-X7Y3), ready to be used on the next cycle.
  • the compiler needs to choose a place to calculate the result of instruction 2. It might choose slices X7Y4 to X7Y7. Again, there would be full-byte connections from the result of instruction 1 (X7Y0 to X7Y3) to the inputs for instruction 2 (X7Y4 to X7Y7).
  • The value of r3 is also required. If r1, r2 and r3 were produced in cycle 0 then r1+r2 would be produced in cycle 1. The value of r3 needs to be delayed by a clock cycle so that it is produced in cycle 1.
  • the compiler might choose to produce r3 in cycle 1 using slices X7Y8 to X7Y11. There would then need to be a connection from the original slices which produced r3 in cycle 0 (X6Y8 to X6Y11) to the new slices which produce the same value in cycle 1 (X7Y8 to X7Y11). Having done that, there now needs to be a connection from those new slices to the slices for instruction 2. So the outputs from slice X7Y8 connect to inputs of slice X7Y4 and so on.
  • the FPGA bit file would then contain the following features:
  • the compiler does not need to produce the upper 32 bits of the result of instruction 2 since they are known to be zero. It can just make a note of that fact and use zero whenever they are used.
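The cycle-by-cycle timing of this example, with r3 delayed by one cycle so that it lines up with r1+r2, can be sketched as follows (a toy model assuming one register stage per column; `pipeline` is an invented name):

```python
# Hypothetical timing model: each column of slices is one register stage.
# r1+r2 appears in cycle 1; r3 is copied through an extra column so it
# also appears in cycle 1; the final sum appears in cycle 2.

def pipeline(r1, r2, r3, cycles=2):
    sum_reg = None   # X7Y0-X7Y3: r1 + r2, valid from cycle 1
    r3_reg = None    # X7Y8-X7Y11: delayed copy of r3, valid from cycle 1
    result = None    # X7Y4-X7Y7: (r1 + r2) + r3, valid from cycle 2
    for _ in range(cycles):
        # values computed this cycle use the registers written last cycle
        result = None if sum_reg is None else sum_reg + r3_reg
        sum_reg, r3_reg = r1 + r2, r3
    return result
```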
  • the first instruction performs a bitwise-AND of r1 with the constant 0xff and places the result in r1.
  • a given bit in the result will be set to one if the corresponding bit was originally set to one in r1 and the corresponding bit is set to one in the constant. It will be set to zero otherwise.
  • the constant 0xff has bits 0 to 7 set and has bits 8 to 63 clear, so the result will be that bits 0 to 7 of r1 will be unchanged but bits 8 to 63 will be set to zero. This simplifies things for the compiler since the compiler understands that bits 8 to 63 are zero and does not need to produce them.
  • the second instruction does the same thing to r2.
  • Instruction 3 checks whether r1 is less than r2 and jumps to label L1 if it is. This skips instruction 4. Instruction 4 simply copies the value from r2 into r1. This sequence of instructions finds the minimum value of r1 byte 0 and r2 byte 0, placing the result in r1 byte 0.
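The effect of this four-instruction sequence can be sketched directly (assuming the byte-masking and minimum semantics described above; `min_byte` is an invented name):

```python
# Sketch of the four-instruction sequence: mask both registers to byte 0,
# then keep the smaller of the two values in r1.

def min_byte(r1, r2):
    r1 &= 0xFF            # instruction 1: r1 = r1 & 0xff
    r2 &= 0xFF            # instruction 2: r2 = r2 & 0xff
    if not (r1 < r2):     # instruction 3: jump to L1 skips instruction 4
        r1 = r2           # instruction 4: r1 = r2
    return r1             # r1 byte 0 now holds the minimum
```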
  • the compiler may use a technique known as "if conversion" to turn the conditional jump into a select instruction:
  • Instruction 5 compares r1 with r2, setting c1 to one if r1 is less than r2 and setting c1 to zero otherwise.
  • Instruction 6 is the select instruction which copies r1 into r1 (which has no effect) if c1 is set and copies r2 to r1 otherwise. If c1 is equal to one then instruction 3 would have skipped instruction 4 which means that r1 would keep its value from instruction 1. In this case, the select instruction also keeps r1 unchanged. If c1 is equal to zero then instruction 3 would not have skipped instruction 4, so r2 would be copied into r1 by instruction 4. Again, the select instruction will copy r2 into r1 so the new sequence has the same effect as the old sequence.
  • Instruction 6 is not a valid eBPF instruction. However, the instructions are expressed in LLVM-IR while the compiler is working on them. Instruction 6 would be a valid instruction in LLVM-IR.
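The equivalence that justifies if-conversion can be checked with a small sketch: the branchy form (instructions 3-4) and the compare-and-select form (instructions 5-6) always agree (function names invented for illustration):

```python
# Sketch of if-conversion: the branchy sequence and the branch-free
# compare-and-select sequence compute the same value, which is what lets
# the compiler turn the control dependency into a data dependency.

def branchy(r1, r2):
    if r1 < r2:           # instruction 3 jumps over instruction 4
        return r1
    return r2             # instruction 4 copies r2 into r1

def selected(r1, r2):
    c1 = 1 if r1 < r2 else 0     # instruction 5: compare
    return r1 if c1 else r2      # instruction 6: select
```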
  • the compiler might then choose to calculate the result of instruction 5 in slice X1Y0.
  • a full-byte connection is required from the output of slice X0Y0 to input 0 of slice X1Y0 and a full-byte connection from the output of slice X0Y8 to input 1 of slice X1Y0.
  • the way to compare two values is to subtract one from the other and see if the calculation overflows by trying to borrow from the next bit up. The result of this comparison then gets stored in flip-flop 7 of slice X1Y0.
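This subtract-and-borrow comparison can be modelled for 8-bit values as follows (a sketch; `less_than_u8` is an invented name):

```python
# Hypothetical model of the comparison: subtract r2 from r1 as 8-bit
# values and test whether the subtraction tries to borrow from bit 8.
# A borrow occurs exactly when r1 < r2, and that borrow bit becomes c1.

def less_than_u8(r1, r2):
    diff = (r1 & 0xFF) - (r2 & 0xFF)
    return (diff & 0x100) >> 8   # bit 8 set iff the subtraction borrowed
```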
  • r1 and r2 will need to be delayed by a cycle to present the values at the right time to instruction 6.
  • the compiler might use slices X1Y1 and X1Y2 for r1 and r2 respectively.
  • the select instruction needs three inputs: c1, r1 and r2. Note that r1 and r2 are one byte wide, but c1 is only one bit wide.
  • the compiler calculates the result of the select instruction in slice X2Y0. The selection is performed on a bit by bit basis with each LUT in slice X2Y0 handling one bit:
  • bit 0 of the result is r1 bit 0 if c1 is set and r2 bit 0 otherwise.
  • bit 1 of the result is r1 bit 1 if c1 is set and r2 bit 1 otherwise.
  • bit 7 of the result is r1 bit 7 if c1 is set and r2 bit 7 otherwise.
  • Each LUT may need access to the corresponding bit from r1 and the corresponding bit from r2, but all of the LUTs need access to c1. This means that c1 needs to be replicated across the bits of input 0 of the slice. So the connections for the inputs of instruction 6 would be: replicate bit 7 of the output of slice X1Y0 to input 0 of slice X2Y0.
  • the inputs and outputs here are those of the connection.
  • the input of the connection is from the output of the first slice.
  • the output of the connection goes to the input of the second slice.
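The per-bit selection with c1 replicated across the byte can be modelled as a bitwise multiplex (a sketch, not the LUT configuration itself; `select_byte` is an invented name):

```python
# Hypothetical model of the per-bit select: c1 (one bit) is replicated
# across all 8 bit positions, and each LUT picks the r1 bit when its
# copy of c1 is set and the r2 bit otherwise.

def select_byte(c1, r1, r2):
    mask = 0xFF if c1 else 0x00          # c1 replicated across the byte
    return (r1 & mask) | (r2 & ~mask & 0xFF)
```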
  • the compiler can assume that the 16-bit input value has been produced by two adjacent slices in the same column since the compiler can make sure the values are produced there.
  • Slice X1Y4 bit 0 is known to be zero so is not needed
  • Slice X1Y4 bit 1 is known to be zero so is not needed
  • Slice X1Y4 bit 2 is known to be zero so is not needed
  • Slice X1Y4 bit 3 is known to be zero so is not needed
  • Slice X1Y4 bit 4 is known to be zero so is not needed
  • Slice X1Y4 bit 5 is from slice X0Y4 bit 0
  • Slice X1Y4 bit 6 is from slice X0Y4 bit 1
  • Slice X1Y4 bit 7 is from slice X0Y4 bit 2
  • Slice X1Y5 bit 0 is from slice X0Y4 bit 3
  • Slice X1Y5 bit 1 is from slice X0Y4 bit 4
  • Slice X1Y5 bit 2 is from slice X0Y4 bit 5
  • Slice X1Y5 bit 3 is from slice X0Y4 bit 6
  • Slice X1Y5 bit 4 is from slice X0Y4 bit 7
  • Slice X1Y5 bit 5 is from slice X0Y5 bit 0
  • Slice X1Y5 bit 6 is from slice X0Y5 bit 1
  • Slice X1Y5 bit 7 is from slice X0Y5 bit 2
  • the 8 connections to the inputs of slice X1Y5 can be regarded as a shifted connection or shifted route.
  • the same structure can be used for slice X1Y4, but with inputs from X1Y3 and X1Y4 since bits 5-7 are matched and the slice can ignore bits 0-4 so it does not matter what input is presented there.
  • a connection shifting by 0 bits or 8 bits is just the same as a full byte connection since each bit connects to the corresponding bit of another slice in that case.
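A shifted connection of this kind can be modelled as taking a byte-wide window out of the 16 bits produced by two adjacent slices (a sketch; the shift of 3 matches the bit-by-bit listing above, and `shifted_route` is an invented name):

```python
# Hypothetical model of a "shifted connection": the byte presented to
# slice X1Y5 is taken from the 16 bits produced by the two adjacent
# slices X0Y4 (low byte) and X0Y5 (high byte), shifted right by 3.

def shifted_route(low_byte, high_byte, shift):
    word = (high_byte << 8) | low_byte
    return (word >> shift) & 0xFF

# e.g. output bit 0 comes from X0Y4 bit 3, ... output bit 5 from X0Y5
# bit 0, matching the listing above for a shift of 3. A shift of 0 or 8
# degenerates into a full-byte connection.
```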
  • Shifting by a variable amount may be done in two or three stages, depending on the width of the value being shifted.
  • the stages are:
  • Stage 1: shift by 0, 1, 2 or 3.
  • Stage 2: shift by 0, 4, 8 or 12.
  • Stage 3: shift by 0, 16, 32 or 48 (32-bit or 64-bit only).
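These fixed-shift stages compose into an arbitrary shift because the per-stage amounts sum to the full shift amount; as a sketch (`staged_shift_right` is an invented name):

```python
# Hypothetical model of a variable shift built from fixed-shift stages:
# stage 1 handles the low two bits of the shift amount, stage 2 the next
# two bits, and stage 3 (32/64-bit values only) the top bits.

def staged_shift_right(value, amount, width=64):
    value &= (1 << width) - 1
    value >>= amount & 0x3       # stage 1: shift by 0, 1, 2 or 3
    value >>= amount & 0xC       # stage 2: shift by 0, 4, 8 or 12
    if width > 16:
        value >>= amount & 0x30  # stage 3: shift by 0, 16, 32 or 48
    return value
```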
  • the arithmetic shift right requires an "arithmetic shift right" type of connection. This type of connection takes the outputs of one slice and connects them to the inputs of another slice, but shifts them right by a constant amount in the process, replicating the sign bit as necessary. For example, an "arithmetic shift right by 3" connection would have:
  • Output bit 0 is from input bit 3
  • Output bit 1 is from input bit 4
  • Output bit 2 is from input bit 5
  • Output bit 3 is from input bit 6
  • Output bit 4 is from input bit 7
  • Output bit 5 is from input bit 7 (the sign bit)
  • Output bit 6 is from input bit 7 (the sign bit)
  • Output bit 7 is from input bit 7 (the sign bit)
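This fixed wiring pattern can be modelled bit by bit (a sketch of the connection table above; `asr3_connection` is an invented name):

```python
# Hypothetical model of the "arithmetic shift right by 3" connection:
# each output bit is wired to a fixed input bit, with the sign bit
# (bit 7) replicated into the top three positions.

def asr3_connection(byte):
    out = 0
    for i in range(8):
        src = min(i + 3, 7)            # bits 5-7 all read the sign bit
        out |= ((byte >> src) & 1) << i
    return out
```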
  • Stage 1 might be calculated in slice X4Y2, in which case it would need the following connections:
  • Slice X4Y2 would then be configured to select one of the first four inputs based on input 4 and input 5 as follows:
  • Input 4 is 0 and input 5 is 0: select input 0
  • Input 4 is 1 and input 5 is 0: select input 1
  • Input 4 is 0 and input 5 is 1: select input 2
  • Input 4 is 1 and input 5 is 1: select input 3
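This selection behaviour is a 4-to-1 multiplexer with input 4 as the low select bit; as a sketch (`mux4` is an invented name):

```python
# Hypothetical model of the stage-1 selection: inputs 4 and 5 carry the
# low two bits of the shift amount and pick one of the four pre-shifted
# inputs.

def mux4(inputs, sel4, sel5):
    index = sel4 | (sel5 << 1)   # input 4 is the low select bit
    return inputs[index]
```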
  • the shift amount may be copied from slice X3Y3 to slice X4Y3 to provide a delayed version.
  • Stage 2 might be calculated in slice X5Y2, in which case it would need the following connections:
  • Slice X5Y2 would then be configured to select input 0 or input 1 based on input 2 as follows:
  • Input 2 is 0: select input 0
  • Input 2 is 1: select input 1
  • the output of slice X5Y2 will be the result of the variable arithmetic shift right operation.
  • a bit file for a given atom may be as follows:
  • since the FPGA is a regular structure, there may be a common template which can be used for a plurality of atoms, with modifications for individual ones of the atoms where necessary.
  • the bit file description for slice X7Y1 may specify the following possible inputs and outputs:
  • the compiler would use this bit file description to provide the partial bit file for the inputs and outputs of slice X7Y1 for the previously described first eBPF example, e.g. input from X6Y1 via route A.
  • a bit file description for slice XnYm may specify the following possible inputs and outputs:
  • This bit file description may be modified to remove one or more routes which are not available for the compiler to use, such as previously described. This may be because the route is used by another atom or is used for routing across the partition.
  • the compiler may be implemented by a computer program comprising computer executable instructions which may be executed by one or more computer processors.
  • the compiler may run on hardware such as at least one processor operating in conjunction with one or more memories.
  • embodiments may thus vary within the scope of the attached claims.
  • some embodiments may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although embodiments are not limited thereto.
  • the embodiments may be implemented by computer software stored in a memory and executable by at least one data processor of the involved entities or by hardware, or by a combination of software and hardware.
  • the software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Stored Programmes (AREA)
  • Advance Control (AREA)
  • Logic Circuits (AREA)
  • Multi Processors (AREA)

Abstract

The invention concerns a network interface device having a hardware module comprising a plurality of processing units. Each of the plurality of processing units is associated with its own at least one predefined operation. At compile time, the hardware module is configured by arranging at least some of the plurality of processing units to perform their respective at least one operation with respect to a data packet in a certain order so as to perform a function with respect to that data packet. A compiler is used to assign different processing steps to each processing unit. A controller is provided to switch from one processing circuit to another on the fly, such that one processing circuit can be in use while another is being compiled.
PCT/EP2019/080281 2018-11-05 2019-11-05 Dispositif d'interface réseau WO2020094664A1 (fr)

Priority Applications (5)

Application Number Priority Date Filing Date Title
JP2021523691A JP2022512879A (ja) 2018-11-05 2019-11-05 ネットワークインターフェースデバイス
EP19798619.3A EP3877851A1 (fr) 2018-11-05 2019-11-05 Dispositif d'interface réseau
KR1020217017269A KR20210088652A (ko) 2018-11-05 2019-11-05 네트워크 인터페이스 디바이스
CN201980087757.XA CN113272793A (zh) 2018-11-05 2019-11-05 网络接口设备
JP2024083450A JP2024116163A (ja) 2018-11-05 2024-05-22 ネットワークインターフェースデバイス

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US16/180,883 US11012411B2 (en) 2018-11-05 2018-11-05 Network interface device
US16/180,883 2018-11-05
US16/395,027 US11082364B2 (en) 2019-04-25 2019-04-25 Network interface device
US16/395,027 2019-04-25

Publications (1)

Publication Number Publication Date
WO2020094664A1 true WO2020094664A1 (fr) 2020-05-14

Family

ID=68470520

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2019/080281 WO2020094664A1 (fr) 2018-11-05 2019-11-05 Dispositif d'interface réseau

Country Status (5)

Country Link
EP (1) EP3877851A1 (fr)
JP (2) JP2022512879A (fr)
KR (1) KR20210088652A (fr)
CN (1) CN113272793A (fr)
WO (1) WO2020094664A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230099370A1 (en) * 2021-09-28 2023-03-30 Cisco Technology, Inc. Network flow attribution in service mesh environments

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN115309406B (zh) * 2022-09-30 2022-12-20 北京大禹智芯科技有限公司 P4控制分支语句的性能优化方法和装置

Citations (1)

Publication number Priority date Publication date Assignee Title
US9940284B1 (en) * 2015-03-30 2018-04-10 Amazon Technologies, Inc. Streaming interconnect architecture

Family Cites Families (13)

Publication number Priority date Publication date Assignee Title
US6798239B2 (en) * 2001-09-28 2004-09-28 Xilinx, Inc. Programmable gate array having interconnecting logic to support embedded fixed logic circuitry
JP2005130165A (ja) * 2003-10-23 2005-05-19 Nippon Telegr & Teleph Corp <Ntt> リコンフィギュアブルデバイスを用いたパケット処理装置
US7861067B1 (en) * 2004-10-28 2010-12-28 Nvidia Corporation Adjustable cycle pipeline system and method
JP4456552B2 (ja) * 2005-03-31 2010-04-28 富士通株式会社 動的代替機能を持つ論理集積回路、これを用いた情報処理装置及び論理集積回路の動的代替方法
JP2007241918A (ja) * 2006-03-13 2007-09-20 Fujitsu Ltd プロセッサ装置
JP4740828B2 (ja) * 2006-11-24 2011-08-03 株式会社日立製作所 情報処理装置及び情報処理システム
DE102007022970A1 (de) * 2007-05-16 2008-11-20 Rohde & Schwarz Gmbh & Co. Kg Verfahren und Vorrichtung zur dynamischen Rekonfiguration eines Funkkommunikationssystems
US20090213946A1 (en) * 2008-02-25 2009-08-27 Xilinx, Inc. Partial reconfiguration for a mimo-ofdm communication system
US8743877B2 (en) * 2009-12-21 2014-06-03 Steven L. Pope Header processing engine
US8667192B2 (en) * 2011-02-28 2014-03-04 Xilinx, Inc. Integrated circuit with programmable circuitry and an embedded processor system
US8874837B2 (en) * 2011-11-08 2014-10-28 Xilinx, Inc. Embedded memory and dedicated processor structure within an integrated circuit
US9450881B2 (en) * 2013-07-09 2016-09-20 Intel Corporation Method and system for traffic metering to limit a received packet rate
US9130559B1 (en) * 2014-09-24 2015-09-08 Xilinx, Inc. Programmable IC with safety sub-system

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
US9940284B1 (en) * 2015-03-30 2018-04-10 Amazon Technologies, Inc. Streaming interconnect architecture

Non-Patent Citations (1)

Title
BOJIE LI ET AL: "ClickNP", ACM SIGCOMM 2016 CONFERENCE, ACM, 2 PENN PLAZA, SUITE 701 NEW YORK NY 10121-0701 USA, 22 August 2016 (2016-08-22), pages 1 - 14, XP058276533, ISBN: 978-1-4503-4193-6, DOI: 10.1145/2934872.2934897 *

Cited By (2)

Publication number Priority date Publication date Assignee Title
US20230099370A1 (en) * 2021-09-28 2023-03-30 Cisco Technology, Inc. Network flow attribution in service mesh environments
US11677650B2 (en) * 2021-09-28 2023-06-13 Cisco Technology, Inc. Network flow attribution in service mesh environments

Also Published As

Publication number Publication date
EP3877851A1 (fr) 2021-09-15
CN113272793A (zh) 2021-08-17
JP2022512879A (ja) 2022-02-07
JP2024116163A (ja) 2024-08-27
KR20210088652A (ko) 2021-07-14

Similar Documents

Publication Publication Date Title
US11824830B2 (en) Network interface device
US11082364B2 (en) Network interface device
Li et al. Clicknp: Highly flexible and high performance network processing with reconfigurable hardware
US10866842B2 (en) Synthesis path for transforming concurrent programs into hardware deployable on FPGA-based cloud infrastructures
US11687327B2 (en) Control and reconfiguration of data flow graphs on heterogeneous computing platform
US11709664B2 (en) Anti-congestion flow control for reconfigurable processors
US20220058005A1 (en) Dataflow graph programming environment for a heterogenous processing system
US9542244B2 (en) Systems and methods for performing primitive tasks using specialized processors
US20100037035A1 (en) Generating An Executable Version Of An Application Using A Distributed Compiler Operating On A Plurality Of Compute Nodes
US11113030B1 (en) Constraints for applications in a heterogeneous programming environment
JP2024116163A (ja) ネットワークインターフェースデバイス
Sivaraman et al. Packet transactions: A programming model for data-plane algorithms at hardware speed
US9411613B1 (en) Systems and methods for managing execution of specialized processors
Elakhras et al. Straight to the queue: Fast load-store queue allocation in dataflow circuits
Contini et al. Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication
US11449347B1 (en) Time-multiplexed implementation of hardware accelerated functions in a programmable integrated circuit
US20240184552A1 (en) Compilers and compiling methods field
US20230388373A1 (en) Load Balancing System for the Execution of Applications on Reconfigurable Processors
Tarafdar A Heterogeneous Development Stack for a Re-Configurable Data Centre
Bandara Component design for application-directed FPGA system generation frameworks
Tarafdar A Hetorogeneous Stack for a Re-configurable Data Centre
Sanaullah Towards hardware as a reconfigurable, elastic, and specialized service
Ebcioglu et al. Highly Parallel Multi-FPGA System Compilation from Sequential C/C++ Code in the AWS Cloud
Flamand Task partitioning and scheduling on dynamically
Espenshade Scalable framework for heterogeneous clustering of commodity FPGAs

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19798619

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021523691

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 20217017269

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2019798619

Country of ref document: EP

Effective date: 20210607