WO2007047866A2 - Use of cascaded devices to increase the oversubscription ratio - Google Patents
- Publication number
- WO2007047866A2 (PCT/US2006/040928)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- queue
- data
- port
- oversubscription
- priority
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L12/00—Data switching networks
- H04L12/28—Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
- H04L12/46—Interconnection of networks
- H04L12/4641—Virtual LANs, VLANs, e.g. virtual private networks [VPN]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L47/00—Traffic control in data switching networks
- H04L47/10—Flow control; Congestion control
Definitions
- the invention relates to the field of data transmission and more specifically to managing data when an Ethernet network is oversubscribed.
- the invention allows a user to increase the oversubscription beyond what is possible with a single device by cascading multiple devices.
- Fig. 1A shows the OSI reference model.
- Fig. 2A is the oversubscription block diagram.
- Fig. 3A is the overall process flow chart.
- Fig. 4A is Ethernet frame format with vLAN tag.
- Fig. 4B shows the round-robin approach to enqueueing.
- Fig. 5 is vLAN priority to queue mapping table.
- Fig. 6 shows the WRED drop probability graph.
- Fig. 7 is the frame drop behavior table.
- Fig. 8 shows the MDRR approach.
- Fig. 9 shows an arrangement of cascaded devices.
- Fig. 10 is the case of a full-rate application.
- Fig. 11 shows a 2:1 oversubscription application.
- Fig. 12 is the overall block diagram of this invention.
- Fig. 13 shows an N:1 oversubscription application.
- Fig. 14 shows an N:1 ingress buffering arrangement.
- An embodiment of the present invention aggregates a large quantity of data and manages an oversubscribed data transmission system.
- the data enters the device from an 8 port Physical Layer (PHY) device by way of a Reduced Medium Independent Interface (RMII) or Reduced Gigabit Medium Independent Interface (RGMII) through a Media Access Control (MAC) device.
- PHY Physical Layer
- RMII Reduced Medium Independent Interface
- RGMII Reduced Gigabit Medium Independent Interface
- MAC Media Access Control
- Up to three 8 port PHY devices may be used.
- the incoming data are then classified into high and low priority according to the priority level contained in their virtual Local Area Network (vLAN) tag.
- the prioritized data are then processed through a Weighted Random Early Detection (WRED) routine.
- the WRED routine prevents congestion before it occurs by dropping some data and passing other data according to predetermined criteria.
- the passed data are written into the memory, which is divided into 480 buffers (blocks) of 1 Kbyte (KB) each.
- the buffers are further classified into a free list and an allocation list.
- the data are written into the memory by the Receive Write Memory manager.
- Each port on the device of this invention accommodates a high priority queue and a low priority queue, with the low priority queue being allocated up to 48 blocks and the high priority queue up to 32 blocks.
- the stored data are read by the Receive Read Memory Manager, with each port being serviced in round robin fashion; within a port, the high and low priority queues are serviced using a Modified Deficit Round Robin (MDRR) approach.
- MDRR Modified Deficit Round Robin
- the data are then transmitted out of the device via an SPI 4.2 or similar approach.
- Shown in Fig. 1A is an International Standards Organization (ISO) reference model for standardizing communications systems called the Open Systems Interconnect (OSI) Reference Model.
- ISO International Standards Organization
- OSI Open Systems Interconnect
- the OSI architecture defines the communications process as a set of seven layers, with specific functions isolated and associated with each layer. The layer isolation permits the characteristics of a given layer to change without impacting the other layers, provided that the supporting services remain the same. Each layer consists of a set of functions designed to provide a defined series of services.
- Layer 1, the physical layer (PHY), is a set of rules that specifies the electrical and physical connections between devices. This level specifies the cable connections and the electrical rules necessary to transfer data between devices. It typically takes a data stream from an Ethernet Media Access Controller (MAC) and transforms it into electrical or optical signals for transmission across a specified physical medium.
- MAC Ethernet Media Access Controller
- PHY governs the attachment of data terminal equipment, such as the serial ports of personal computers, to data communications equipment, such as modems.
- Layer 2, the data link layer, denotes how a device gains access to the medium specified in the physical layer. It defines data formats, including the framing of data within transmitted messages, error control procedures, and other link control activities. Since it defines data formats, including procedures to correct transmission errors, this layer becomes responsible for reliable delivery of information.
- Layer 3, the network layer, is responsible for arranging a logical connection between the source and the destination nodes on the network. This includes the selection and management of a route for the flow of information between source and destination, based on the available data paths in the networks.
- Layer 4, the transport layer, assures that the transfer of information occurs correctly after a route has been established through the network by the network level protocol.
- Layer 5, the session layer, provides a set of rules for establishing and terminating data streams between nodes in a network. These include establishing and terminating node connections, message flow control, dialogue control, and end-to-end data control.
- Layer 6, the presentation layer, addresses data transformation, formatting, and syntax. One of the primary functions of this layer is the conversion of transmitted data into a display format appropriate for a receiving device.
- Layer 7, the application layer, acts as a window through which the application gains access to all the services provided by the model. This layer typically performs such functions as file transfers, resource sharing and database access.
- each layer appends appropriate heading information to frames of information flowing within the network, while removing the heading information added by the preceding layer.
- Shown in Fig. 2A is the overall basic block diagram 10 of the interaction of the typical embodiment of the device of this invention 14 and other communications components.
- the data enters from the line side via a PHY 12 device (ingress) and may flow bi-directionally.
- PHY 12 is an 8 port device capable of operating at 10 megabits per second (Mbps), 100 Mbps or 1 gigabit per second (Gbps) on each port, resulting in a total of 24 Gbps for 24 ports when three such PHY devices are used.
- the information from the PHY 12 is transmitted to the device 14 via interface 20, typically Reduced Medium-Independent Interface (RMII) or Reduced Gigabit Medium-Independent Interface (RGMII).
- the device 14 aggregates the information from all 24 ports and transmits it to Network Processor Unit (NPU) 18 via System Packet Interface Level 4 Phase 2 (SPI 4.2) or a device of similar capability.
- SPI 4.2 System Packet Interface Level 4 Phase 2
- the device 14 may be oversubscribed by a ratio of up to 8:1 on the line side.
- the data is then directed from NPU 18 to suitable switch fabric on the system back-plane.
- Shown in Fig. 3A is the general process flow chart applicable to each port for the data being transmitted between the PHY 12 and the switch fabric.
- the data enters the device 14 via a generally available Media Access Control Device (MAC) 32.
- the MAC 32 may be integrated with the device 14 or it may be a separate unit. In general terms MAC 32 or a similar device is employed to control the access when there is a possibility that two or more devices may want to use a common communication channel.
- device 14 employs up to 24 MACs 32.
- the Ethernet data stream is typically transmitted to the ingress side of device 14 in Ethernet frame format 60 with a virtual Local Area Network (vLAN) tag 62 shown in Fig. 4A.
- the Ethernet frame 60 conforms with IEEE 802.1Q frame format.
- vLAN virtual Local Area Network
- the primary purpose of the vLAN tag 62 is to determine the priority of the incoming data traffic based on Class of Service (CoS) and classify it accordingly.
- the components of the vLAN tag 62 are: Tag Control Identifier (TCI) 64, Priority field 66 (typically 3 bits of data per IEEE 802.1p standard), Canonical Format Identifier 68 and vLAN identity information 70 (typically 12 bits of data).
- TCI Tag Control Identifier
- Priority field 66 typically 3 bits of data per IEEE 802.1p standard
- Canonical Format Identifier 68 typically 1 bit of data.
- vLAN identity information 70 typically 12 bits of data.
- the vLAN 62 makes it appear that a set of stations and applications are connected to a single physical LAN when in fact they are actually not.
- the receiving station can determine the type of the frame and correctly interpret the data carried in the frame.
- One with skill in the art would be able to program the type of routine needed to retrieve this information.
- the value of the 16 bits following the source address is examined. If the value is greater than 1500, an Ethernet frame is indicated. If the value is 8100 hexadecimal (0x8100), then the frame is an IEEE 802.1Q tagged frame and the software would look further into the tag to determine the vLAN identification and other information.
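The type check described here can be sketched in a few lines. This is a minimal illustration in Python, assuming a raw frame with at most one tag; the function name `parse_vlan_tag` and the returned dictionary keys are hypothetical and do not come from the patent. The value 8100 is read as hexadecimal (0x8100), the IEEE 802.1Q tag type, and the field widths follow the 3-bit, 1-bit and 12-bit layout of that standard.

```python
import struct

TPID_8021Q = 0x8100  # type value that marks an IEEE 802.1Q tagged frame


def parse_vlan_tag(frame: bytes):
    """Return the tag fields of an 802.1Q frame, or None if the frame is untagged."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    # Bytes 0-5: destination address, bytes 6-11: source address,
    # bytes 12-13: the 16 bits that follow the source address.
    (type_field,) = struct.unpack_from("!H", frame, 12)
    if type_field != TPID_8021Q:
        return None
    if len(frame) < 18:
        raise ValueError("tagged frame is truncated")
    (tag,) = struct.unpack_from("!H", frame, 14)
    return {
        "priority": tag >> 13,       # 3-bit priority field
        "cfi": (tag >> 12) & 0x1,    # 1-bit Canonical Format Identifier
        "vlan_id": tag & 0x0FFF,     # 12-bit vLAN identity
    }
```

A caller would pass the bytes received from the MAC and treat a `None` result as an untagged frame.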
- All ingress ports are scanned in round robin fashion, resulting in an equitable process for selecting ports for enqueueing, i.e. for entering the device 14. This is shown in Fig. 4B. Multiple priority queues are associated with each port. Some queues are used for high priority traffic and some for low priority traffic.
- the oversubscription logic of device 14 obtains priority designation from the vLAN priority field 66 of the vLAN tag 62.
- the 3-bit vLAN priority field 66 indexes into a user programmable table that provides the lookup needed to determine the priority level. Typically, the upper four of the eight priority levels are mapped into a high priority queue and the lower four priority levels are mapped into a low priority queue. If there is no vLAN tag 62, all levels default to a single queue.
- FIG. 5A shows vLAN priority field 66 mapping table and the Class of Service (CoS) priority mapping register.
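The priority lookup can be illustrated with a small table. The sketch below is in Python and assumes the typical mapping stated in the text (upper four levels to the high priority queue, lower four to the low priority queue); the names `DEFAULT_PRIORITY_MAP` and `select_queue` are hypothetical, and sending untagged frames to the low priority queue is an assumption, since the text only says that they default to a single queue.

```python
HIGH, LOW = "high", "low"

# User-programmable table indexed by the 3-bit priority value (0-7).
# Assumed default: lower four levels -> low queue, upper four -> high queue.
DEFAULT_PRIORITY_MAP = [LOW, LOW, LOW, LOW, HIGH, HIGH, HIGH, HIGH]


def select_queue(priority, priority_map=DEFAULT_PRIORITY_MAP, untagged_queue=LOW):
    """Pick the queue for a frame; priority is None when the frame has no vLAN tag."""
    if priority is None:
        # Assumption: the single default queue for untagged frames is the low one.
        return untagged_queue
    if not 0 <= priority <= 7:
        raise ValueError("priority must be a 3-bit value")
    return priority_map[priority]
```

Because the table is an ordinary list, a different mapping can be programmed by passing another eight-entry list as `priority_map`.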
- the device 14 also employs an IEEE 802.3-2000 compliant flow control mechanism. Each RGMII port with its MAC will perform independent flow control processing.
- the basic mechanism uses the PAUSE frames per the 802.3x specification. Each of the high and low priority queues associated with each port is programmed with a desired threshold value. When this value is exceeded, a PAUSE frame is generated and sent to a remote upstream node.
- the device 14 provides two different options for the PAUSE frame. In the first option, a 16-bit programmable timer value is sent in the PAUSE frame, this value being used by the receiver as a pause quantum. No further PAUSE frames are sent.
- once the pause time has elapsed, the transmission begins again.
- in the second option, the MAC sends a PAUSE frame when the threshold is exceeded and another PAUSE frame with a zero pause quantum when the buffers go below the threshold, signifying that the port is ready to receive data again.
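The two PAUSE behaviours can be modelled as a small state machine per queue. The following Python sketch is only an illustration under stated assumptions: the class name `PauseController`, the `send_zero_quanta` flag and the convention of returning the pause quantum to transmit (or `None` when no frame is sent) are all hypothetical and are not taken from the patent.

```python
class PauseController:
    """Tracks one queue and decides when a PAUSE frame should be sent."""

    def __init__(self, threshold, timer_value, send_zero_quanta=False):
        if not 0 <= timer_value <= 0xFFFF:
            raise ValueError("timer value must fit in 16 bits")
        self.threshold = threshold
        self.timer_value = timer_value
        # False: send one timed PAUSE frame only.
        # True: also send a zero-quantum PAUSE frame when the queue drains.
        self.send_zero_quanta = send_zero_quanta
        self.paused = False

    def update(self, queue_depth):
        """Return the pause quantum to send now, or None if no frame is needed."""
        if queue_depth > self.threshold and not self.paused:
            self.paused = True
            return self.timer_value
        if queue_depth < self.threshold and self.paused:
            self.paused = False
            return 0 if self.send_zero_quanta else None
        return None
```

Calling `update` each time the queue depth changes yields at most one frame per threshold crossing, which matches the statement that no further PAUSE frames are sent while the queue stays above the threshold.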
- WRED Weighted Random Early Detection
- Random Early Detection aims to control the average queue size by indicating to the end hosts when they should temporarily slow down transmission of packets.
- RED takes advantage of the congestion control mechanism of Transmission Control Protocol (TCP).
- TCP Transmission Control Protocol
- RED communicates to the packet source to decrease its transmission rate. Assuming the packet source is using TCP, it will decrease its transmission rate until all the packets reach their destination, indicating that the congestion is cleared. Additionally, TCP not only pauses, but it also restarts quickly and adapts its transmission rate to the rate that the network can support.
- RED distributes losses in time and maintains a normally low queue depth while absorbing spikes. When enabled on an interface, RED begins dropping packets at a pre-selected rate when congestion occurs.
- the packet drop probability is based on the minimum threshold, maximum threshold, and mark probability denominator.
- when the average queue size exceeds the minimum threshold, RED starts dropping packets.
- the rate of packet drop increases linearly as the average queue size increases until the average queue size reaches the maximum threshold.
- the mark probability denominator is the fraction of packets dropped when the average queue depth is at the maximum threshold. For example, if the denominator is 256, one out of every 256 packets is dropped when the average queue is at the maximum threshold. When the average queue size is above the maximum threshold, all packets are dropped.
- the minimum threshold value should be set high enough to maximize the link utilization. If the minimum threshold is too low, packets may be dropped unnecessarily, and the transmission link will not be fully used. If the difference between the maximum and minimum thresholds is too small, many packets may be dropped at once.
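The relationship between the two thresholds and the mark probability denominator can be written as a short function. This is a minimal Python sketch of the behaviour described above (no drops below the minimum threshold, a linear ramp up to one in `mark_prob_denominator` at the maximum threshold, and all packets dropped above it); the function and parameter names are illustrative only.

```python
def red_drop_probability(avg_queue, min_th, max_th, mark_prob_denominator):
    """Drop probability for a given average queue size under plain RED."""
    if min_th >= max_th:
        raise ValueError("min_th must be smaller than max_th")
    if avg_queue < min_th:
        return 0.0                      # below the minimum threshold: never drop
    if avg_queue > max_th:
        return 1.0                      # above the maximum threshold: drop everything
    max_p = 1.0 / mark_prob_denominator  # fraction dropped at the maximum threshold
    return max_p * (avg_queue - min_th) / (max_th - min_th)
```

With thresholds of 20 and 40 and a denominator of 256, an average queue size of 30 gives a drop probability of 1/512, halfway up the ramp.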
- WRED 38 combines the capabilities of the RED algorithm with the Internet Protocol (IP) precedence feature to provide for preferential traffic handling of higher priority packets.
- WRED 38 can selectively discard lower priority traffic when the interface begins to get congested and provide differentiated performance characteristics for different classes of service.
- WRED 38 can also be configured to ignore IP precedence when making drop decisions so that non-weighted RED behavior is achieved.
- WRED 38 differs from other congestion avoidance techniques such as queueing strategies because it attempts to anticipate and avoid congestion rather than control congestion once it occurs. WRED 38 makes early detection of congestion possible and provides for multiple classes of traffic.
- WRED 38 communicates to the packet source to decrease its transmission rate. If the packet source is using TCP, it will decrease its transmission rate until all the packets reach their destination, which indicates that the congestion is cleared.
- the average queue size is based on the previous average and the current size of the queue.
- the WRED 38 process will be slow to start dropping packets, but it may continue dropping packets for a time after the actual queue size has fallen below the minimum threshold (Kbytes). The slow-moving average will accommodate temporary bursts in traffic.
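One common way to compute such a slow-moving average is an exponentially weighted moving average. The Python sketch below assumes that form; the text only states that the average depends on the previous average and the current queue size, so the weighting by a power of two and the parameter name `weight_exponent` are assumptions made for illustration.

```python
def update_average(previous_avg, current_size, weight_exponent):
    """Blend the current queue size into the running average.

    Assumed form: avg = previous * (1 - 2**-n) + current * 2**-n,
    where n is the (hypothetical) weight exponent.
    """
    weight = 2.0 ** -weight_exponent
    return previous_avg * (1.0 - weight) + current_size * weight
```

A larger exponent makes the average react more slowly, which is consistent with the behaviour described: dropping starts late during a burst and may continue briefly after the actual queue has drained.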
- WRED 38 provides up to four programmable thresholds (watermarks) associated with each of the two queues. Corresponding to four thresholds, four programmable probability levels are provided creating four threshold-probability pairs. This relationship is shown in Fig. 6A, where Probability of
- the threshold is the value on queue level (queue depth) and the corresponding probability is the probability of dropping a frame if the corresponding threshold is exceeded. It is also possible to set thresholds on some ports to guarantee no frame drops. This option is possible for only a subset of ports operating in the 1Gbps mode.
- the value of constant K determines how big the probability of drop is for a given queue filling over the threshold Q_th.
- the device 14 supports four programmable watermarks per queue and, based on each level P_n, the probability for drop is calculated for the next sequence.
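The four threshold-probability pairs can be read as a step function over queue depth. The Python sketch below assumes that the probability of the highest exceeded watermark applies, which the text does not state explicitly; `drop_probability`, `should_drop` and the example pair values are hypothetical.

```python
import random


def drop_probability(queue_depth, pairs):
    """Return the drop probability for a queue depth.

    pairs: up to four (threshold, probability) tuples, one per watermark.
    Assumption: the highest threshold that is exceeded selects the probability.
    """
    probability = 0.0
    for threshold, p in sorted(pairs):
        if queue_depth > threshold:
            probability = p
    return probability


def should_drop(queue_depth, pairs, rng=random.random):
    """Randomly decide whether to drop a frame; rng returns a float in [0, 1)."""
    return rng() < drop_probability(queue_depth, pairs)
```

Setting every probability to zero for a port reproduces the "guaranteed no frame drops" configuration mentioned above.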
- the frames which are not dropped are written into the device 14 memory, such memory being either internal or external to the device 14.
- the thresholds for the low and high priority queues are programmed in the device 14 registers.
- the device 14 utilizes the CfgRegRxPauseWredLpThr and CfgRegRxPauseWredHpThr registers. Associated probabilities are programmed into registers: CfgRegRxWredLpProb and CfgRegRxPauseWredHpThr. A person skilled in the art will be able to properly define such registers.
- Fig. 7 shows combination of probability and threshold levels used and the corresponding frame drop behavior.
- Memory manager 44 is organized as a pool of preferably 1 Kbyte buffers (or blocks) for a minimum of 480 blocks in case of a 24 port device 14.
- the 1 Kbyte buffer size enables easy memory allocation from ports that have small amount or no data arriving to them to other ports that are more occupied and need the memory.
- the buffers can be further classified into an allocation list and a free list. Each port has two allocation lists, one for the high priority queue and the other for the low priority queue. The high priority queue can occupy between 1 and 32 blocks unless there is no priority mechanism and all packets fall into one queue.
- the low priority queue can occupy between 1 and 48 blocks.
- the size of the low priority queue is larger than the high priority queue because the high priority queue is serviced more frequently.
- the buffers are reserved as soon as the data transmission starts, i.e., as soon as the vLAN tag has been read and the data are classified into the high or low priority queue.
- the unoccupied buffers are kept in a free list and signify the amount of memory remaining after the total of 480 Kbytes has been reduced by the blocks held on the allocation lists.
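The free list and the per-port allocation lists can be modelled as follows. This Python sketch uses the block counts given in the text (480 blocks in total, up to 32 per high priority queue and 48 per low priority queue); the class `BufferPool` and its method names are hypothetical and only illustrate the bookkeeping, not the hardware design.

```python
from collections import deque


class BufferPool:
    """Pool of 1 Kbyte blocks split into a free list and allocation lists."""

    def __init__(self, total_blocks=480, max_high=32, max_low=48):
        self.free_list = deque(range(total_blocks))
        self.limits = {"high": max_high, "low": max_low}
        self.alloc = {}  # (port, queue) -> deque of block indices in arrival order

    def allocate(self, port, queue):
        """Reserve one block for (port, queue); return its index or None if refused."""
        owned = self.alloc.setdefault((port, queue), deque())
        if len(owned) >= self.limits[queue] or not self.free_list:
            return None
        block = self.free_list.popleft()
        owned.append(block)
        return block

    def release(self, port, queue):
        """Return the oldest block of (port, queue) to the free list."""
        owned = self.alloc.get((port, queue))
        if not owned:
            return None
        block = owned.popleft()
        self.free_list.append(block)
        return block

    def free_blocks(self):
        """Number of blocks currently on the free list."""
        return len(self.free_list)
```

A port that receives no traffic holds no blocks, so its share stays on the free list and is available to busier ports.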
- the receive memory operates at a frequency of 140 MHz making a total of 36
- the memory may be a dual ported RAM or a device with similar capabilities. This memory is sufficient to handle the case of all 24 ports running at 1Gbps and SPI 4.2 running at full speed.
- the data are written into the memory manager by Receive Write Memory Manager (RxWrMemMgr) that generally functions as follows:
- Drop registers: RxWrMemMgr uses the drop registers to decide on packet drops. When the number of buffers used per queue exceeds a certain threshold, packets are dropped with a fixed probability. The threshold and the probability are programmed in the four WRED registers associated with each queue. A drop is achieved by reading packets from the RxMacFifo but not writing them into the memory.
- RxWrMemMgr employs the following basic data structure: a 480 entry buffer list pointing to the start of each of the 480 1 Kbyte buffers
- a read pointer pointing to read buffers for the entry list (rx_port_buffers_rd_ptr).
- a set of four Drop registers per port for setting thresholds for the WRED-like function. The registers contain the threshold for the number of buffers used by the port and the probability associated with dropping a packet for that particular threshold.
- a pop function that looks at the address of free buffer(s), free_list, sends that information to the requesting port and returns a pointer to a free buffer to the requesting port.
- a read scheduler (arbiter - arb) that returns the next port to be read from: function next_port(input req[0:23])
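The read scheduler can be pictured as a round robin arbiter over the 24 request lines. The Python sketch below keeps the `next_port` name and the request vector from the text, but the extra `last_served` argument and the `None` return value for "no requests" are assumptions added so that the example is self-contained.

```python
def next_port(req, last_served):
    """Return the next requesting port after last_served, or None if none request.

    req: sequence of booleans, one per port (24 entries for a 24 port device).
    last_served: index of the port that was read most recently (assumed input).
    """
    n = len(req)
    for offset in range(1, n + 1):
        port = (last_served + offset) % n
        if req[port]:
            return port
    return None
```

Starting the search one position past the last served port is what makes the scan equitable: a busy port cannot be picked twice in a row while another port is waiting.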
- the Receive Read Memory Manager is responsible for de-queueing data from the 48 (24 high priority and 24 low priority) queues and it operates at 155 MHz system clock frequency. Ports are serviced in a round robin fashion, however, within a port, high and low priority queues are serviced using commercially available MDRR 46 (Modified Deficit Round Robin) based approach.
- MDRR 46 Modified Deficit Round Robin
- the MDRR 46 approach provides fairness among the high and low priority queues and avoids starvation of the low priority queues. Complete Ethernet frames are read out from each queue alternately until the associated credit register reaches zero or goes negative.
- the MDRR 46 approach assigns queue 1 of the group as the low latency, high priority (LLHP) queue for special traffic such as voice. This is the highest priority Layer 2 CoS queue. The LLHP queue is always serviced first and then queue 0 is serviced.
- a configurable credit window 78 and credit counter 80, shown in Fig. 8, are added for each of the high and low priority queues. The credit window 78 sets the maximum bound for dequeueing for the port.
- the credit counter 80 represents the number of 16-byte transfers available for the queue for the current round.
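One service round of this credit scheme can be sketched as follows. The Python example assumes that each round tops up the credit counter by the credit window, capped at the window, and that a frame of any length costs its size rounded up to whole 16-byte transfers; those refill details and the name `service_port` are assumptions, since the text gives only the roles of the window and the counter.

```python
import math
from collections import deque

TRANSFER_BYTES = 16  # the credit counter counts 16-byte transfers


def service_port(high_q, low_q, credits, window):
    """Run one round for a port: high priority queue first, then low priority.

    high_q, low_q: deques of frame lengths in bytes.
    credits, window: dicts keyed by "high" and "low" (credits is updated in place).
    Returns the list of (queue_name, frame_length) that were read out.
    """
    sent = []
    for name, queue in (("high", high_q), ("low", low_q)):
        # Assumed refill rule: add the window, never exceeding the window itself.
        credits[name] = min(credits[name] + window[name], window[name])
        # Read complete frames until the credit reaches zero or goes negative.
        while queue and credits[name] > 0:
            frame_len = queue.popleft()
            credits[name] -= math.ceil(frame_len / TRANSFER_BYTES)
            sent.append((name, frame_len))
    return sent
```

Because a long frame may drive the counter negative, the queue that sent it sits out later rounds until its credit recovers, which is how the scheme keeps either queue from starving the other.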
- the TxWrMemMgr employs the following basic data structure: A 240 entry free list buffer pointing to the start of each of the 240 1 Kbyte buffers (tx_free_list). One 32 entry allocation list per port (tx_port_ql).
- the support of cascading devices, shown in Fig. 9, further permits the customer to adjust the system cost based on the level of performance that they wish to provide.
- oversubscription is used here to describe a situation where the total ingress data bandwidth is greater than the system-side interface bandwidth. In this situation determining which data to discard (once the buffers of the device are full), is crucial to the quality of the connectivity.
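Under this definition the oversubscription ratio is simply the total ingress bandwidth divided by the system-side bandwidth. The Python sketch below illustrates that arithmetic; the function name and the example bandwidth figures used in the note that follows are hypothetical and are not taken from the patent.

```python
from fractions import Fraction


def oversubscription_ratio(ingress_port_rates_gbps, system_side_gbps):
    """Total ingress bandwidth divided by the system-side interface bandwidth."""
    total_ingress = sum(Fraction(rate) for rate in ingress_port_rates_gbps)
    return total_ingress / Fraction(system_side_gbps)
```

For instance, 24 ports at 1 Gbps feeding a system-side interface assumed to carry 6 Gbps would give a ratio of 4, i.e. 4:1; a ratio of 1 or less means the system side can carry all ingress traffic and nothing needs to be discarded.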
- the device as currently designed, supports the following levels of oversubscription:
- the Primary System-Side Interface can receive traffic from: a) Port #1 Line-Side Interface b) Port #2 Line-Side Interface c) Secondary System-Side Interface.
- the Primary System-Side Interface can transmit traffic to: a) Port #1 Line-Side Interface b) Port #2 Line-Side Interface c) Secondary System-Side Interface.
- the enhanced functionality of the Primary System-Side Interface allows it to very precisely manage the traffic from three different streams: a) Port #1 b) Port #2 c) Secondary System-Side Interface. This allows it to concatenate the various sources onto a single system-side stream while maintaining an MDRR-style of traffic control. This level of control is not needed by all users.
- the number of cascaded devices can be increased until the additional circuitry required to intermingle the streams at the topmost Primary System-Side interface cannot be accommodated in the device.
- a second possible limiting factor is the number of channels that can be carried on the system-side interface.
- the current implementation can be used to extend beyond the 4:1 oversubscription limitation, as shown in Fig. 13.
- the invention can be configured so that it will simply pass any traffic that does not terminate on the device onward to subsequent devices.
- a simplified view of how this is implemented is shown in Fig. 14.
- the rate limiting functionality is especially useful for this expanded operation. Using this less precise method of intermingling the traffic from the different ports, the oversubscription ratio can be substantially increased, with the upper bound now limited by the accumulated latency the ingress traffic will encounter, due to the repeated buffering in the intermediate devices.
- the current design can be expanded to provide many cascaded devices, as shown in Fig. 13 and Fig. 14.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Computer Security & Cryptography (AREA)
- Stereophonic System (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention relates to a method for increasing oversubscription beyond what is possible with a single device by using several cascaded devices. With this arrangement, the number of cascaded devices can be increased until the additional circuitry needed to intermingle the streams at the interface can no longer be accommodated in the device.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US72839705P | 2005-10-18 | 2005-10-18 | |
US60/728,397 | 2005-10-18 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2007047866A2 (fr) | 2007-04-26 |
WO2007047866A3 (fr) | 2007-07-26 |
Family
ID=37963290
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2006/040928 WO2007047866A2 (fr) | Use of cascaded devices to increase the oversubscription ratio | 2005-10-18 | 2006-10-18 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2007047866A2 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10764201B2 (en) | 2017-11-28 | 2020-09-01 | Dornerworks, Ltd. | System and method for scheduling communications |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6169788B1 (en) * | 1996-03-29 | 2001-01-02 | Cisco Technology, Inc. | Communication server apparatus having distributed switching and method |
US20060187937A1 (en) * | 2005-02-19 | 2006-08-24 | Cisco Technology, Inc. | Techniques for oversubscribing edge nodes for virtual private networks |
-
2006
- 2006-10-18 WO PCT/US2006/040928 patent/WO2007047866A2/fr active Application Filing
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 06817179; Country of ref document: EP; Kind code of ref document: A2 |