CROSS-REFERENCES TO RELATED APPLICATIONS
The present application claims benefit under 35 USC 119(e) of U.S. provisional application No. 61/261,115, filed Nov. 13, 2009, entitled “Network Traffic Optimization,” the content of which is incorporated herein by reference in its entirety.
The present application is related to and incorporates by reference application Ser. No. 10/877,853, filed Jun. 25, 2004, the content of which is incorporated herein by reference in its entirety.
BACKGROUND OF THE INVENTION
Conventionally, management of networked computer systems in organizations is divided among a number of groups such as networking, storage, systems, and possibly groups in charge of maintaining regulatory compliance. Enterprise applications require resources from each such functional area; a failure in any of these areas can have a significant impact on the business. The strategy of splitting the management responsibilities by functional areas has worked so far because the functional areas have traditionally been loosely coupled and the data center environments have been relatively static.
The trend towards convergence of computing, storage and networking in order to create a more dynamic and efficient infrastructure makes these functions dependent on each other. For example, server virtualization means that a small change made by the systems group may have a major effect on the network bandwidth. The increasing demand for bandwidth by networked storage accounts for a significant proportion of the overall network bandwidth, thereby making the network vulnerable to changes made by the storage group. In order to maintain the services in a converged environment, the complex relationships between various network elements need to be managed properly.
FIG. 1 shows a network communication system 100 that includes a multitude of switches configured to connect a multitude of hosts to each other and to the Internet. Four exemplary hosts 10₁, 10₂, 10₃, 10₄ (alternatively and collectively referred to as host 10) are shown as being in communication with the Internet via switches 22₁, 22₂, 22₃, 22₄ (alternatively and collectively referred to as switch 22), switches 24₁, 24₂ (alternatively and collectively referred to as switch 24), and switches 26₁, 26₂ (alternatively and collectively referred to as switch 26). Network communication system 100 is controlled, in part, by network equipment group 30, storage group 35, server group 40, and regulatory compliance group 45. Each such group monitors its own resources and uses its own management tools and thus has very limited visibility into the other components of the data center.
FIGS. 2A and 2B show the challenge faced in managing a networked system using a conventional technique. FIG. 2A shows a network communication system that includes a multitude of servers 110₁, 110₂, 110₃ as well as a multitude of switches collectively identified using reference number 120. Each server 110ᵢ is shown as having one or more associated virtual machines (VM) 115ᵢ. For example, server 110₁ is shown as having associated VMs 115₁₁ and 115₁₂; server 110₂ is shown as having associated VMs 115₂₁, 115₂₂, and 115₂₃; and server 110₃ is shown as having associated VM 115₃₁. Assume that a system manager decides to move virtual machine 115₂₃ from server 110₂ to server 110₁ (shown as VM 115₁₃ in FIG. 2B following the move). The system management tools show that there is enough capacity on the destination server 110₁, thus suggesting that the move would be safe. However, the move can cause the storage traffic, which had previously been confined to a single switch, to congest links across the data center, causing system wide performance problems. The conventional siloed approach in which different teams manage the network, storage and servers has a number of shortcomings.
BRIEF SUMMARY OF THE INVENTION
A method of optimizing network traffic, in accordance with one embodiment of the present invention, includes, in part, measuring amounts of traffic exchange between each of a multitude of hosts disposed in the network, identifying a network domain to which each of the multitude of hosts is connected, calculating a net increase or decrease in inter-domain traffic associated with moving each of the multitude of hosts among the network domains in order to generate a list, and ranking the list of moves by net saving in the inter-domain traffic.
In one embodiment, the highest ranked move is automatically applied so as to change the network domain to which the host associated with the highest ranked move is connected. In one embodiment, the hosts are virtual machines. In one embodiment, a change in the inter-domain traffic as a result of moving a first host in accordance with the list occurs only if one or more conditions are met. In one embodiment, at least one of the conditions is defined by availability of a resource of a second host connected to the domain to which the first host is to be moved. In one embodiment, such a resource is the CPU resource of the second host. In one embodiment, at least one of the conditions defines a threshold that is to be exceeded before the first host is moved. In one embodiment, the network domain is a switch.
A computer readable medium, in accordance with one embodiment of the present invention, includes instructions that when executed by one or more processors cause the one or more processors to optimize network traffic. To achieve this, the instructions further cause the processor(s) to measure amounts of traffic exchange between each of the multitude of hosts disposed in the network and in which the processor(s) is (are) disposed, identify a network domain to which each of the multitude of hosts is connected, calculate a net increase or decrease in inter-domain traffic associated with moving each of the multitude of hosts among the multitude of domains to generate a list, and rank the list of moves by net saving in the inter-domain traffic.
In one embodiment, the instructions further cause the highest ranked move to automatically occur so as to change the network domain to which the host associated with the highest ranked move is connected. In one embodiment, the hosts are virtual machines. In one embodiment, the instructions further cause the processor(s) to cause a change in inter-domain traffic by moving a first host in accordance with the list only if one or more conditions are met. In one embodiment, at least one of the conditions is defined by availability of a resource associated with a second host connected to a network domain to which the first host is to be moved. In one embodiment, such a resource is the CPU resource of the second host. In one embodiment, at least one of the conditions defines a threshold that is to be exceeded before the first host is moved. In one embodiment, the network domain is a switch.
A system adapted to optimize network traffic, in accordance with one embodiment of the present invention, includes, in part, a module operative to measure amounts of traffic exchange between each of the multitude of the hosts disposed in the network, a module operative to identify a network domain to which each of the multitude of hosts is connected, a module operative to calculate a net increase or decrease in inter-domain traffic associated with moving each of the multitude of hosts among a multitude of network domains to generate a list, and a module operative to rank the list of moves by net saving in the inter-domain traffic.
In one embodiment, the system further includes a module operative to automatically apply the highest ranked move so as to change the network domain to which the host associated with the highest ranked move is connected. In one embodiment, the hosts are virtual machines. In one embodiment, the system further includes a module that causes a change in inter-domain traffic by moving a first host in accordance with the list only if one or more conditions are met. In one embodiment, at least one of the conditions is defined by availability of a resource disposed in a second host connected to a network domain to which the first host is to be moved. In one embodiment, the resource is a CPU resource of the second host. In one embodiment, at least one of the conditions defines a threshold to be exceeded prior to moving the first host. In one embodiment, the network domain is a switch.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows a network communication system that includes a multitude of switches configured to connect a multitude of hosts to each other and to the Internet.
FIG. 2A shows a network communication system that includes a multitude of hosts and switches.
FIG. 2B shows the network communication system of FIG. 2A after one of its virtual machines has been moved from one host to another host.
FIG. 3 shows a Symmetric Multi-Processing architecture having four CPUs and a shared memory.
FIG. 4 shows the processors and memories forming a Non-Uniform Memory Access architecture.
FIG. 5 shows the association between a number of virtual machines and a pair of nodes hosting the virtual machines.
FIG. 6 shows a number of modules of a network optimization system, in accordance with one exemplary embodiment of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
Convergence and interdependence between the resources in a data center require a cross functional approach to management in order to ensure successful operation. To achieve greater scalability, shared visibility into all elements of a data center, and an integrated management strategy, in accordance with one aspect of the present invention, all components in a data center are monitored by a single traffic monitoring system. Data center wide visibility is critical to ensuring that each group is aware of the impact of its actions on shared resources and to providing the information needed to enhance the control of the data center.
Current trends toward Virtualization, Converged Enhanced Ethernet (CEE), Fibre Channel over Ethernet (FCoE), Service Oriented Architectures (SOA) and Cloud Computing are part of a broader re-architecture of the data centers in which enterprise applications are decomposed into simpler elements that can be deployed, moved, replicated and connected using high-speed switched Ethernet.
An integrated approach to management is needed if the full benefits of a converged data center are to be realized. Network-wide visibility into the storage, network and services running in the data center, their traffic volumes, and their dependencies is a critical component of an integrated management strategy. In order to achieve data center wide visibility, every layer of the data center network, including the core, distribution, top of rack and blade server switches, is taken into account, as described further below in accordance with various embodiments of the present invention.
FIG. 3 shows a Symmetric Multi-Processing (SMP) architecture 300 having four CPUs 302, 304, 306, and 308 sharing a memory 310. FIG. 4 shows a Non-Uniform Memory Access (NUMA) architecture 400 that includes four processors 402, 404, 406, 408 each having four CPUs and a memory. The processors are connected to one another via a high speed bus 410. As the number of processor cores increases, system architectures have moved from SMP to NUMA. SMP systems are limited in scalability by contention for access to the shared memory. In a NUMA system, memory is divided among groups of CPUs, increasing the bandwidth and reducing latency of access to memory within a module at the cost of increased latency for non-local memory access. Intel Xeon® (Nehalem) and AMD Opteron® (Magny-Cours) based servers provide commodity examples of the NUMA architecture.
System software running on a NUMA architecture needs to be aware of the processor topology in order to properly allocate memory and processes to maximize performance. Since NUMA based servers are widely deployed, most server operating systems are NUMA aware and take location into account when scheduling tasks and allocating memory. Virtualization platforms also need to be location aware when allocating resources to virtual machines on NUMA systems.
Ethernet networks share similar NUMA-like properties. Sending data over a short transmission path offers lower latency and higher bandwidth than sending the data over a longer transmission path. While bandwidth within an Ethernet switch is high (multi-terabit capacity backplanes are not uncommon), the bandwidth of Ethernet links connecting switches is only 1 Gbit/s or 10 Gbit/s (with 40 Gbit/s and 100 Gbit/s under development). Shortest path bridging (see IEEE 802.1aq and TRILL) further increases the amount of bandwidth, and reduces the latency of communication, between systems that are close.
In accordance with embodiments of the present invention, the traffic matrix representing the amount of traffic between each pair of hosts on the network is used to optimize network traffic. Network traffic optimization, in accordance with embodiments of the present invention, may be used to automate migration of servers in order to minimize inter-domain traffic. It is understood that a network domain refers to the same branch in the network hierarchy that is shared by the same hosts. Likewise, inter-domain traffic refers to traffic between hosts positioned along different branches of the network. Traffic between different branches of a network is facilitated by network traffic equipment such as a switch, a router, and the like. Combining data from multiple locations to generate an end-to-end traffic matrix is described in application Ser. No. 10/877,853, filed Jun. 25, 2004, the content of which is incorporated herein by reference in its entirety. The following embodiments of the present invention are described with respect to the sFlow® standard, a leading, multi-vendor standard for monitoring high-speed switched and routed networks. It is understood that embodiments of the present invention are equally applicable to any other network monitoring technology, sFlow® or otherwise. A detailed description of the sFlow® technology is provided, for example, at http://www.inmon.com/technology/index.php and http://sflow.org/. Moreover, although the following description is provided with reference to network switches, it is understood that any network device, whether implemented in hardware, software or a combination thereof, that facilitates inter-domain and intra-domain traffic may be used and falls within the scope of embodiments of the present invention.
The sFlow® measurement technology, built into computers and network equipment from a number of leading vendors, such as HP®, IBM®, Dell®, Brocade®, BLADE®, Juniper®, Force10® and 3Com®, ensures data center wide visibility of all resources, including switches, storage servers, blade servers and virtual servers. As networks, systems and storage converge, the visibility provided by the sFlow® in the network provides an increasingly fuller picture of all aspects of the data center operations, thus enabling effective management and control of the network resources and delivering the converged visibility needed to manage the converged data center.
Unlike other monitoring technologies, the sFlow® provides an integrated, end-to-end view of the network performance. This integration substantially increases the value of information by making it actionable. For example, identifying that an application is running slowly isn't enough to solve a performance problem. However, if it is also known that the server hosting the application is seeing poor disk performance, and if one can link the disk performance to a slow NFS server, identify the other clients of the NFS server, and finally determine that all the requests are competing for access to a single file, then the decision to take action can be much more informed. It is this ability to link data together, combined with the scalability to monitor every resource in the data center, that the sFlow® advantageously provides.
The sFlow® standard includes physical and virtual server performance metrics. The sFlow® specification describes a coherent framework that builds on the sFlow® metrics exported by most switch vendors, thus linking network, server and application performance monitoring to provide an integrated picture of the network performance.
If two hosts are connected to the same switch, the switch backplane provides enough bandwidth so that traffic between the hosts does not compete with other traffic on the network. If the two hosts are on different switches, then the links between the switches are shared and generally oversubscribed. The capacity of the links between switches is often an order of magnitude less than the bandwidth available within the switch itself.
A traffic matrix, which describes the amount of traffic between each pair of hosts (alternatively referred to herein as servers) on the network, can be formed using the sFlow® standard. For example, assume that a network has four hosts A, B, C and D. The traffic between all pairs of hosts may be represented as a 4×4 table (a matrix), as shown below:
TABLE I
|        | To A | To B | To C | To D |
| From A | 0    | 1    | 2    | 2    |
| From B | 1    | 0    | 3    | 3    |
| From C | 2    | 1    | 0    | 2    |
| From D | 3    | 2    | 1    | 0    |
Assume further that information about the switch that each host is connected to is also known. A number of techniques exist for locating hosts. One such technique is described in application Ser. No. 10/877,853, filed Jun. 25, 2004, the content of which is incorporated herein by reference in its entirety. Assume that the following location information is available for the example shown in Table I:
TABLE II
| Host | Switch |
| A    | SW1    |
| B    | SW1    |
| C    | SW2    |
| D    | SW2    |
In accordance with one embodiment of the present invention, network configuration changes, such as moving a host from one switch to another, are identified and used so as to minimize the amount of traffic between switches and increase the amount of traffic within switches. To achieve this, first, the total amount of traffic to or from each host is calculated. Continuing with the example above, the following shows the total amount of traffic to or from each host:
Total A=sum(row A)+sum(column A)=11
Total B=sum(row B)+sum(column B)=11
Total C=sum(row C)+sum(column C)=11
Total D=sum(row D)+sum(column D)=13
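The totals above can be reproduced with a short script. The following is a minimal sketch, assuming the traffic matrix of Table I is held as a nested array; the names `hosts`, `traffic`, and `totalTraffic` are illustrative only and are not part of the described embodiment.

```javascript
// Table I as a nested array: traffic[i][j] is the traffic from hosts[i] to hosts[j].
const hosts = ["A", "B", "C", "D"];
const traffic = [
  [0, 1, 2, 2],
  [1, 0, 3, 3],
  [2, 1, 0, 2],
  [3, 2, 1, 0],
];

// Total traffic to or from host i: sum of row i plus sum of column i.
// The diagonal is zero in Table I, so adding matrix[i][i] twice has no effect here.
function totalTraffic(matrix, i) {
  let total = 0;
  for (let j = 0; j < matrix.length; j++) {
    total += matrix[i][j] + matrix[j][i];
  }
  return total;
}

const totals = {};
hosts.forEach((host, i) => {
  totals[host] = totalTraffic(traffic, i);
});

console.log(totals); // { A: 11, B: 11, C: 11, D: 13 }
```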
The location data is subsequently used to calculate the amount of traffic that each host exchanges with the hosts on each of the switches. Table III below shows the amount of traffic each host in Table I exchanges with the hosts on each of the switches:
TABLE III
| Host | SW1 | SW2 |
| A    | 2   | 9   |
| B    | 2   | 9   |
| C    | 8   | 3   |
| D    | 10  | 3   |
The net effect of moving hosts between switches may thus be calculated. As seen from Table III, host A is shown as exchanging 2 units of traffic with the hosts on SW1, and 9 units of traffic with the hosts on SW2. Moving host A from SW1 to SW2 would thus result in a net reduction of inter-switch traffic of 7 (9−2) since traffic exchanged with hosts C and D would now be local (9) and traffic exchanged with host B would now be non-local (2). Accordingly, the net increase or decrease in inter-switch traffic can be calculated for each possible move and the results can be sorted by net-saving to produce a list of recommended moves. The net effect of moving the hosts between switches for the above example is shown in Table IV below.
TABLE IV
| Host | Current Switch | Proposed Switch | Net Savings |
| A    | SW1            | SW2             | 7           |
| B    | SW1            | SW2             | 7           |
| D    | SW2            | SW1             | 7           |
| C    | SW2            | SW1             | 5           |
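The calculation behind the net savings listed in Table IV can be sketched as follows. This is an illustrative sketch only, assuming the example data of Tables I and II; the function names `exchangedBySwitch` and `rankMoves` are hypothetical, and hosts with equal savings are simply kept in the order in which they are listed.

```javascript
// Example data from Tables I and II.
const hosts = ["A", "B", "C", "D"];
const traffic = [
  [0, 1, 2, 2],
  [1, 0, 3, 3],
  [2, 1, 0, 2],
  [3, 2, 1, 0],
];
const location = { A: "SW1", B: "SW1", C: "SW2", D: "SW2" };

// For host i, sum the traffic exchanged (both directions) with the hosts on each switch.
function exchangedBySwitch(i) {
  const bySwitch = {};
  for (let j = 0; j < hosts.length; j++) {
    if (j === i) continue;
    const sw = location[hosts[j]];
    bySwitch[sw] = (bySwitch[sw] || 0) + traffic[i][j] + traffic[j][i];
  }
  return bySwitch;
}

// Net saving of a move = traffic that becomes local minus traffic that becomes non-local.
function rankMoves() {
  const moves = [];
  hosts.forEach((host, i) => {
    const bySwitch = exchangedBySwitch(i);
    const current = location[host];
    const local = bySwitch[current] || 0;
    for (const sw in bySwitch) {
      if (sw === current) continue;
      const saving = bySwitch[sw] - local;
      if (saving > 0) moves.push({ host, from: current, to: sw, saving });
    }
  });
  // Array.prototype.sort is stable in ES2019 and later, so ties keep insertion order.
  return moves.sort((a, b) => b.saving - a.saving);
}

const moves = rankMoves();
console.log(moves); // hosts A, B, D with saving 7, then C with saving 5
```

For host A, `exchangedBySwitch` yields 2 units for SW1 and 9 units for SW2, so the proposed move to SW2 saves 9 − 2 = 7 units of inter-switch traffic.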
While physically reconfiguring and relocating servers is a difficult process that would only be carried out if there were compelling reasons, server virtualization makes this process far simpler. The advent of virtual servers allows server software to migrate between physical servers. Since the traffic that a server generates is a function of the software, moving a virtual server will also move its traffic. Popular virtualization software such as VMware and Xen both provide the ability to easily move virtual machines from one physical server to another.
Virtualization and the need to support virtual machine mobility (e.g. vMotion, XenMotion, Xen Live Migration, associated with VMware and Citrix XenServer products) are driving the adoption of large, flat, high-speed, layer-2, switched Ethernet fabrics in data centers. A layer-2 fabric allows a virtual machine to keep its IP address and maintain network connections even after the virtual machine is moved (performing a “live” migration). However, while a layer-2 fabric provides transparent connectivity that allows virtual machines to move, the performance of the virtual machine is highly dependent on its communication patterns and location.
As servers are pooled into large clusters, virtual machines may easily be moved, not just between NUMA nodes within a server, but between servers within the cluster. For optimal performance, the cluster management software needs to be aware of the network topology and workloads in order to place each VM in the optimal location. The inclusion of the sFlow® standard in network switches and virtualization platforms provides the visibility into each virtual machine's current workload and dependencies, including tracking the virtual machine as it migrates across the data center.
FIG. 5 shows the association between virtual machines 580, 582, and 584 and a pair of NUMA nodes 500 and 550. Each NUMA node is shown as having four CPUs and a memory. For example, NUMA node 500 is shown as having CPUs 502, 504, 506, 508, and memory 510. Each virtual machine is shown as having a pair of virtual CPUs, namely VCPU0 and VCPU1. VCPU0 and VCPU1 of virtual machine 580 are shown as being respectively associated with CPUs 502 and 504 of node 500. VCPU0 of virtual machine 582 is shown as being associated with CPU 504 of node 500, whereas VCPU1 of virtual machine 582 is associated with CPU 552 of node 550. Likewise, VCPU0 and VCPU1 of virtual machine 584 are shown as being respectively associated with CPUs 552 and 554 of node 550. As is well known, the virtual machines are connected to one or more virtual switches which are application software running on the nodes. Consequently, the communication bandwidth is higher for virtual machines that are on the same node and relatively lower for virtual machines that are on different nodes. For example, assume that virtual machines 580 and 584 exchange a substantial amount of traffic. Assume further that a network traffic optimization technique, in accordance with embodiments of the present invention, shows that node 500 currently hosting virtual machine 580 is close to full capacity, whereas node 550 hosting virtual machine 584 is identified as having spare capacity. Consequently, migrating virtual machine 580 to node 550 reduces network traffic, and reduces the latency of communication between virtual machines 580 and 584. In accordance with embodiments of the present invention, the association between virtual machines and the nodes may be varied to optimize network traffic.
A network traffic optimization technique, in accordance with embodiments of the present invention, thus provides substantial visibility which is key to controlling costs, improving efficiency, reducing power and optimizing performance in the data center.
Additional constraints may also be applied before causing a change in network traffic movement. For example, a move may be considered feasible if enough spare capacity exists on the destination host to accommodate the new virtual machine. Standard system performance metrics (CPU/memory/IO utilization) can be used to apply these constraints, thus allowing a move to occur only when the constraints are satisfied. Other constraints may also be applied in order to determine whether the conditions for a move are met. The following is code for generating tables II, III, and IV, using data associated with a network traffic matrix, in order to optimize the network traffic, in accordance with one exemplary embodiment of the present invention.
    String.prototype.startsWith = function(str) {return (this.match("^"+str)==str)}

    // start by finding the top communicating mac pairs
    var select = ['macsource,rate(bytes)',
                  'macdestination,rate(bytes)',
                  'macsource,macdestination,rate(bytes)'];
    // one filter per select expression; the opening of this array is missing
    // from the source listing and is reconstructed to match the three selects
    var where = ['isunicast=1',
                 'isunicast=1',
                 'isunicast=1'];
    var q = Query.topN('historytrmx',
                       select,
                       where,
                       'today',
                       'bytes',
                       1000);
    q.multiquery = true;

    addrs = {};   // total bytes sent or received per MAC address
    pairs = {};   // bytes exchanged between each pair of MAC addresses

    // add val to hash[key], creating the entry if needed
    function updateaddrs(hash,key,val) {
      if(!hash[key]) hash[key] = val;
      else hash[key] += val;
    }

    // add val to the nested entry hash[addr][key]
    function updatepairs(hash,addr,key,val) {
      var chash = hash[addr];
      if(!hash[addr]) {
        chash = {};
        hash[addr] = chash;
      }
      updateaddrs(chash,key,val);
    }

    // NOTE: the call that opens this list of per-row handlers is missing from
    // the source listing; q.run([...]) and the function(row) wrappers are an
    // assumed reconstruction, with one handler per select expression.
    q.run([
      function(row) {
        var addr = row[0];
        var bytes = row[1];
        if(addr) updateaddrs(addrs,addr,bytes);
      },
      function(row) {
        var addr = row[0];
        var bytes = row[1];
        if(addr) updateaddrs(addrs,addr,bytes);
      },
      function(row) {
        var src = row[0];
        var dst = row[1];
        var bytes = row[2];
        if(src && dst) {
          updatepairs(pairs,src,dst,bytes);
          updatepairs(pairs,dst,src,bytes);
        }
      }
    ]);

    // locate the mac addresses
    var addrarr = [];
    for (var addr in addrs) addrarr.push(addr);
    var n = Network.current();
    var locations = n.locationMap(addrarr);
    var agentmap = {};
    var portmap = {};
    for (var i = 0; i < addrarr.length; i++) {
      var loc = locations[i];
      var addr = addrarr[i];
      if(loc) {
        n.path = loc;
        agentmap[addr] = n.agentIP();
        portmap[addr] = loc;
      }
    }

    var result = Table.create(
      ["MAC","From Zone","From Group","From Port","%Local","To Zone","To Group","To Agent","%Local","Bits/sec. Saved"],
      ["address","string","string","interface","double","string","string","agent","double","integer"]);

    // find moves that reduce interswitch traffic
    for(var addr in pairs) {
      var bytes = addrs[addr];
      var sagent = agentmap[addr];
      if(sagent) {
        // total the traffic this address exchanges with each switch (agent)
        var siblings = pairs[addr];
        var dagents = {};
        for(var sib in siblings) {
          var dagent = agentmap[sib];
          if(dagent) {
            var sbytes = siblings[sib];
            updateaddrs(dagents,dagent,sbytes);
          }
        }
        for(var dagent in dagents) {
          var dbytes = dagents[dagent];
          var sbytes = dagents[sagent] ? dagents[sagent] : 0;
          if(sagent != dagent) {
            // traffic that becomes local minus traffic that becomes non-local
            var netsaving = dbytes - sbytes;
            //&& addr.startsWith('005056')
            if(netsaving > 0 ) {
              n.path = sagent;
              var szone = n.zone();
              var sgroup = n.group();
              n.path = dagent;
              var dzone = n.zone();
              var dgroup = n.group();
              result.addRow([addr,
                             szone,
                             sgroup,
                             portmap[addr],
                             100*sbytes/bytes,
                             dzone,
                             dgroup,
                             dagent,
                             100*dbytes/bytes,
                             netsaving]);
            }
          }
        }
      }
    }
    result.sort(9,true);

    // splice in vendor codes
    result.insertColumn("MAC Vendor","string",n.vendorMap(result.column(0)),1);
    result.scaleColumn(10,8);

    // prune the table, 1 move suggestion per mac, limit rows to truncate value
    // (truncate is not defined in this listing and is expected to be supplied externally)
    var suggested = {};
    var truncated = Table.create(result.cnames,result.ctypes);
    for(var r = 0; r < result.nrows; r++) {
      var mac = result.cell(r,0);
      if(!suggested[mac]) {
        truncated.addRow(result.row(r));
        suggested[mac] = true;
        if(truncated.nrows > truncate) break;
      }
    }
    //result.nrows = Math.min(result.nrows,truncate);
    Report.current().table(truncated);

The embodiments of the present invention apply to any data network and at any level of hierarchy and abstraction. For example, a network may be formed by connecting (i) the CPUs within a server, (ii) a multitude of servers, (iii) a multitude of data centers, and the like. At any level of network, it is desired to keep traffic local. Accordingly, embodiments of the present invention may be applied to a traffic matrix at any level of network abstraction to optimize network traffic.
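As one illustration of applying the same analysis at a coarser level, a host-level traffic matrix can be collapsed into a domain-level matrix, which may then be treated as a traffic matrix in its own right at the next level of the hierarchy. The sketch below assumes the example data of Tables I and II; `aggregateByDomain` and `domainOf` are hypothetical names introduced only for this illustration.

```javascript
// Example data from Tables I and II.
const hosts = ["A", "B", "C", "D"];
const traffic = [
  [0, 1, 2, 2],
  [1, 0, 3, 3],
  [2, 1, 0, 2],
  [3, 2, 1, 0],
];
const domainOf = { A: "SW1", B: "SW1", C: "SW2", D: "SW2" };

// Collapse the host-level matrix: result[from][to] is the traffic sent
// from hosts in domain `from` to hosts in domain `to`.
function aggregateByDomain(hosts, traffic, domainOf) {
  const result = {};
  hosts.forEach((src, i) => {
    hosts.forEach((dst, j) => {
      const from = domainOf[src];
      const to = domainOf[dst];
      if (!result[from]) result[from] = {};
      result[from][to] = (result[from][to] || 0) + traffic[i][j];
    });
  });
  return result;
}

const domainMatrix = aggregateByDomain(hosts, traffic, domainOf);
// Inter-domain traffic is everything off the diagonal of the domain matrix.
const interDomain = domainMatrix.SW1.SW2 + domainMatrix.SW2.SW1;

console.log(domainMatrix); // { SW1: { SW1: 2, SW2: 10 }, SW2: { SW1: 8, SW2: 3 } }
console.log(interDomain);  // 18
```

With the example data, 18 of the 23 units of traffic cross between the two switches, which is the quantity the ranked moves aim to reduce.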
FIG. 6 shows a network traffic optimization system 600 in accordance with one embodiment of the present invention. System 600 is shown as including, in part, an identification module 602, a calculating module 604, a ranking module 606, and a measurement module 608. Identification module 602 is adapted to identify the network domain to which each of the hosts of interest is connected. Calculating module 604 is adapted to calculate the net increase or decrease in inter-domain traffic associated with moving the hosts among the network domains. Measurement module 608 measures the amount of traffic exchange among the hosts. Ranking module 606 is adapted to rank the list of moves by net saving in the inter-domain traffic. In one embodiment, the hosts are virtual machines. An apply module (not shown) is adapted to automatically apply the highest ranked move so as to change the network domain to which the host associated with the highest ranked move is connected. A change module (not shown) is adapted to cause a change in inter-domain traffic by moving a first host in accordance with the list if one or more conditions are met. In an embodiment, at least one of the one or more conditions is defined by availability of a resource associated with a second host that is connected to the network domain to which the first host is to be moved. The resource can be a CPU resource of the second host. In another embodiment, at least one of the one or more conditions defines a threshold that is to be exceeded prior to moving the first host. It is understood that modules 602, 604, 606, 608, and 610 may be software modules, hardware modules, or a combination of software and hardware modules. In one embodiment, each module may have a blade or a box-like module housing in which one or more processing units are arranged. The one or more processing units can be arranged in an SMP architecture (FIG. 3) or a NUMA architecture (FIG. 4). In one embodiment, each network domain includes a switch.
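The conditional application of a ranked move can be sketched as follows. This is a minimal sketch under stated assumptions: the ranked list is that of Table IV, the threshold is interpreted here as a minimum net saving, and the values in `spareCpu`, `cpuDemand`, and `minSaving` are hypothetical placeholders chosen only for illustration. The name `firstFeasibleMove` is likewise hypothetical and is not part of system 600; selecting the first move that passes the conditions is one possible policy.

```javascript
// Ranked moves from Table IV (highest net saving first).
const rankedMoves = [
  { host: "A", from: "SW1", to: "SW2", saving: 7 },
  { host: "B", from: "SW1", to: "SW2", saving: 7 },
  { host: "D", from: "SW2", to: "SW1", saving: 7 },
  { host: "C", from: "SW2", to: "SW1", saving: 5 },
];

// Hypothetical placeholder values, not taken from the source:
// spare CPU fraction on the server attached to each switch,
// CPU fraction each host would need, and a minimum net saving.
const spareCpu = { SW1: 0.5, SW2: 0.1 };
const cpuDemand = { A: 0.2, B: 0.2, C: 0.2, D: 0.2 };
const minSaving = 6;

// Return the highest ranked move whose saving exceeds the threshold and
// whose destination has enough spare CPU, or null if none qualifies.
function firstFeasibleMove(moves, spareCpu, cpuDemand, minSaving) {
  for (const move of moves) {
    const exceedsThreshold = move.saving > minSaving;
    const hasCapacity = spareCpu[move.to] >= cpuDemand[move.host];
    if (exceedsThreshold && hasCapacity) return move;
  }
  return null;
}

const chosen = firstFeasibleMove(rankedMoves, spareCpu, cpuDemand, minSaving);
console.log(chosen); // { host: 'D', from: 'SW2', to: 'SW1', saving: 7 }
```

With these placeholder values, the moves of hosts A and B are skipped because the destination attached to SW2 lacks spare CPU, so the move of host D to SW1 is selected.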
The above embodiments of the present invention are illustrative and not limitative. Various alternatives and equivalents are possible. Other additions, subtractions or modifications are obvious in view of the present invention and are intended to fall within the scope of the appended claims.