A Design Study of Alternative Network Topologies for the Beowulf Parallel Workstation



next up previous
Next: Conclusions Up: A Design Study Previous: Parallel Disk I/O

Discussion

 

 

 


: Sensitivity of Throughput to Message Demand

The experiments described in the previous section exhibit a varied set of behaviors, demonstrating the performance characteristics of different network topologies on the Beowulf Parallel Workstation. These results are discussed in relation to the original Beowulf topology, designated as Bonded Dual Net in Figures 3 and 4. In this configuration, each of the Ethernets independently connects all 16 nodes and each node uses channel bonding to balance its communications between the two networks. Unlike previous studies [9,10,11], where it was clear that a channel bonded dual net configuration was always desirable, there are now situtations where one topology may be preferable to another depending on application requirements. It is also seen that with the alternative topologies, significant performance gains over the original Beowulf configuration can be achieved.

An immediate observation to be gained from the network throughput experiment (Figure 3) is that the alternate network topologies consistently outperform the two original interconnect schemes (Single Net and Bonded Dual Net) presented in [9,10]. As one would expect, the local-wire experiments achieve the highest throughput, peaking at 6.3 MB/s for an 8 KB token. This bests the original Beowulf Bonded Dual Net configuration by a factor of 3.7, which is close to the theoretical limit of 4. The roughly equivalent performance of the 1 and 2 KB tokens reflects IP packet fragmentation caused by the Ethernet 1536 byte maximum packet size [2]. While the same number of tokens are being exchanged for both cases, the 2 KB case actually generates twice as many Ethernet packets; each 2 KB token is broken up into two Ethernet packets, while each 1 KB token fits in one Etherent packet.

As expected, there is no meaningful difference between the Switched Mesh Local-Net and the Routed Mesh Local-Net data because as indicated in the previous section, both are operationally identical. In the case of remote-wire transactions, the software routed scheme demonstrates lesser performance than the switched scheme. This can be attributed to the overhead introduced as part of the critical path time for performing the packet routing in software and also the contention for processor resources between routing software and packet generating code on a routing node. For the switched network configurations, the switch backplane does not appear to be a bottleneck even though the switching units used were of relatively low cost. A benefit of the software routed network configurations is that they involve no additional equipment cost. Switch based network approaches increase the cost of Beowulf by about 5%.

The sensitivity of interconnect throughput with respect to message demand is shown in Figure 5 where the curves presented differ in the number of message being generated simultaneously. This is for the case of software routed remote packet transfers; every packet must travel through an intermediate node for routing. These results show that throughput increases with message demand as it does with packet size. As the traffic density increases, the througput grows but less than linearly. The first for curves from the bottom represent an operation in which no node performs more than one of three roles: token supplier, token router, or token consumer. But beyond that, some contention occurs as overlapping of some paths result in one or more nodes doing double duty. Beyond 1 KB packet size, tokens are fragmented into multiple packets. The first point at which this occurs yields no throughput gain because the overhead increases with the amount of information transferred and the packets are not fully filled producing reduced efficiency. But beyond this point, the majority of multiple packets are fully filled and significant performance increase is observed with token size. For the top three curves, the contention due to collisions permits poorer throughput increase and at some points negative throughput increase. This is most dramatically demonstrated at the higher token sizes were the smaller message demand performs at higher throughput levels.

Bonded Switched Mesh Local-Net throughputs are about half that of the non-channel bonded local transaction cases due to the packet duplication caused by switch or software node routing. The two data packet paths from nodes B9 to BD in Figure 2 illustrate this behavior. One path traverses only a local wire while the other path must traverse a switch in reaching the local node, effectively causing the local transaction to be a remote one. In the switched scheme, channel bonding causes successive packets to alternate paths to their destination, resulting in half of the traffic taking a longer route and incurring the latency of packet duplication. In light of the data, the channel bonded configuration is not desirable when most message traffic in an application will be between nodes sharing a local wire. While this indicates that special conditions must exist to attain the best network performance, this is a reasonable condition. Parallel programmers often tailor their codes to specific topologies on machines like the Intel Paragon [5] and Cray T3D [4]. This is not to say that it is desirable to force the programmer to contend with architectural details in writing his code, but rather to point out that while Beowulf does not relieve this problem, it does not impose any new impediments.

The Disk/Net Balance experiments were conducted to determine the operating balance between the disk and network I/O systems. A balanced machine should match the disk system's ability to generate data with the network's capacity to convey the traffic. As in the network throughput experiments, the local-wire disk transfers attain the highest sustained throughput, maintaining a rate of about 6 MB/s. However, the channel bonded network configurations exceed that performance for 1 remote file copy, achieving 7.2 MB/s, before degrading as more remote file copies are added. Channel bonding is able to improve throughput for light workloads by distributing traffic between two paths, but for heavier workloads, the increased latency inherent to remote-wire transactions eliminates the usefulness of this load-balancing method.

Again looking at the one remote file copy situation, the software routed network with remote-wire transactions exhibits the lowest throughput at just over 4 MB/s. This particular situation highlights the overhead associated with software routing. The equivalent switched network experiment achieves a throughput 1.7 MB/s higher, enabled by the faster routing of hardware switching. The performance benefit of switching can be seen by comparing the switched remote-net experiment to the software routed remote-net experiment. Up until 4 internode file copies, the switched configuration consistently outperforms the software routed configuration by about 1.5 MB/s. After 4 internode file copies, the attained throughput converges. This can be explained by the way the experiments were conducted. Up until 4 copies, no file transfer suffers from contention for access to any given wire. Starting with 5 file copies, contention is introduced. The situation is illustrated in Figure 1. The transfer between nodes B1 and B6 contends for the same wire as the transfer between B2 and B7.

From the results of the Network Throughtput and Disk/Net Balance experiments, it can be said that throughput is sensitive to 3 factors: message size, available aggregate bandwidth, traffic density/demand. Message size determines the degree of utilization of network resources. Small messages do not take full advantage of the Ethernet packet size, while larger ones cause packet fragmentation and increased traffic. Available aggregate bandwidth sets the upper bound for attainable throughput. Traffic density and demand determine to what degree the available bandwidth is utilized, in addition to establishing the amount of network contention, limiting the maximal exploitation of available bandwidth. Although the performance measurements do not measure one-to-many or many-to-one transactions, we expect that the essential performance attributes of the system would be preserved under such conditions. Specifically, the channel bonded configurations would excel under a light workload, while the single-channel configurations would perform better under greater traffic.



next up previous
Next: Conclusions Up: A Design Study Previous: Parallel Disk I/O



Chance Reschke
Mon Nov 4 13:04:09 EST 1996