TCP Incast (Berkeley COMPSCI 268)

Slide 1. This is the “other incast talk,” given on the last day of SIGCOMM 2009 as a response to the incast talk by our colleagues at CMU. It contains very recent findings beyond the write-up in the paper. We strongly encourage questions, criticism, and feedback. This is joint work with my co-authors at the Reliable, Adaptive, and Distributed Laboratory (RAD Lab) at UC Berkeley.

Slide 2. The bottom line of the story is that TCP incast is not solved. We will develop this story by answering several questions. We begin by looking at what incast is, why we care, and what has already been done. We continue with a discussion of some subtle methodology issues. There are some instinctive first fixes; they work, but they are limited. We will spend some time trying to discover the root causes of incast, which requires looking at the problem from several angles. Lastly, we outline a promising path towards a solution.

Slide 3. Incast is a TCP pathology that occurs in N-to-1 large data transfers, most visibly on one-hop topologies with a single bottleneck. The observed phenomenon is that the “goodput” seen by applications drops far below link capacity. We care about this problem because it can affect key datacenter applications with synchronization boundaries, meaning that the application cannot proceed until it has received data from all senders. Examples include distributed file systems, where block requests are satisfied only when all senders have finished transmitting their fragments of the block; MapReduce, where the shuffle step cannot complete until intermediate results from all nodes are fetched; and web search and similar query applications, where a query is not satisfied until responses from the distributed workers are assembled.

Slide 4. The problem is complex because it can be masked by application-level inefficiencies, leading to opinions that incast is not “real.” We encourage datacenter operators with different experiences to share their perspectives. From our own experience using MapReduce, we can get a three-fold performance improvement just by using better configuration parameters. So the experimental challenge is to remove application-level artifacts and isolate the observed bottleneck to the network. We believe that as applications improve, incast will become visible in more and more of them.

Slide 5. There is significant prior and concurrent work on incast by a group at CMU; we are in regular contact with them, exchanging results and ideas. Their FAST 2008 paper gave a first description of the problem and coined the term “incast.” The key findings there are that all popular TCP variants suffer from this problem, and that non-TCP workarounds exist but are problematic in real-life deployment scenarios. Their SIGCOMM 2009 paper, presented yesterday, suggested that reducing the TCP retransmission timeout (RTO) minimum is a first step, and that high-resolution timers for TCP also help. Our results are in agreement with theirs. However, we also have several points of departure that convince us the first fixes are limited.

Slide 6. So, what is a good methodology to study incast?

Slide 7. We need a simple workload to ensure that any performance degradation we see is not due to application inefficiencies. We use a straightforward N-to-1 block transfer workload. This is the same workload used in the CMU work, and it is motivated by data transfer behavior in file systems and MapReduce. What happens is this: a block request is sent; the receiver then starts receiving fragments from all senders; some time later it finishes receiving the individual fragments. The synchronization boundary arises because the second block request is sent only after the last fragment has been received. Thereafter the process repeats. A minimal receiver-side sketch appears below.
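To make the workload concrete, here is a minimal receiver-side sketch in Python. It is an illustration under stated assumptions, not the actual harness from this work: it presumes each sender runs a hypothetical fragment server on port 9000 that answers a one-byte request with a full fragment, and the hostnames, port, fragment size, and block count are all placeholders.

    import socket
    import threading
    import time

    # Minimal sketch of the synchronized N-to-1 block transfer workload.
    # Hypothetical setup: each sender runs a fragment server on PORT that
    # answers a one-byte request with FRAGMENT_BYTES of data.
    SENDERS = ["sender-%d" % i for i in range(1, 9)]  # placeholder hostnames
    PORT = 9000                                       # placeholder port
    FRAGMENT_BYTES = 256 * 1024  # fixed-fragment flavor (FAST 2008 style);
                                 # for the fixed-total flavor (SIGCOMM 2009
                                 # style) use TOTAL_BYTES // len(SENDERS)
    BLOCKS = 100

    def fetch_fragment(sock):
        """Request one fragment on a persistent connection and read all of it."""
        sock.sendall(b"R")  # the block request to this sender
        remaining = FRAGMENT_BYTES
        while remaining:
            chunk = sock.recv(min(65536, remaining))
            if not chunk:
                raise ConnectionError("sender closed connection early")
            remaining -= len(chunk)

    conns = [socket.create_connection((host, PORT)) for host in SENDERS]
    start = time.time()
    for _ in range(BLOCKS):
        # Synchronization boundary: the next block request is issued only
        # after the last fragment from every sender has been received.
        workers = [threading.Thread(target=fetch_fragment, args=(c,))
                   for c in conns]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
    elapsed = time.time() - start

    # Goodput as the application sees it: useful bytes over elapsed time.
    total_bits = 8.0 * BLOCKS * FRAGMENT_BYTES * len(SENDERS)
    print("goodput: %.1f Mbps" % (total_bits / elapsed / 1e6))

The goodput printed at the end is exactly the application-level quantity that collapses in the incast regime; the fixed-fragment flavor is shown, with the fixed-total flavor noted in a comment.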
Slide 8. There is some complexity even with this straightforward workload, because we can run it in two ways. We can keep the fragment size fixed as the number of senders increases; this is how the workload was run in the FAST 2008 incast paper. Alternatively, we can vary the fragment size so that the sum of the fragments is fixed; this is how the workload was run in the SIGCOMM 2009 incast paper. The two workload flavors result in two different ideal behaviors, but fortunately this is the only place where they differ. We used the fixed-fragment workload to ensure comparability with the results in the FAST 2008 incast paper.

Slide 9. We also need to do measurements on physical networks. Our instinct is to simulate the problem on event-based simulators like ns-2, but it turns out that the default models in a simulator like ns-2 are not detailed enough for analyzing the dynamics of this problem; later in the talk you will see several places where this inadequacy comes up. Thus we are compelled to do measurements on physical networks. We used Intel Xeon machines running a recent distribution of Linux, and we implemented our TCP modifications in the Linux TCP stack. The machines are connected by 1 Gbps Ethernet through a Nortel 5500 series switch. We did most of our analysis by looking at TCP socket state variables directly (a sketch of reading this state from user space appears below); we also used tcpdump and tcptrace to leverage some of their aggregation and graphing capabilities.

Slide 10. There are some first fixes to the problem. They work, but they are not enough.

Slide 11. Fixing the mismatch between the RTO and the round-trip time (RTT) is indeed the first step, as demonstrated in the incast presentation yesterday. Many OS implementations contain a TCP RTO minimum constant that acts as the lower bound and the initial value for the TCP RTO timer. The default value of this constant is hundreds of milliseconds, optimized for the wide-area network but far above the RTT in datacenters. Reducing the default value gives us a huge improvement immediately. More interestingly, there are several regions in the graphs: as we increase the number of senders, we go through collapse, recovery, and sometimes another decrease. The recovery and the decrease are not observed in the concurrent work; later you will see why.

Slide 12. When we have small RTO values, we need high-resolution timers to make them effective. This graph shows that there is a difference between low- and high-resolution timers.
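On the methodology point from slide 9, reading TCP socket state directly: one user-space approximation on Linux is the TCP_INFO socket option, which exposes the kernel's struct tcp_info, including the connection's current RTO and smoothed RTT in microseconds. The sketch below is one possibility, not necessarily the mechanism used in this work; it assumes the classic Linux field layout and a placeholder host, and offsets can differ across kernel versions.

    import socket
    import struct

    # TCP_INFO is 11 on Linux; getattr keeps the sketch importable even
    # where the socket module does not define the constant.
    TCP_INFO = getattr(socket, "TCP_INFO", 11)

    def tcp_rto_and_srtt(sock):
        """Return (tcpi_rto, tcpi_rtt) in microseconds for a connected socket.

        Assumes the classic Linux struct tcp_info layout: seven u8 fields
        plus one padding byte, then u32 fields in which tcpi_rto is the
        first and tcpi_rtt the sixteenth. Offsets may differ elsewhere.
        """
        buf = sock.getsockopt(socket.IPPROTO_TCP, TCP_INFO, 104)
        fields = struct.unpack("7B1x24I", buf[:104])
        return fields[7], fields[7 + 15]

    # Usage against a placeholder sender:
    s = socket.create_connection(("sender-1", 9000))
    rto_us, srtt_us = tcp_rto_and_srtt(s)
    print("rto=%d us  srtt=%d us" % (rto_us, srtt_us))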
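And to make slide 11's first fix concrete: on Linux, one way to experiment with a lower RTO minimum without recompiling the kernel is the per-route rto_min metric in iproute2. This is a sketch under assumptions (placeholder subnet and device, root privileges required); the talk describes modifications to the TCP stack itself, so this knob is an alternative, not the method used here.

    import subprocess

    # Sketch, assuming Linux with iproute2 and root privileges: lower the
    # TCP RTO minimum for a placeholder datacenter subnet from the usual
    # ~200 ms default to 1 ms via the per-route "rto_min" metric.
    # "10.0.0.0/24" and "eth0" are placeholders; the route must already
    # exist for "change" to succeed.
    subprocess.run(
        ["ip", "route", "change", "10.0.0.0/24", "dev", "eth0",
         "rto_min", "1ms"],
        check=True,
    )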

