Data Center Networking
Stanford CS144 Lecture 17
Philip Levis, 11/30/11

• Low latencies: µs
• High capacity: GigE, 10 GigE
• Specialized traffic
• Centrally managed

Topology
(picture courtesy of Al-Fares et al., “A Scalable, Commodity Data Center Network Architecture”)

Storage Workload
(picture courtesy of Phanishayee et al., “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)

Query Workload
(picture courtesy of Alizadeh et al., “Data Center TCP (DCTCP)”)

Problems

Per-Pair Bandwidth
(picture courtesy of Al-Fares et al., “A Scalable, Commodity Data Center Network Architecture”)

Incast
(from Phanishayee et al., “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)

Incast Details
(from Phanishayee et al., “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)

Mixed traffic
• Low latency for short flows
• High burst tolerance (incast)
• High throughput for long flows

Recent Research
• New switching topology: Al-Fares et al.
• Fix TCP incast: Vasudevan et al.
• Data Center TCP: Alizadeh et al.

Per-Pair Bandwidth
(picture courtesy of Al-Fares et al., “A Scalable, Commodity Data Center Network Architecture”)

Fat Tree

Fat Tree
[Figure: fat-tree topology, labeled (k/2)^2, k/2, k/2, k]

Switching

Prefix        Port
10.2.0.0/24   0
10.2.1.0/24   1
0.0.0.0/0

Suffix        Port
0.0.0.2/8     2
0.0.0.3/8     3

TCAM
10.2.0.X
10.2.1.X
X.X.X.2
X.X.X.3

Encoder

Prefix   Next Hop   Port
00       10.2.0.1   0
01       10.2.1.1   1
10       10.4.1.1   2
11       10.4.1.2   3

Not Perfect
[Figure: fat-tree topology, labeled (k/2)^2, k/2, k/2, k]

Fat-Tree Status

Incast
• RTO = SRTT + (4 × RTTVAR)

Behavior
(from Phanishayee et al., “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)

RFC 6298
(2.4) Whenever RTO is computed, if it is less than 1 second, then the RTO SHOULD be rounded up to 1 second.
- in practice, often 200 ms

The delayed ACK algorithm specified in [Bra89] SHOULD be used by a TCP receiver. When used, a TCP receiver MUST NOT excessively delay acknowledgments.
Specifically, an ACK SHOULD be generated for at least every second full-sized segment, and MUST be generated within 500 ms of the arrival of the first unacknowledged packet.
- in practice, often 40 ms
RFC 2581

Solutions
• Proposal 1: Adjust RTO (Vasudevan et al.)
• Proposal 2: DCTCP (Alizadeh et al.)

[Figure: timeline with RTT and RTO labels]
• Make RTOmin 200 µs
• Timeout = (RTO + (rand(0.5) × RTO))

Improvement

Wide Area

DCTCP
• Three goals
  • Low latency for short flows
  • High burst tolerance (incast)
  • High throughput for long flows
• Basic approach: keep switch queues short

Queue Length
• RTT measurements are noisy
• At high speeds, very small
• GigE: 10 packets is 120 µs
• 10GigE: 10 packets is 12 µs
• Use ECN (explicit congestion notification)
• RFC 3168

Setting ECN
[Figure: switch queue with threshold K; Set ECN bit]

Monitoring α
• Per RTT, measure F, the fraction of packets sent that had the ECN bit set
• DCTCP ACKs copy the ECN bit of the corresponding data packets into the ECN-Echo field
• Compute α, the EWMA of F

Adjusting cwnd
• cwnd = cwnd × (1 - α/2)

DCTCP Caveat
“We stress that DCTCP is designed for the data center environment. In this paper, we make no claims about suitability of DCTCP for wide area networks.”

Data Center Networks
• Very different from the wide-area Internet
• Tiny RTTs
• Different traffic patterns
• Single administrative domain
• Standards (e.g., IETF) much less important
• A lot of very novel network
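
Sketch: RTO with a configurable minimum and randomized timeout

The Incast and Solutions slides give two formulas: RTO = SRTT + (4 × RTTVAR), clamped to a minimum (1 s per RFC 6298, often 200 ms in practice, 200 µs in Proposal 1), and Timeout = RTO + (rand(0.5) × RTO). The Python below is a minimal sketch of just those two formulas, not a TCP implementation. The function names are hypothetical, and it assumes rand(0.5) means a uniform draw between 0 and 0.5, which the slide does not spell out.

```python
import random


def rto(srtt, rttvar, rto_min):
    """RTO = SRTT + 4 * RTTVAR, rounded up to rto_min (all in seconds)."""
    return max(rto_min, srtt + 4.0 * rttvar)


def randomized_timeout(rto_value, rng=random):
    """Timeout = RTO + rand(0.5) * RTO.

    Assumption: rand(0.5) is a uniform draw between 0 and 0.5, so the
    result lies between RTO and 1.5 * RTO.
    """
    return rto_value + rng.uniform(0.0, 0.5) * rto_value


if __name__ == "__main__":
    srtt, rttvar = 100e-6, 20e-6  # illustrative data center values
    for floor in (1.0, 0.2, 200e-6):
        base = rto(srtt, rttvar, floor)
        print(f"RTOmin={floor:g}s  RTO={base:g}s  "
              f"timeout={randomized_timeout(base):g}s")
```

With microsecond-scale RTTs, the minimum dominates the computed RTO in all three cases, which is why the choice of RTOmin matters so much for incast.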
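
Sketch: DCTCP α and cwnd update

The Monitoring α and Adjusting cwnd slides describe a per-RTT loop: measure F, fold it into α with an EWMA, then scale cwnd by (1 - α/2). The Python below is a rough sketch of that loop under stated assumptions. The class and method names are hypothetical. The EWMA form α ← (1 - g)·α + g·F and the gain g = 1/16 follow the DCTCP paper rather than the slides. Reducing cwnd only in windows that saw at least one mark, and the floor of one segment, are also assumptions of this sketch.

```python
class DctcpSender:
    """Toy model of the sender-side DCTCP window adjustment."""

    def __init__(self, cwnd, g=1.0 / 16):
        self.cwnd = float(cwnd)  # congestion window, in segments
        self.alpha = 0.0         # EWMA of the marked fraction F
        self.g = g               # EWMA gain (assumed; not on the slides)

    def on_window_acked(self, acked, marked):
        """Call once per RTT with packet counts for that window.

        acked:  packets acknowledged in this window
        marked: how many of those ACKs carried ECN-Echo
        """
        f = marked / acked if acked else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * f
        if marked:
            # cwnd = cwnd * (1 - alpha / 2), floored at one segment
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        return self.cwnd


if __name__ == "__main__":
    sender = DctcpSender(cwnd=100)
    for marked in (0, 4, 16, 16, 0):
        cwnd = sender.on_window_acked(acked=16, marked=marked)
        print(f"marked={marked:2d}/16  alpha={sender.alpha:.4f}  cwnd={cwnd:.2f}")
```

When every packet is marked for long enough, α approaches 1 and the update halves cwnd, as standard TCP would. Light marking produces only a small reduction, which is how short queues and high throughput can coexist.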