CS 417: Distributed Systems
Lecture 24: Clusters
Paul Krzyzanowski ([email protected])
12/4/2011, © 2011 Paul Krzyzanowski

Designing highly available systems
• Incorporate elements of fault-tolerant design
  – Replication, TMR (triple modular redundancy)
• A fully fault-tolerant system would offer non-stop availability
  – You can't achieve this!
• Problem: it's expensive!

Designing highly scalable systems
• SMP architecture
• Problem: the performance gain as f(# processors) is sublinear
  – Contention for resources (bus, memory, devices)
  – Also … this solution is expensive!

Clustering
• Achieve reliability and scalability by interconnecting multiple independent systems
• Cluster: a group of standard, autonomous servers configured so they appear on the network as a single machine: a single system image

Ideally…
• A bunch of off-the-shelf machines
• Interconnected on a high-speed LAN
• Appears as one system to users
• Processes are load-balanced
  – May migrate
  – May run on different systems
  – All IPC mechanisms and file access remain available
• Fault tolerant
  – Components may fail
  – Machines may be taken down
…but we don't get all of that (yet), at least not in one package.

Clustering types
• Supercomputing (HPC) and batch processing
• High availability (HA) and load balancing
• High availability / high scalability

Cluster Components

Cluster management
• Software to manage cluster membership
  – What are the nodes in the cluster?
  – Which of those nodes are currently live?
• Quorum
  – The number of elements that must be online for the cluster to function
  – A voting algorithm determines whether the set of nodes has quorum (a majority of nodes) and may keep running
• Keeping track of quorum
  – Count the cluster nodes running the cluster manager
  – If more than ½ are active, the cluster has quorum
  – Forcing a majority avoids split-brain (see the sketch below)
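The majority rule above is easy to make concrete. Here is a minimal sketch, in Python, of a quorum check as a cluster manager might perform it; it assumes only a count of configured nodes and a count of live nodes, and is an illustration rather than any real cluster manager's logic.

```python
def has_quorum(total_nodes: int, live_nodes: int) -> bool:
    """A cluster has quorum when strictly more than half of all
    configured nodes are live. A strict majority guarantees that at
    most one partition can ever hold quorum, which is what prevents
    split-brain: two halves of a partitioned cluster can never both
    conclude that they are "the" cluster."""
    return live_nodes > total_nodes // 2

# A 5-node cluster partitioned into groups of 3 and 2:
# only the 3-node side has quorum and may keep running.
assert has_quorum(5, 3)
assert not has_quorum(5, 2)

# The even-sized case: a 4-node cluster split 2/2 leaves neither
# side with quorum, so the whole cluster stops. This is one reason
# clusters are often configured with an odd number of voters.
assert not has_quorum(4, 2)
```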
Interconnect

Cluster Interconnect
• Provides communication between the nodes in a cluster
• Often a separate network from the LAN
• Sometimes known as a System Area Network (SAN)
• Goals
  – Low latency
    • Avoid OS overhead, layers of protocols, retransmission, etc.
  – High bandwidth
    • High-bandwidth, switched links
    • Avoid the overhead of sharing traffic with non-cluster data
  – Low CPU overhead
  – Low cost
    • Cost usually matters if you're connecting thousands of machines

Example High-Speed Interconnects
• Myricom's Myrinet
  – High-speed switch fabric for low-latency, high-bandwidth interprocess communication between nodes
  – Lower overhead than Ethernet
  – 10 Gbps bandwidth between any two nodes
  – Scales to tens of thousands of nodes
  – Example: used in IBM's Linux Cluster Solution
• InfiniBand
  – Direct interconnect to the CPU's bridge chip
  – 2.5 to 30 Gbps
• 10 Gigabit Ethernet

Disks

Shared storage access
• If an application can run on any machine, how does it access its file data?
• If an application fails over from one machine to another, how does it access its file data?
• Can applications on different machines share files?

Network File Systems
• One option: network file systems (NFS, SMB, AFS, AFP, etc.)
  – Caching may cause inconsistencies
  – Data sharing is usually a problem
  – Performance may be a problem (LAN vs. local bus access)

Shared disk
• Allows multiple systems to share access to disk drives
• Works well if there isn't much contention
• Disk access must be synchronized
  – Synchronization via a distributed lock manager (DLM)
• Cluster File System
  – Each client runs a file system that accesses a shared disk at the block level
  – Examples: IBM General Parallel File System (GPFS), Microsoft Cluster Shared Volumes (CSV), Oracle Cluster File System (OCFS), Red Hat Global File System (GFS)

Cluster File System
• No client/server roles, no disconnected modes
• All nodes are peers and access shared disk(s)
• Distributed Lock Manager (DLM)
  – A process that ensures mutual exclusion when needed
  – Not needed for local file systems on a shared disk
  – Provides inode-based locking and caching control
• Linux GFS (no relation to Google GFS)
  – A cluster file system that accesses storage at the block level
  – Cluster Logical Volume Manager (CLVM): volume management of cluster storage
  – Global Network Block Device (GNBD): block-level storage access over Ethernet; a cheap way to access block-level storage

Shared nothing
• No shared devices
• Each system has its own storage resources
• No need to deal with DLMs
• If machine A needs a resource on B, A sends a message to B
  – If B fails, storage requests have to be switched over to a live node
• Exclusive access to shared storage
  – Multiple nodes may have physical access to the shared storage
  – Only one node at a time is granted exclusive access
  – Exclusive access changes on failover

SAN: Node-Disk Interconnect
• Storage Area Network (SAN)
• A separate network between the nodes and the storage arrays
  – Fibre Channel
  – iSCSI
• Any node can be configured to access any storage through a Fibre Channel switch
• Terminology
  – DAS: direct attached storage
  – SAN: block-level access to a disk via a network
  – NAS: file-level access to a remote file system

Failover

HA issues
• How do you detect failure?
• How long does detection take?
• How does a dead application move or restart?
• Where does it move to?

Heartbeat network
• Machines need to detect faulty systems
  – Heartbeat: a "ping" mechanism (see the sketch at the end of this section)
• Need to distinguish system faults from network faults
  – Useful to maintain redundant networks
  – Send a periodic heartbeat to test a machine's liveness
  – Watch out for split-brain!
• Ideally, use a network with a bounded response time
  – Microsoft Cluster Server supports a dedicated "private network"
    • Two network cards connected with a pass-through cable or hub
  – SAN interconnect
  – Cluster interconnect

Failover Configuration Models
• Active/Passive (N+M nodes)
  – M dedicated failover node(s) for N active nodes
  – Passive nodes do nothing until they're needed
• Active/Active
  – A failed node's workload is redistributed among the remaining nodes

Design options for failover
• Cold failover
  – Application restart from scratch
• Warm failover
  – Restart from the last checkpointed image
  – Relies on the application checkpointing itself periodically
  – May use a write-ahead log (tricky)
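To make the heartbeat mechanism concrete, here is a minimal sketch of a UDP heartbeat sender and monitor in Python. The port number, interval, and timeout are made-up values for illustration; note that the monitor cannot distinguish a dead node from an unreachable one, which is exactly why the slides call for redundant networks and quorum.

```python
import socket
import time

HEARTBEAT_PORT = 9999      # assumed port, chosen for illustration
HEARTBEAT_INTERVAL = 1.0   # send a heartbeat every second
FAILURE_TIMEOUT = 3.0      # presume failure after ~3 missed beats

def send_heartbeats(monitor_addr: str) -> None:
    """Run on each cluster node: periodically announce liveness."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        sock.sendto(b"heartbeat", (monitor_addr, HEARTBEAT_PORT))
        time.sleep(HEARTBEAT_INTERVAL)

def monitor_heartbeats() -> None:
    """Run on a monitoring node: record the last heartbeat seen from
    each sender and report any node silent for too long. A silent
    node may be dead OR merely cut off from this network; the monitor
    alone cannot tell, hence the split-brain danger noted above."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", HEARTBEAT_PORT))
    sock.settimeout(HEARTBEAT_INTERVAL)
    last_seen = {}
    while True:
        try:
            _, (addr, _) = sock.recvfrom(64)
            last_seen[addr] = time.time()
        except socket.timeout:
            pass                      # no beat this interval
        now = time.time()
        for addr, seen in last_seen.items():
            if now - seen > FAILURE_TIMEOUT:
                print(f"node {addr} presumed failed (no heartbeat)")
```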
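And to illustrate warm failover: the sketch below (again illustrative Python; the file name and state format are assumptions, not any particular system's design) shows an application checkpointing its state so that a restart, on the same node or on a failover node with access to the same storage, can resume from the last checkpoint rather than from scratch.

```python
import json
import os

CHECKPOINT_FILE = "app.ckpt"   # assumed path, for illustration only

def checkpoint(state: dict) -> None:
    """Write state to a temporary file, then rename it over the
    checkpoint file (atomic on POSIX), so a crash mid-write never
    leaves a torn, half-written checkpoint behind."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())   # force the data to stable storage
    os.rename(tmp, CHECKPOINT_FILE)

def recover() -> dict:
    """On warm failover, resume from the last checkpoint, if any."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"processed": 0}    # cold start: nothing to restore

# Sketch of the application's main loop: recover, then checkpoint
# periodically. Work done after the last checkpoint is lost on a
# failure, so the interval trades overhead against lost work.
state = recover()
for item in range(state["processed"], 100):
    state["processed"] = item + 1    # ... do the real work here ...
    if state["processed"] % 10 == 0:
        checkpoint(state)
```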