Scalable Clusters
Jed Liu
11 April 2002

Overview
- Microsoft Cluster Service
  - Built on Windows NT
  - Provides high-availability services
  - Presents itself to clients as a single system
- Frangipani
  - A scalable distributed file system

Microsoft Cluster Service
Design goals:
- Cluster composed of COTS components
- Scalability: able to add components without interrupting services
- Transparency: clients see the cluster as a single machine
- Reliability: when a node fails, its services can be restarted on a different node

Cluster Abstractions
- Nodes
- Resources
  - e.g., logical disk volumes, NetBIOS names, SMB shares, mail service, SQL service
- Quorum resource
  - Implements persistent storage for the cluster configuration database and change log
- Resource dependencies
  - Tracks dependencies between resources

Cluster Abstractions (cont'd)
- Resource groups
  - The unit of migration: resources in the same group are hosted on the same node
- Cluster database
  - Configuration data for starting the cluster is kept in a database, accessed through the Windows registry
  - The database is replicated at each node in the cluster

Node Failure
- Active members broadcast periodic heartbeat messages
- Failure suspicion occurs when a node misses two successive heartbeat messages from some other node (see the sketch after the regroup slides)
- The regroup algorithm is then initiated to determine the new membership
- Resources that were online at a failed member are brought online at active nodes

Member Regroup Algorithm
Lockstep algorithm:
1. Activate: each node waits for a clock tick, then starts sending and collecting status messages
2. Closing: determine whether partitions exist, and whether the current node is in a partition that should survive
3. Pruning: prune the surviving group so that all nodes are fully connected

Regroup Algorithm (cont'd)
4. Cleanup: surviving nodes update local membership information as appropriate
5. Stabilized: done
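The failure-suspicion rule that triggers regroup is simple enough to sketch. Below is a minimal Go sketch of the two-missed-heartbeats test; all names here (heartbeatMonitor, Observe, Suspects) are hypothetical illustrations, not the MSCS API.

```go
package main

import (
	"fmt"
	"time"
)

// heartbeatMonitor sketches MSCS-style failure suspicion: a peer is
// suspected once two successive heartbeat intervals pass without a
// message from it. (Hypothetical names; not the MSCS implementation.)
type heartbeatMonitor struct {
	interval time.Duration
	lastSeen map[string]time.Time
}

func newHeartbeatMonitor(interval time.Duration, peers []string) *heartbeatMonitor {
	m := &heartbeatMonitor{interval: interval, lastSeen: make(map[string]time.Time)}
	now := time.Now()
	for _, p := range peers {
		m.lastSeen[p] = now
	}
	return m
}

// Observe records a heartbeat received from a peer.
func (m *heartbeatMonitor) Observe(peer string) {
	m.lastSeen[peer] = time.Now()
}

// Suspects returns peers that have missed two successive heartbeats;
// in MSCS, this is the event that initiates the regroup algorithm.
func (m *heartbeatMonitor) Suspects() []string {
	var suspects []string
	for peer, t := range m.lastSeen {
		if time.Since(t) > 2*m.interval {
			suspects = append(suspects, peer)
		}
	}
	return suspects
}

func main() {
	m := newHeartbeatMonitor(100*time.Millisecond, []string{"nodeA", "nodeB"})
	time.Sleep(250 * time.Millisecond) // nodeB stays silent for >2 intervals
	m.Observe("nodeA")
	fmt.Println("suspected:", m.Suspects()) // [nodeB] -> initiate regroup
}
```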
Joining a Cluster
- A sponsor authenticates the joining node
  - Denies access if the applicant isn't authorized to join
- The sponsor sends version info for the config database
  - Also sends updates as needed, if changes were made while the applicant was offline
- The sponsor atomically broadcasts information about the applicant to all other members
- Active members update their local membership information

Forming a Cluster
- Use the local registry to find the address of the quorum resource
- Acquire ownership of the quorum resource
  - An arbitration protocol ensures that at most one node owns the quorum resource
- Synchronize the local cluster database with the master copy

Leaving a Cluster
- The member sends an exit message to all other cluster members and shuts down immediately
- Active members gossip about the exiting member and update their cluster databases

Node States
- Inactive nodes are offline
- Active members are either online or paused
- All active nodes participate in cluster database updates, vote in the quorum algorithm, and maintain heartbeats
- Only online nodes can take ownership of resource groups

Resource Management
- Achieved by invoking calls through a resource control library (implemented as a DLL)
- Through this library, MSCS can monitor the state of the resource

Resource Migration
Reasons for migration:
- Node failure
- Resource failure
- The resource group prefers to execute at a different node
- Operator-requested migration
In the first case, the resource group is pulled to the new node; in all other cases, it is pushed.

Pushing a Resource Group
- All resources at the old node are brought offline
- The old host node chooses a new host
- The local copy of MSCS at the new host brings up the resource group

Pulling a Resource Group
- Active nodes capable of hosting the group determine amongst themselves the new host for the group
- The new host is chosen based on attributes stored in the cluster database
  - Since the database is replicated at all nodes, the decision can be made without any communication! (A sketch of this appears at the end of this MSCS part.)
- The new host brings the resource group online

Client Access to Resources
- Normally, clients access SMB resources using names of the form \\node\service
- This presents a problem: as resources migrate between nodes, the resource name changes
- With MSCS, whenever a resource migrates, the resource's network name also migrates as part of the resource group
- Clients see only services and their network names; the cluster becomes a single virtual node

Membership Manager
- Maintains consensus among active nodes about who is active and who is defined
- A join mechanism admits new members into the cluster
- A regroup mechanism determines the current membership on startup or suspected failure

Global Update Manager
- Used to implement atomic broadcast
- A single node in the cluster is always designated as the locker
- The locker node takes over the atomic broadcast if the original sender fails mid-broadcast
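The locker's role is easiest to see in code. The following Go sketch shows the idea under stated assumptions: the sender installs an update at the locker before fanning out, and idempotent delivery lets the locker finish a broadcast the sender abandoned. The types and names are illustrative, not MSCS internals.

```go
package main

import "fmt"

// A minimal sketch of the locker idea behind the Global Update
// Manager. Hypothetical types; not the MSCS implementation.

type update struct {
	seq     int
	payload string
}

type node struct {
	name string
	log  map[int]string // applied updates, keyed by sequence number
}

// apply is idempotent: re-delivering the same sequence number is a
// no-op, so the locker can safely re-send to every member.
func (n *node) apply(u update) {
	if _, done := n.log[u.seq]; !done {
		n.log[u.seq] = u.payload
	}
}

func main() {
	locker := &node{"locker", map[int]string{}}
	members := []*node{
		{"n1", map[int]string{}},
		{"n2", map[int]string{}},
		{"n3", map[int]string{}},
	}
	u := update{seq: 1, payload: "config change"}

	// Step 1: the sender installs the update at the locker first.
	locker.apply(u)

	// Step 2: the sender updates members one by one, but crashes
	// after reaching only the first member.
	members[0].apply(u)
	// -- sender fails here --

	// Step 3: the locker detects the incomplete broadcast and
	// re-delivers to every member; idempotence makes this safe.
	for _, m := range members {
		m.apply(u)
	}

	for _, m := range members {
		fmt.Println(m.name, "has update 1:", m.log[1])
	}
}
```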
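As promised on the Pulling a Resource Group slide, here is a sketch of why pulling needs no communication: every active node evaluates the same deterministic rule over the same replicated cluster database, so all nodes independently pick the same new host. The database fields shown (online state, preference order) are guesses for illustration, not the real MSCS schema.

```go
package main

import (
	"fmt"
	"sort"
)

// nodeInfo is an illustrative stand-in for one row of the replicated
// cluster database.
type nodeInfo struct {
	name   string
	online bool
	pref   int // the group's preference for this node (lower is better)
}

// chooseHost picks the online node the group most prefers, breaking
// ties by name so the rule is fully deterministic: every node that
// evaluates it over the same replica gets the same answer.
func chooseHost(db []nodeInfo) (string, bool) {
	cands := make([]nodeInfo, 0, len(db))
	for _, n := range db {
		if n.online {
			cands = append(cands, n)
		}
	}
	if len(cands) == 0 {
		return "", false
	}
	sort.Slice(cands, func(i, j int) bool {
		if cands[i].pref != cands[j].pref {
			return cands[i].pref < cands[j].pref
		}
		return cands[i].name < cands[j].name
	})
	return cands[0].name, true
}

func main() {
	// Every node holds an identical replica of this table, so each
	// one independently arrives at the same host without messages.
	db := []nodeInfo{
		{"nodeA", false, 0}, // the failed host
		{"nodeB", true, 2},
		{"nodeC", true, 1},
	}
	host, ok := chooseHost(db)
	fmt.Println(host, ok) // "nodeC true" -- on every node
}
```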
Frangipani
Design goals:
- Provide users with coherent, shared access to files
- Arbitrarily scalable, to provide more storage and higher performance
- Highly available in spite of component failures
- Minimal human administration:
  - Full and consistent backups can be made of the entire file system without bringing it down
  - Complexity of administration stays constant despite the addition of components

Server Layering
[Figure: user programs run on top of Frangipani file servers; the file servers sit on the Petal distributed virtual disk service and a distributed lock service, which in turn sit on the physical disks.]

Assumptions
- Frangipani servers trust:
  - One another
  - The Petal servers
  - The lock service
- Meant to run in a cluster of machines that are under a common administration and can communicate securely

System Structure
- Frangipani is implemented as a file system option in the OS kernel
- All file servers read and write the same file system data structures on the shared Petal disk
- Each file server keeps a redo log in Petal so that when it fails, another server can access the log and run recovery on its behalf
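The redo-log sentence above is the heart of Frangipani's failure story, so a small sketch may help: each server logs its metadata updates to its own region of the shared disk, and any surviving server can replay that log for a failed peer. The record layout below is an illustrative guess, not Frangipani's actual on-disk format (the real system also tags blocks with version numbers to keep replay safe against newer writes).

```go
package main

import "fmt"

// logRecord is a hypothetical redo-log entry: one metadata-block
// update, in the order the failed server issued it.
type logRecord struct {
	seq   uint64 // monotonically increasing per server
	block int    // metadata block the update touches
	data  string // new contents for that block
}

// replay applies a failed server's log to the shared disk image.
// Overwriting whole blocks in log order makes replay idempotent, so
// recovery itself can be retried if the recovering server crashes.
func replay(disk map[int]string, log []logRecord) {
	for _, r := range log {
		disk[r.block] = r.data
	}
}

func main() {
	sharedDisk := map[int]string{} // stands in for the Petal virtual disk
	failedServersLog := []logRecord{
		{seq: 1, block: 7, data: "inode update"},
		{seq: 2, block: 9, data: "directory entry"},
	}
	// A surviving Frangipani server runs recovery on the failed
	// server's behalf by replaying its log against the shared disk.
	replay(sharedDisk, failedServersLog)
	fmt.Println(sharedDisk)
}
```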