© 2003 A.W. Krings, CS449/549 Fault-Tolerant Systems, Sequence 30: Tandem

Tandem NonStop Systems - Cyclone
– Affordable commercial database systems with very long MTTF
– Modularity
  » units of service, failure, diagnosis, repair, growth
  » fault containment regions
  » expandable for performance
– Fail-Fast Mode
  » terminate operation immediately after error detection
  » reduces error propagation
  » single error correction / double error detection
  » ECC, data coding
  » hardware self-checking
  » software and firmware consistency checks
  » after a failure, the OS redistributes the failed processor's applications onto the remaining processors
  » load balancing is transparent to the user
– Architecture (overview in Pra96, Fig. 4.1)
  » loosely coupled MIMD, up to 16 processors
  » dual processors, independent & asynchronous
  » heavy use of low-level dual redundancy
  » multiple, physically separate sections
  » each section: up to 4 processors, communication via Dynabus
  » write-through cache
  » mirrored disks
– Processor Pair
  » primary/backup approach
  » primary sends checkpoints
  » when the primary processor fails:
      backup becomes primary
      rolls back to last checkpoint
      picks up from that point
– Hardware Fault Tolerance
  » single fault tolerance
  » primary objective: prevent a single fault from bringing down the system
  » redundant hardware: processors, buses, I/O controllers, disks, power supplies
  » spare RAM chips
  » each processor has its own power supply
– Software Fault Tolerance
  » processors can detect other halted processors
  » "I'm alive" protocol
  » GUARDIAN 90 OS maintains idle backups of user processes
  » processor consistency check via checkpoint messages
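The "I'm alive" protocol listed under software fault tolerance can be illustrated with a small timeout-based monitor. This is only a sketch of the general idea, not Tandem's implementation: the class name `AliveMonitor`, the 2-second timeout, and the explicit `now` timestamps (used instead of a real clock so the run is deterministic) are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AliveMonitor:
    """Tracks the last "I'm alive" message heard from each peer processor."""

    timeout: float  # assumed silence threshold in seconds (hypothetical value)
    last_heard: dict[int, float] = field(default_factory=dict)

    def register(self, cpu: int, now: float) -> None:
        # A peer counts as alive at the moment it is registered.
        self.last_heard[cpu] = now

    def on_alive(self, cpu: int, now: float) -> None:
        # Record the arrival time of an "I'm alive" message from this peer.
        self.last_heard[cpu] = now

    def halted(self, now: float) -> list[int]:
        # Peers silent for longer than the timeout are presumed halted.
        return sorted(
            cpu for cpu, t in self.last_heard.items() if now - t > self.timeout
        )


monitor = AliveMonitor(timeout=2.0)
for cpu in (0, 1, 2):
    monitor.register(cpu, now=0.0)

# CPUs 0 and 2 keep reporting; CPU 1 goes silent after t = 0.
monitor.on_alive(0, now=1.0)
monitor.on_alive(2, now=1.5)
monitor.on_alive(0, now=2.5)
monitor.on_alive(2, now=2.8)

suspects = monitor.halted(now=3.0)
print(suspects)
```

In this run only CPU 1 has been silent for longer than the timeout, so it is the only processor reported as halted. A real system would follow detection with the takeover of that processor's work by the remaining processors.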
– Networking and I/O
  » Networks
      Dynabus: 40 MB/s = 2 independent 20 MB/s buses
      Dynabus+: 4 unidirectional fiber optics
        – up to 50 m physical separation
        – robust to electromagnetic interference
  » I/O
      processor can support 2 I/O systems
      each system has 2 channels
      each channel supports up to 32 I/O devices
      burst data of 5 MB/s = 10 MB/s per processor
      DMA I/O
      mirrored disks (dual ported)
– On-line Maintenance
  » Field replaceable units (FRU)
      processors
      I/O controllers
      fans
      power supplies
      can be installed/replaced by user
  » Warm swaps of FRU
  » Effective MTTR = milliseconds => very high availability

Tandem - Himalaya
– Main features
  » loosely coupled massively parallel computer
  » 2 to 4080 processors
  » cross-coupled MIPS R4400 RISC processors
      one logical processor
      both processors operate in lockstep
  » 32K primary cache, 4MB secondary cache
  » up to 256 MB RAM
  » 4 independent I/O channels
  » fiber-optic TorusNet
      horizontal controller => 4 sections (each section = 4 processors)
      vertical controller => 14 nodes = domain
      depth controller => 16 domains
[Figure: Himalaya 2000]
[Figure: K2000SE server]
[Figure: optional expansion cabinet]
– TorusNet
  » section
  » node
  » ring
– K200, K2000, K20000 Servers Spec.
  Features:
  » Target: online transaction processing
  » standard RISC technology
  » loosely coupled architecture
  » dual interprocessor buses
  » dual-ported controllers
  » fault-tolerant power subsystem
  » in case of a power outage, server memory is preserved via integrated battery backup
– NonStop Operating System
  » core of Tandem's open systems environment
  » enables operations to run as primary and backup processes
  » before performing any critical function, the primary sends the backup process a checkpoint message containing data and status information
  » kernel supports end-to-end integrity features
  » micro-kernel is message-based (parallel processing software)
  » kernel supports application program and operations control interfaces called "personalities"
  » these personalities support applications from different platforms
  » e.g. relational database management personalities
      applications can be developed using: SQL, Data Access Language (Macintosh), SQL Server (Microsoft/Sybase), ODBC (Microsoft), Oracle Tools (Oracle)
  » other personalities are transaction processing personalities
      allow parallel transaction processing services for different systems
  » "guardian services" allow compatibility with Tandem applications
  » "open systems services" support UNIX
  » Transaction Manager (NonStop TM/MP) deals with the effects of incomplete transactions, system failures and network failures
  » Remote Duplicate Facility allows data to be located remotely to shield it from environmental disasters
  » Safeguard security management facility deals with security issues
  » Network support includes TCP/IP, IPX/SPX, NETBIOS, AppleTalk, SNA, OSI and ATM
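The checkpoint rule described for the NonStop operating system (the primary sends data and status to its backup before each critical function, and the backup resumes from the last checkpoint after a failure) can be sketched as follows. This is a toy model under stated assumptions, not the NonStop kernel's mechanism: the names `Backup`, `PrimaryFailed`, `run`, and the `fail_at` parameter are hypothetical, the "critical function" is a simple balance update, and the new primary runs without a backup of its own.

```python
import copy


class PrimaryFailed(Exception):
    """Raised to simulate a fail-fast halt of the primary process."""


class Backup:
    """Idle backup process: it only stores the most recent checkpoint."""

    def __init__(self):
        self.checkpoint = {"balance": 0, "next_op": 0}

    def receive_checkpoint(self, state):
        # Keep a private copy so later changes by the primary cannot leak in.
        self.checkpoint = copy.deepcopy(state)

    def take_over(self):
        # Roll back to the last checkpoint and continue from that point.
        return copy.deepcopy(self.checkpoint)


def run(ops, state, backup=None, fail_at=None):
    """Apply ops from state["next_op"] onward, checkpointing before each one."""
    while state["next_op"] < len(ops):
        i = state["next_op"]
        if backup is not None:
            # Checkpoint message: data (balance) and status (next operation).
            backup.receive_checkpoint(state)
        if fail_at == i:
            raise PrimaryFailed(f"primary halted before completing op {i}")
        state["balance"] += ops[i]  # the "critical function" in this toy model
        state["next_op"] = i + 1
    return state


ops = [100, -30, 50, -20]
backup = Backup()
final = None
try:
    final = run(ops, {"balance": 0, "next_op": 0}, backup, fail_at=2)
except PrimaryFailed:
    # The backup becomes primary and picks up from the last checkpoint.
    final = run(ops, backup.take_over())

print(final)
```

Because the checkpoint is sent before each operation, the backup knows exactly which operation was in flight when the primary halted, so every operation in `ops` is applied exactly once and the final balance equals `sum(ops)`.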
– Maintenance
  » key data are logged and evaluated by an expert system to identify potential problems
  » can automatically dial the online support center
  » field replaceable units can be exchanged by warm swaps
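The link between warm swaps, a very short effective MTTR, and very high availability follows from the standard steady-state relation availability = MTTF / (MTTF + MTTR). The short calculation below illustrates it. All numbers are assumptions chosen for illustration (a 10-year MTTF, a 4-hour repair versus a 10 ms effective repair), not figures from Tandem.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail):
    """Expected downtime per year implied by a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60


mttf = 10 * HOURS_PER_YEAR           # assumed MTTF of 10 years
slow_repair = 4.0                    # assumed 4-hour repair with the system down
warm_swap = 0.010 / 3600             # assumed 10 ms effective MTTR, in hours

a_slow = availability(mttf, slow_repair)
a_warm = availability(mttf, warm_swap)

print(f"4 h MTTR:   A = {a_slow:.6f}, "
      f"downtime = {downtime_minutes_per_year(a_slow):.1f} min/year")
print(f"10 ms MTTR: A = {a_warm:.12f}, "
      f"downtime = {downtime_minutes_per_year(a_warm):.6f} min/year")
```

With the MTTF held fixed, shrinking the MTTR drives the availability toward 1, which is why masking repairs through redundancy and warm swaps matters as much as making failures rare.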