
Modern Distributed Systems Design: Security and High Availability
Mark Chekhanovskiy
CS265 Fall 2002

Research Topic: Modern Distributed Systems Design: Security and High Availability

- Measuring Availability
  How are resiliency and high availability interconnected?
  Define downtime and what causes it. How do we measure availability?
- Highly Available Data Management
  Data management is the most sensitive area of modern distributed systems.
  Quick overview of existing data topologies
- Redundant System Design
  Redundant storage (RAID, multi-hosting, multi-pathing, disk arrays, JBOD, etc.)
  Failover configurations and management
  Introduction to SAN and the Fibre Channel protocol
  Security aspects of data management in Storage Area Networks

Resilience and high availability mean that all of a system's failure modes are known and well defined, including those of networks and applications. They mean that the recovery times for all known failures have an upper bound; we know how long a particular failure will keep the system down. While there may be certain failures that we cannot cope with very well, we know what they are and how to recover from them, and we have backup plans for use if our recoveries don't work. A resilient system is one that can take a hit to a critical component, recover, and come back in a known, bounded and generally acceptable period of time.

Measuring availability always turns out to be about measuring cost. In mission-critical systems, availability is the most important property to consider. We must not only treat cost as a burden but also be able to evaluate it when "100%" uptime is required. Unfortunately, just pairing two servers that each have 99% uptime won't make our system 99.99% available, even though basic mathematics suggests it should.
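
The "basic mathematics" behind the 99.99% figure can be sketched as follows. This is a minimal illustration, not part of the original notes: the function names are hypothetical, and the calculation assumes that the two servers fail independently and that failover is instantaneous. Real clusters rarely satisfy either assumption, which is one reason the paired system falls short of the number computed here.

```python
HOURS_PER_YEAR = 365 * 24


def parallel_availability(per_node: float, nodes: int) -> float:
    """Availability of `nodes` redundant servers.

    Assumption (not from the notes): failures are independent and
    failover takes no time, so the system is down only when every
    node is down at once.
    """
    return 1.0 - (1.0 - per_node) ** nodes


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime, in hours, for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.99
    pair = parallel_availability(single, 2)
    print(f"single server: {single:.4%}, "
          f"{downtime_hours_per_year(single):.2f} h/year down")
    print(f"ideal pair:    {pair:.4%}, "
          f"{downtime_hours_per_year(pair):.2f} h/year down")
```

The gap between this idealized figure and what a real pair achieves comes from shared failure causes (power, network, software bugs, operator error) and from the time the failover itself takes.
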
Downtime can be defined as follows: "If a user cannot get his job done on time, the system is down."

Percentage uptime   Percentage downtime   Downtime per year   Downtime per week
98%                 2%                    7.3 days            3h22m
99%                 1%                    3.65 days           1h41m
99.8%               0.2%                  17h30m              20m10s
99.9%               0.1%                  8h45m               10m5s
99.99%              0.01%                 52.5m               1m
99.999%             0.001%                5.25m               6s
99.9999%            0.0001%               31.5s               0.6s

The many causes of system downtime fall into the following categories:
- Planned: the easiest to reduce; covers scheduled system maintenance, hot-swappable hard drives, cluster upgrades and even failovers. Usually 30% of all downtime;
- People, or the human factor: simple mistakes, plus the growing complexity of IT equipment, software and protocols, which demands greater knowledge from engineers. Usually 15% of all downtime;
- Software failures: due to software bugs and viruses. Usually 40% of all downtime.

Availability = MTBF / (MTBF + MTTR)

where MTBF is the "mean time between failures" and MTTR is the "maximum time to repair".

So what could go wrong in modern distributed systems?
- Hardware
- Environmental and physical failures
- Network failures
- Database system failures (application crashes or hangs, resource shortfalls, index corruption, buggy software)
- Web server failures
- File and print server failures

The Cost of Downtime:

Industry         Business operation                 Average downtime cost per hour
Financial        Brokerage operation                $6.45M
Financial        Credit card/sales authorization    $2.6M
Media            Pay-per-view TV                    $150K
Retail           Catalog sales                      $90K-$115K
Transportation   Airlines                           $89.5K

Levels of Availability:
1. Regular Availability - Do nothing special;
2. Increased Availability - Protect the data (use RAID);
3. High Availability - Protect the system (loosely coupled servers; this is where you hit the jackpot of 99.98% availability);
4. Disaster Recovery - Protect the organization;
5. Fault-Tolerant System - A system built from double- and triple-redundant components working in parallel.

Disks, and the data stored on them, are the most critical part of just about any computer system. Why?
1. Disks are the most likely component to fail
2. Disks contain data
3. The data must be protected
4. Data accessibility must be ensured

How storage is managed:
1. Disk(s) in a computer system (e.g. SCSI). SCSI-1 ran at 3-5 MB/s and evolved to Ultra-3 at 160 MB/s. Initiators and targets are connected to a shared bus.
2. Fibre Channel: speeds of 100 MB/s, evolving to 2 Gb/s. Devices can be 2 kilometers apart on a Fibre Channel network.
3. Multi-hosting: one set of disks connected to more than one server.
4. Multi-pathing: connecting a single host to a single disk array with more than one data path.
5. Disk array: a single enclosure or cabinet containing slots for many disks.
6. JBOD (Just a Bunch of Drives): a collection of disks with no hardware intelligence.
7. SAN (Storage Area Network): breaks the one-to-one relation between server and storage; provides storage virtualization, better utilization, centralized management and allocation, intrinsic resilience and high availability, no need for disk co-location, complex failover configurations, efficient resource deployment and LAN-free backups.

The RAID standard describes several ways to combine and manage a set of independent disks so that the resulting combination provides a level of disk redundancy.

RAID-0: striping. Each chunk of data to be written to disk is broken into smaller segments, with each segment written to a separate disk. (Increases performance but decreases availability.)

RAID-1: mirroring. A copy of every byte on a disk is kept on a second disk. Keeping more than one copy of the data raises the level of redundancy. It gives good protection against the inevitable loss of a disk, and read performance is better than without mirroring. It requires 100% disk overhead. Resynchronizing a failed drive requires a complete block-for-block copy of the contents of the original disk, which takes time and a lot of I/O.

RAID-1+0: striped mirrors.

RAID 3, 4 and 5: parity RAID. Each RAID volume requires space equivalent to one extra disk.
This additional disk's blocks contain calculated parity, which is generated by taking the XOR of the contents of the corresponding data blocks on all other disks in the RAID volume.

RAID-3: virtual disk blocks. Every disk operation touches all disks, regardless of the size of the write.

RAID-4: dedicated parity disk. An entire disk is devoted to RAID volume parity.

RAID-5: striped parity region. Works like RAID-4, but the parity is striped across all disks.

Windows and UNIX operating systems use a journaled file system (JFS) to reduce the risk of losing data after system crashes and, correspondingly, to increase system availability by limiting the need for a full disk scan. JFS reserves a chunk of disk space for a
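
The XOR parity calculation described above for RAID 3, 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper name `xor_blocks` and the three tiny four-byte "disks" are hypothetical and chosen only to show that a lost block can be rebuilt from the surviving blocks plus the parity block.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte blocks."""
    return bytes(
        reduce(lambda a, b: a ^ b, column)  # XOR one byte position across all blocks
        for column in zip(*blocks)
    )


# Three hypothetical data blocks, one per data disk.
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is the XOR of the corresponding data blocks.
parity = xor_blocks(data)

# Simulate losing the second disk: XOR the survivors with the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

assert rebuilt == data[1]
print("rebuilt block:", rebuilt)
```

Because XOR is its own inverse, the same function both generates the parity and reconstructs any single missing block; losing two blocks at once cannot be repaired this way.
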



SJSU CS 265 - MDSD
