Slide outline:
Monitoring, Diagnosing, and Repairing
Overview
What is the problem?
Goals of Dissertation Research
Goals of System Administration
Monitoring, Diagnosing, and Repairing (MDR)
MDR: Examples - Intro
MDR: Example 1
MDR: Example 2
MDR: Example 3
MDR: Example 4
MDR: Fundamental Requirements
MDR: Environmental Constraints
MDR: Previous Systems
MDR: Previous Systems, cont.
MDR: Six Key Innovations (1-3)
MDR: Six Key Innovations (4-6)
MDR: Architecture
MDR-Arch: Derivations
Key: Semi-Hier. DBs.
Key: Self-Describing
Key: End-to-End Notification
Key: Aggregation & HiRes
Key: Agg & HiRes: Snapshot
Key: Self-Configuring
Key: Secure Remote Actions
MDR: Testing Methodology
MDR: Demo
Timeline: Key Pieces
Timeline
Conclusion
Old Slides
Solutions
Managing Stable Storage
Supporting Users
Goals: Environment
Goals: Environment, cont.
Goals: Faults & Errors
Goals: Users
Simplifying Security
Slide 41

Monitoring, Diagnosing, and Repairing
Eric Anderson
U.C. Berkeley
Jan 13, 2019

Overview
What is System Administration?
  – What is the problem?
  – Goals of Dissertation Research
  – Goals of System Administration
Monitoring, diagnosing, and repairing
Dissertation Timeline
Conclusion

What is the problem?
Problems occur in systems, and result in loss of productivity
  – Server failures → denial of service
  – System overload → lower productivity
Cost is too high
  – Cost of ownership estimated at $5,000-$15,000/year/machine
  – Median salary (~50k) / (median # machines/admin) → $700
Our goal: Reduce cost by
  – Repairing problems faster (possibly automatically)
  – Handling more problems

Goals of Dissertation Research
Describe field of System Administration
Monitoring, Diagnosing, and Repairing:
  – Approach: Synthesize solutions from other fields of research
1) Detect previously ignored problems
2) Automatic repair of some problems
3) Reduce number of administrators needed
4) Support users’ understanding of system
Apply here & distribute software
Thesis: Through our approach, we can achieve goals 1-4.

Goals of System Administration
Goal: Support cost-effective use of the computer environment
More specifically (some non-technical):
Environment: uniform, customizable, high performance, and available
Faults & errors: recovery from benign errors, protection from malicious attacks
Users: training, accounting & planning, legal

Monitoring, Diagnosing, and Repairing (MDR)
- Introductory examples
- Fundamental requirements
- Environmental constraints
- Previous work
- Six key innovations
- Architecture
- Details on innovations
- Evaluation methodology

MDR: Examples - Intro
Four examples
1) Broken component
2) Resource overload: transient
3) Resource contention: user program
4) Resource exhaustion: long term
Previous Solutions
  – Pay someone to watch
  – Ignore or wait for someone to complain
  – Specialized scripts (not general → vast repeated work)

MDR: Example 1
Web server has crashed/hung
Gather information: process existence, service uptime, restart times
Analyze data: process not responding, and hasn’t been recently restarted.
Automatic repair: restart daemon.
Notify administrator: had to restart daemon.

MDR: Example 2
The NOW is “slow.”
Gather data: load, process info, CPU info
Analyze data: bounds on expected values
Notify administrator: fileserver overloaded.
Visualize data: nfsd’s are overloaded.
Repair: admin moves data, adds disks, or starts more nfsd’s

MDR: Example 3
User running program
Gather: user statistics, CPU, disk
Visualize: spending too much time waiting on remote accesses
(User fixes program, gathering, visualization repeated)
Analyze: some nodes have less throughput
Visualize: those have other jobs running on them
Repair: user is benchmarking so kills all extraneous processes

MDR: Example 4
Web server load increasing beyond capacity
Gather: CPU, request rate, reply latency
Analyze: Burst lengths getting longer, latency increasing
Visualize: Graph of burst lengths & CPU usage over time
Repair: Order more machines, install load balancer

MDR: Fundamental Requirements
- Gathering
  - Flexible data gathering, self-describing storage
- Analyzing
  - Calculate statistical measures, identify relevant statistics.
- Notifying
  - Flexible infrequent messages to administrators or users
- Visualizing
  - Maximize information/pixel, support multiple interfaces
- Repairing
  - Automate simple repairs, support group operations

MDR: Environmental Constraints
Change is inherent
  – Lack of Web/Mbone 5 years ago, now most/many have these.
Problems on many time-scales
  – Second-Minute transients vs. Week-Month capacity problems
Must operate under very adverse conditions
  – Often used when system is broken
  – Would like at least post-mortem analysis
Need to handle hundreds to thousands of nodes
  – Scalability: All sites are getting larger, possibly wide area
  – Our system has 200 (NOW) to 2000 (Soda) nodes

MDR: Previous Systems
Many previous systems: I’ve looked at about 16.
Not comprehensive, not extensible.
Look at a few that did a nice job of a piece:
[Fink97] Run test, notify display engine
  + Easy to add tests
  + Selectivity of notification good
  – Tests are just programs (redo gathering)
  – Central, non-fault-tolerant solution
  – Many hard-coded constants

MDR: Previous Systems, cont.
[Hard92] buzzerd: Pager notification system
  + Flexible rules for notification
  + External interface for adding notify requests
  – Simplistic gathering
  – Poor fault tolerance
[Pier96] Igor group fixes
  + Flexible operations
  + Nice reporting of success/failure
  – Weak security, runs as root
  – No delegation of responsibility

MDR: Six Key Innovations (1-3)
Replicated, semi-hierarchical, data storage nodes
  – Rendezvous point for programs
  – Handles scaling and fault-tolerance
Self-describing structures
  – Functions (visualize, summarize) + data go in database (OO)
  – DB has machine and human readable descriptions of data
End-to-end notification
  – Detect problems in MDR
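
Sketch: the gather / analyze / repair / notify cycle from MDR: Example 1
The following is a minimal illustration of the Example 1 flow (daemon not responding and not recently restarted, so restart it and tell the administrator). It is not the MDR implementation. The names DaemonStatus, decide_action, and run_cycle, the 300-second restart gap, and the "escalate" branch for a daemon that was restarted recently are all assumptions made for this sketch; the slides only describe the restart case. Gathering is assumed to have already happened, and the restart and notify actions are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class DaemonStatus:
    """Gathered facts about one daemon (field names are hypothetical)."""
    name: str
    responding: bool              # did the service answer a probe?
    seconds_since_restart: float  # time since the last restart


def decide_action(status: DaemonStatus, min_restart_gap: float = 300.0) -> str:
    """Analyze step: return 'ok', 'restart', or 'escalate'.

    'restart' matches the slide: not responding and not recently restarted.
    'escalate' is an assumption for the case the slide leaves open.
    """
    if status.responding:
        return "ok"
    if status.seconds_since_restart >= min_restart_gap:
        return "restart"
    return "escalate"


def run_cycle(
    statuses: Iterable[DaemonStatus],
    restart: Callable[[str], None],
    notify: Callable[[str], None],
    min_restart_gap: float = 300.0,
) -> List[Tuple[str, str]]:
    """Repair and notify steps: act on each status, return (name, action) pairs."""
    actions = []
    for status in statuses:
        action = decide_action(status, min_restart_gap)
        if action == "restart":
            restart(status.name)
            notify(f"had to restart {status.name}")
        elif action == "escalate":
            notify(f"{status.name} not responding after a recent restart")
        actions.append((status.name, action))
    return actions


if __name__ == "__main__":
    # Stand-in actions: print instead of touching real services.
    gathered = [
        DaemonStatus("httpd", responding=False, seconds_since_restart=3600.0),
        DaemonStatus("nfsd", responding=True, seconds_since_restart=10.0),
    ]
    run_cycle(gathered, restart=lambda n: print("restart", n), notify=print)
```

Keeping the analysis in a pure function separate from the repair and notify callables means the decision rule can be checked without touching a live system, which fits the constraint that the tool is often used when the system is already broken.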