
Monitoring, Diagnosing, and Repairing


Slide titles:
Monitoring, Diagnosing, and Repairing
Overview
What is the problem?
Goals of Dissertation Research
Goals of System Administration
Monitoring, Diagnosing, and Repairing (MDR)
MDR: Examples - Intro
MDR: Example 1
MDR: Example 2
MDR: Example 3
MDR: Example 4
MDR: Fundamental Requirements
MDR: Environmental Constraints
MDR: Previous Systems
MDR: Previous Systems, cont.
MDR: Six Key Innovations (1-3)
MDR: Six Key Innovations (4-6)
MDR: Architecture
MDR-Arch: Derivations
Key: Semi-Hier. DBs.
Key: Self-Describing
Key: End-to-End Notification
Key: Aggregation & HiRes
Key: Agg & HiRes: Snapshot
Key: Self-Configuring
Key: Secure Remote Actions
MDR: Testing Methodology
MDR: Demo
Timeline: Key Pieces
Timeline
Conclusion
Old Slides
Solutions
Managing Stable Storage
Supporting Users
Goals: Environment
Goals: Environment, cont.
Goals: Faults & Errors
Goals: Users
Simplifying Security
Slide 41

Monitoring, Diagnosing, and Repairing
Eric Anderson
U.C. Berkeley
Jan 13, 2019

Overview
- What is System Administration?
  - What is the problem?
  - Goals of Dissertation Research
  - Goals of System Administration
- Monitoring, diagnosing, and repairing
- Dissertation Timeline
- Conclusion

What is the problem?
- Problems occur in systems, and result in loss of productivity
  - Server failures -> denial of service
  - System overload -> lower productivity
- Cost is too high
  - Cost of ownership estimated at $5,000-$15,000/year/machine
  - Median salary (~50k) / (median # machines/admin) -> $700
- Our goal: Reduce cost by
  - Repairing problems faster (possibly automatically)
  - Handling more problems

Goals of Dissertation Research
- Describe field of System Administration
- Monitoring, Diagnosing, and Repairing:
  - Approach: Synthesize solutions from other fields of research
    1) Detect previously ignored problems
    2) Automatic repair of some problems
    3) Reduce number of administrators needed
    4) Support users' understanding of system
- Apply here & distribute software
- Thesis: Through our approach, we can achieve goals 1-4.

Goals of System Administration
- Goal: Support cost-effective use of the computer environment
- More specifically (some non-technical):
  - Environment: uniform, customizable, high performance and available
  - Faults & errors: recovery from benign errors, protection from malicious attacks
  - Users: training, accounting & planning, legal

Monitoring, Diagnosing, and Repairing (MDR)
- Introductory examples
- Fundamental requirements
- Environmental constraints
- Previous work
- Six key innovations
- Architecture
- Details on innovations
- Evaluation methodology

MDR: Examples - Intro
- Four examples
  1) Broken component
  2) Resource overload: transient
  3) Resource contention: user program
  4) Resource exhaustion: long term
- Previous Solutions
  - Pay someone to watch
  - Ignore or wait for someone to complain
  - Specialized scripts (not general -> vast repeated work)

MDR: Example 1
Web server has crashed/hung
- Gather information: process existence, service uptime, restart times
- Analyze data: process not responding, and hasn't been recently restarted.
- Automatic repair: restart daemon.
- Notify administrator: had to restart daemon.

MDR: Example 2
The NOW is "slow."
- Gather data: load, process info, CPU info
- Analyze data: bounds on expected values
- Notify administrator: fileserver overloaded.
- Visualize data: nfsd's are overloaded.
- Repair: admin moves data, adds disks, or starts more nfsd's

MDR: Example 3
User running program
- Gather: user statistics, CPU, disk
- Visualize: spending too much time waiting on remote accesses
- (User fixes program, gathering, visualization repeated)
- Analyze: some nodes have less throughput
- Visualize: those have other jobs running on them
- Repair: user is benchmarking so kills all extraneous processes

MDR: Example 4
Web server increasing beyond capacity
- Gather: CPU, request rate, reply latency
- Analyze: Burst lengths getting longer, latency increasing
- Visualize: Graph of burst lengths & CPU usage over time
- Repair: Order more machines, install load balancer

MDR: Fundamental Requirements
- Gathering
  - Flexible data gathering, self-describing storage
- Analyzing
  - Calculate statistical measures, identify relevant statistics.
- Notifying
  - Flexible infrequent messages to administrators or users
- Visualizing
  - Maximize information/pixel, support multiple interfaces
- Repairing
  - Automate simple repairs, support group operations

MDR: Environmental Constraints
- Change is inherent
  - Lack of Web/Mbone 5 years ago, now most/many have these.
- Problems on many time-scales
  - Second-Minute transients vs. Week-Month capacity problems
- Must operate under very adverse conditions
  - Often used when system is broken
  - Would like at least post-mortem analysis
- Need to handle hundreds to thousands of nodes
  - Scalability: All sites are getting larger, possibly wide area
  - Our system has 200 (NOW) to 2000 (Soda) nodes

MDR: Previous Systems
- Many previous systems: I've looked at about 16.
- Not comprehensive, not extensible.
- Look at a few that did a nice job of a piece:
- [Fink97]: Run test, notify display engine
  + Easy to add tests
  + Selectivity of notification good
  - Tests are just programs (redo gathering)
  - Central, non-fault tolerant solution
  - Many hard coded constants

MDR: Previous Systems, cont.
- [Hard92]: buzzerd, pager notification system
  + Flexible rules for notification
  + External interface for adding notify requests
  - Simplistic gathering
  - Poor fault tolerance
- [Pier96]: Igor group fixes
  + Flexible operations
  + Nice reporting of success/failure
  - Weak security, runs as root
  - No delegation of responsibility

MDR: Six Key Innovations (1-3)
- Replicated, semi-hierarchical, data storage nodes
  - Rendezvous point for programs
  - Handles scaling and fault-tolerance
- Self describing structures
  - Functions (visualize, summarize) + data go in database (OO)
  - DB has machine and human readable descriptions of data
- End to end notification
  - Detect problems in MDR

