Distributed systems [Fall 2009] G22.3033-001
Know your staff
Important addresses
This class will teach you …
Who should take this class?
Course readings
Course structure
How are you evaluated?
Questions?
What are distributed systems?
Why distributed systems? for ease-of-use
Why distributed systems? for availability
Why distributed systems? for scalable capacity
Why distributed systems? for modular functionality
Challenges
Challenges (continued)
A word of warning
Topics in this course
Case Study: Distributed file system
A simple distributed FS design
Topic: System Design
Topic: Consistency
Topic: Fault Tolerance
Topic: Security
Topic: Implementation
Intro to programming Lab: Yet Another File System (yfs)
YFS is inspired by Frangipani
Frangipani Design
Slide 29
Frangipani security
Frangipani server implements FS logic
Concurrent accesses cause inconsistency
Solution: use a lock service to synchronize access
Putting it together
NFS (or AFS) architecture
NFS messages for reading a file
Why use file handles in NFS msg, not file names?
Frangipani vs. NFS
YFS: simplified Frangipani
Lab schedule
L1: lock server

Distributed systems [Fall 2009]
G22.3033-001
Lec 1: Course Introduction & Lab Intro

Know your staff
• Instructor: Prof. Jinyang Li (me)
  – [email protected]
  – Office Hour: Tue 5-6pm (715 Bway Rm 708)
• TA: Bonan Min
  – [email protected]
  – Office Hour: Tue 3-4pm (715 Bway Rm 705)

Important addresses
• Class webpage: http://www.news.cs.nyu.edu/~jinyang/fa09
  – Check regularly for announcements, reading questions
• Sign up for class mailing list [email protected]
  – We will email announcements using this list
  – You can also email the entire class to ask questions, share information, find project members, etc.
• Staff mailing list includes just me and Bonan: [email protected]

This class will teach you …
• Basic tools of distributed systems
  – Abstractions, algorithms, implementation techniques
  – System designs that worked
• Build a real system!
  – Synthesize ideas from many areas to build working systems
• Your (and my) goal: address new (unsolved) system challenges

Who should take this class?
• Prerequisites:
  – Undergrad OS
  – Programming experience in C or C++
• If you are not sure, do Lab1 ASAP.

Course readings
• No official textbook
• Lectures are based on (mostly) research papers
  – Check webpage for schedules
• Useful reference books
  – Principles of Computer System Design (Saltzer and Kaashoek)
  – Distributed Systems (Tanenbaum and van Steen)
  – Advanced Programming in the UNIX Environment (Stevens)
  – UNIX Network Programming (Stevens)

Course structure
• Lectures
  – Read assigned papers before class
  – Answer reading questions, hand in answers in class
  – Participate in class discussion
• 8 programming Labs
  – Build a networked file system with detailed guidance!
• Project (workload is equivalent to 2 labs)
  – Extend the lab file system in any way you like!

How are you evaluated?
• Class participation 10%
  – Participate in discussions, hand in answers
• Labs 45%
• Project 15%
  – In teams of 2 people
  – Demo in last class
  – Short paper (<=4 pages)
• Quizzes 30%
  – Mid-term and final (90 minutes each)

Questions?

What are distributed systems?
• Examples?
[Figure: multiple hosts; a network cloud]
Hosts cooperate to provide a unified service

Why distributed systems? for ease-of-use
• Handle geographic separation
• Provide users (or applications) with location transparency:
  – Web: access information with a few “clicks”
  – Network file system: access files on remote servers as if they are on a local disk, share files among multiple computers

Why distributed systems? for availability
• Build a reliable system out of unreliable parts
  – Hardware can fail: power outage, disk failures, memory corruption, network switch failures…
  – Software can fail: bugs, mis-configuration, upgrade …
  – To achieve 0.999999 availability, replicate data/computation on many hosts with automatic failover

Why distributed systems? for scalable capacity
• Aggregate resources of many computers
  – CPU: Dryad, MapReduce, Grid computing
  – Bandwidth: Akamai CDN, BitTorrent
  – Disk: Frangipani, Google file system

Why distributed systems? for modular functionality
• Only need to build a service to accomplish a single task well.
  – Authentication server
  – Backup server

Challenges
• System design
  – What is the right interface or abstraction?
  – How to partition functions for scalability?
• Consistency
  – How to share data consistently among multiple readers/writers?
• Fault Tolerance
  – How to keep system available despite node or network failures?

Challenges (continued)
• Different deployment scenarios
  – Clusters
  – Wide area distribution
  – Sensor networks
• Security
  – How to authenticate clients or servers?
  – How to defend against or audit misbehaving servers?
• Implementation
  – How to maximize concurrency?
  – What’s the bottleneck?
  – How to reduce load on the bottleneck resource?

A word of warning
“A distributed system is a system in which I can’t do my work because some computer that I’ve never even heard of has failed.”
-- Leslie Lamport

Topics in this course

Case Study: Distributed file system
[Figure: Server(s), Client 1, Client 2, Client 3]
A distributed file system provides:
• location transparent file accesses
• sharing among multiple clients
  $ echo "test" > f2
  $ ls /dfs
  f1 f2
  $ ls /dfs
  f1 f2
  $ cat f2
  test

A simple distributed FS design
• A single server stores all data and handles clients’ FS requests.
[Figure: Client 1, Client 2, Client 3]

Topic: System Design
• What is the right interface?
  – Possible interfaces of a storage system
    • Disk
    • File system
    • Database
• Can the system handle a large user population?
  – E.g. all NYU students and faculty
• How to store petabytes of data?
  – Has to use more than 1 server
  – Idea: partition users across different servers

Topic: Consistency
• When C1 moves file f1 from /d1 to /d2, do other clients see intermediate results?
• What if both C1 and C2 want to move f1 to different places?
• To reduce network load, cache data at C1
  – If C1 updates f1 to f1’, how to ensure C2 reads f1’ instead of f1?

Topic: Fault Tolerance
• How to keep the system running when some file server is down?
  – Replicate data at multiple servers
• How to update replicated data?
• How to fail-over among replicas?
• How to maintain consistency across reboots?

Topic: Security
• Adversary can manipulate messages
  – How to authenticate?
• Adversary may compromise machines
  – Can the FS remain correct despite a few compromised nodes?
  – How to audit for past
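
The caching question raised under Topic: Consistency (if C1 updates f1 to f1’, how does C2 avoid reading the old f1?) can be made concrete with a small Python sketch. Everything here is a hypothetical illustration and is not taken from the slides or the labs: the `FileServer` and `CachingClient` classes, the per-file version counter, and the `validate` flag are all assumptions. It shows just one possible approach, in which a client asks the server for the current version before trusting its cached copy.

```python
class FileServer:
    """Hypothetical single server: maps file name -> (version, data)."""

    def __init__(self):
        self._files = {}

    def write(self, name, data):
        # Bump the version on every write so that clients can detect changes.
        version = self._files.get(name, (0, None))[0] + 1
        self._files[name] = (version, data)
        return version

    def read(self, name):
        return self._files[name]

    def version(self, name):
        return self._files[name][0]


class CachingClient:
    """Hypothetical client that caches file contents locally."""

    def __init__(self, server, validate=True):
        self.server = server
        self.validate = validate  # assumption: toggles the version check
        self._cache = {}

    def read(self, name):
        cached = self._cache.get(name)
        if cached is not None:
            # Without validation the cached copy is trusted blindly.
            if not self.validate or cached[0] == self.server.version(name):
                return cached[1]
        # Cache miss or stale entry: fetch the current copy from the server.
        entry = self.server.read(name)
        self._cache[name] = entry
        return entry[1]

    def write(self, name, data):
        version = self.server.write(name, data)
        self._cache[name] = (version, data)


server = FileServer()
c1 = CachingClient(server)
c2_naive = CachingClient(server, validate=False)
c2_checked = CachingClient(server, validate=True)

c1.write("f1", "old")
c2_naive.read("f1")      # both C2 variants now hold "old" in their caches
c2_checked.read("f1")
c1.write("f1", "new")    # C1 updates f1 to f1'

stale = c2_naive.read("f1")    # served from the cache without checking
fresh = c2_checked.read("f1")  # version mismatch forces a re-fetch
print(stale, fresh)
```

In this sketch the validating client pays one extra round trip to the server on every cached read, which gives back part of the network savings that caching was meant to provide. Other designs, such as server-issued invalidations or locks, trade off differently.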