FAWN: A Fast Array of Wimpy Nodes
David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, Vijay Vasudevan
22nd ACM Symposium on Operating Systems Principles (SOSP) – October 2009

Projected Electricity Usage
(figure: projected data-center electricity usage)
Source: Report to Congress on Server and Data Center Energy Efficiency – August 2007

Distributed Key-Value Stores
• Dynamo
• Cassandra
• Voldemort

(figure: power-draw comparison – 200 Watts, 410 Watts, 200 Watts)

FAWN-KV Store
(diagram: front-ends behind a switch routing Store / Lookup / Delete requests to back-end nodes A–F)

Flash
• Fast random access
  – Optimized for random reads
• Slow random writes
  – Sequential writes using an append-only log

FAWN-DS
(diagram: an in-memory hash index, where the low-order index bits of a key's hash select a bucket holding a key fragment, a valid bit, and an offset into an append-only on-disk data log; a code sketch appears at the end of the FAWN notes)

FAWN – Replication
• R = 3: each key range is replicated on a chain of three nodes (e.g., among A–F)

FAWN – Join Protocol
(diagrams: node C joins the chain B → D → E between head and tail, in two phases labeled "pre-copy" and "log flush"; a sketch follows at the end of the FAWN notes)

FAWN Cluster
(photo of the prototype cluster)
Source: http://www.cs.cmu.edu/~fawnproj

Throughput
• 1100–1700 QPS per node
• 21-node FAWN cluster, 20 GB of data, no front-end cache
Source: http://www.sigops.org/sosp/sosp09/papers/andersen-sosp09.pdf

Performance and Power (256-byte lookups)
System / Storage          QPS    Watts   Queries/Joule
Alix3c2 / Sandisk (CF)    1298    3.75   346
Desktop / Mobi (SSD)      4289   83      51.7
Desktop / HD               171   87       1.96
(Queries/Joule = QPS ÷ Watts; e.g., 1298 / 3.75 ≈ 346)

FAWN vs. Traditional Servers
Total Cost = Capital Cost + 3-Year Power Cost (at $0.10/kWh)
(figure comparing total cost across configurations)
Source: http://www.sigops.org/sosp/sosp09/slides/andersen-slides-sosp09.pdf

Discussion
• Rethink Hadoop/Dryad for FAWN
  – Read as key-value pairs in place of bulk reads
• Low-power processor vs. SSD savings
  – CPU-intensive workloads
• From RAID to FAWN
  – I/O-bound drives, memory wall
  – Flash arrays, limited power
• Log-based store
  – More efficient for frequent reads

Questions

SSD vs. HDD
                         Sandisk 5000    WD2500 HDD (250 GB, 7200 RPM)
Access time              0.1 ms          13.4 ms
Sequential read rate     28.5 MB/s
Sequential write rate    24 MB/s
Random read IOPS         1424
Random write IOPS        125
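The FAWN-DS diagram compresses a lot: index bits select a bucket, and the bucket stores a key fragment, a valid bit, and an offset into the append-only data log. Below is a minimal Python sketch of that lookup path. The bucket count, fragment width, and log record format are illustrative assumptions, not the paper's values, and real FAWN-DS handles bucket collisions rather than keeping one slot per bucket as this sketch does.

```python
import hashlib
import os

class FawnDS:
    """Minimal sketch of a FAWN-DS-style datastore (illustrative, not the
    authors' code): a small in-memory hash index maps the low-order index
    bits of a key's hash to (key fragment, valid bit, log offset); values
    live in an append-only on-disk log, so all writes are sequential."""

    INDEX_BITS = 3       # 2**3 = 8 buckets; the diagram shows a 3-bit index
    FRAGMENT_BITS = 15   # key-fragment width kept in the index (assumed)

    def __init__(self, path):
        self.log = open(path, "ab+")              # append-only data log
        self.index = [None] * (1 << self.INDEX_BITS)

    def _hash(self, key):
        return int.from_bytes(hashlib.sha1(key).digest()[:8], "big")

    def put(self, key, value):
        self.log.seek(0, os.SEEK_END)
        offset = self.log.tell()
        # log entry: key length, key, value length, value (appended only)
        self.log.write(len(key).to_bytes(2, "big") + key +
                       len(value).to_bytes(4, "big") + value)
        self.log.flush()
        h = self._hash(key)
        bucket = h & ((1 << self.INDEX_BITS) - 1)
        fragment = (h >> self.INDEX_BITS) & ((1 << self.FRAGMENT_BITS) - 1)
        # a delete would instead clear the valid bit and append a tombstone
        self.index[bucket] = (fragment, True, offset)

    def get(self, key):
        h = self._hash(key)
        slot = self.index[h & ((1 << self.INDEX_BITS) - 1)]
        if slot is None:
            return None
        fragment, valid, offset = slot
        want = (h >> self.INDEX_BITS) & ((1 << self.FRAGMENT_BITS) - 1)
        if not valid or fragment != want:
            return None          # fragment filter avoids a wasted disk read
        self.log.seek(offset)    # one random read into the data log
        klen = int.from_bytes(self.log.read(2), "big")
        stored_key = self.log.read(klen)
        vlen = int.from_bytes(self.log.read(4), "big")
        value = self.log.read(vlen)
        return value if stored_key == key else None  # fragment collision
```

The fragment check is the point of the design: the full key stays on flash, so the in-memory index stays tiny, and only a matching fragment pays for a flash read.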
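The join-protocol diagrams carry only the phase labels "pre-copy" and "log flush". The toy simulation below is one assumption-laden reading of those labels (node and method names are invented, not the paper's): copy the predecessor's data while it keeps serving writes, log writes that arrive in the meantime, replay that log, then splice the new node into the chain.

```python
class Node:
    """Toy chain-replication node for the join sketch (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.store = {}        # replicated key-value data
        self.update_log = []   # puts accepted while a join is in progress

def join(chain, new_node, after):
    pred = chain[after]
    # Phase 1: "pre-copy" - stream the predecessor's current data to the
    # joining node while the predecessor keeps serving requests.
    new_node.store = dict(pred.store)
    # A put arriving mid-join is applied normally and also logged for the joiner.
    pred.store["k2"] = "v2"
    pred.update_log.append(("k2", "v2"))
    # Phase 2: "log flush" - replay the interim updates, then splice the new
    # node into the chain so writes flow head -> ... -> tail through it.
    for k, v in pred.update_log:
        new_node.store[k] = v
    pred.update_log.clear()
    chain.insert(after + 1, new_node)

chain = [Node("B"), Node("D"), Node("E")]   # head = B, tail = E, as in the diagram
chain[0].store["k1"] = "v1"
join(chain, Node("C"), after=1)             # C joins after D
print([n.name for n in chain])              # ['B', 'D', 'C', 'E']
```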
HAYSTACK
Peter Vajgel, Doug Beaver, and Jason Sobel
Presented by: Rini Kaushik

Facebook Photo Storage Needs
• 15 billion photos, each stored as 4 images → 60 billion images, 1.5 PB (10^15 bytes)
• Growth rate: 220 million new photos/week → 25 TB of new storage per week
• Bandwidth requirements: 550,000 photos/second; assuming an average photo size of 1 MB, that is 550 GB/sec of bandwidth
• 300 million users currently; 1.3 billion people with quality internet → 4× growth possible
• Two workloads
  – Profile pictures: heavy access, smaller size
  – Photos: intermittent access, heavy in the beginning, periodic afterwards

Motivation for Haystack
• Build a specialized store for the photo workload
• Highly scalable to meet growing storage needs
• High disk bandwidth
• Reduce metadata disk IO
• Reduce Content Distribution Network (CDN) reliance
• Build from commodity servers as opposed to expensive NetApp filers ($2 million each)
• Simple key-value lookup of photos; no need for POSIX

Enter Haystack
• Generic object store
• Several photos ("needles") combined into a large 10 GB appendable file (a "haystack")
• Index file per haystack for determining needle offsets (a code sketch of this layout appears after the discussion section)

Then and Now
(diagram comparing the earlier NAS-based design with Haystack)

Storage Challenges before Haystack
• Photos stored in traditional NetApp NFS filers (network-attached storage, NAS)
• Metadata too huge to be cached
  – POSIX compliance resulted in more metadata per file
  – Each image is a file → 60 billion files → 15 TB of metadata (256-byte inodes)
• 10 disk IOs (3 with a lookup cache) per file just for metadata
  – Drastically reduces disk throughput
• No direct path from client to storage → limited bandwidth
• Result: relied heavily on CDNs to cache data to meet goals (99.98% hit rate for profile pictures, 92% for photos); the NAS served more as a backup
• Inefficient and expensive

Haystack Object
(diagram: a haystack file of needles, each with a header and footer, plus a separate per-haystack index file)

Advantages
• Reduced disk IO → higher disk throughput
  – 1 MB of in-memory secondary metadata for every 1 GB on disk; 10 TB per node → 10 GB of metadata, easily cacheable
• Simpler metadata → easier lookups
  – Not POSIX compliant
  – No directory structures or file names; a 64-bit ID instead of file names
• Single photo serving and storage layer
  – Direct IO path between client and storage → higher bandwidth
• Less metadata for XFS

Haystack Infrastructure
• Photo Store Server
  – Accepts HTTP requests
  – Performs Haystack store operations
  – Maintains the in-memory haystack index
• Haystack Object Store
• Filesystem
  – Extent-based XFS
• Storage
  – 2 × quad-core CPUs, 16–32 GB memory
  – Hardware RAID controller with 256–512 MB of NVRAM cache
  – 12+ 1 TB SATA drives

Operations (a read-path sketch appears after the discussion section)
• Upload
  – Photo assigned a 64-bit ID and scaled into 4 image sizes
  – The <profileID, photo key> → pvolID (volume ID) mapping is stored in a MySQL DB
  – The pvolID identifies the volume container of a haystack
• Read
  – Input: profileID, photo key, size, and cookie
  – Output: needle data
• Write/Modify
  – Selects a haystack to store the photo and updates the in-memory index
  – A modify results in a new version at a higher offset
• Delete

Existing Limitations / Discussion
• Ad-hoc allocation of photos to haystacks
  – If photos in the same album are added at different times by the same user, it would be good if they were placed sequentially or close together for better data locality
• No support for in-place delete/overwrite
  – May lead to many unnecessary versions and much dead data, hence reduced storage efficiency
  – Compaction seems expensive, as it involves creating a new copy of the haystack (see the sketch below); LFS has a much more sophisticated cleaning mechanism
• What happens if requests arrive at the same time?
• If a file is updated, is it guaranteed to be placed in the same haystack or a separate one? If the old haystack is already full, how will the version check work? How will the older versions get identified and deleted?
• Assumes just one disk read per photo: if XFS doesn't have the extent information in its cache, there will be an extra lookup for the file
• Once a haystack grows bigger than the largest extent size supported by XFS, extra lookups may be necessary if a needle is split across extents
• It would be good to have an abstraction at the album level as well, to reduce lookup overhead
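As a concrete reading of the needle/index design above, here is a minimal Python sketch. The header layout (key, cookie, length) and field widths are assumptions for illustration, not the paper's exact on-disk format; the point is that the in-memory index turns a photo key into one seek plus one sequential read.

```python
import os
import struct

# Assumed needle header: 64-bit photo key, 64-bit cookie, 32-bit data length.
NEEDLE_HDR = struct.Struct(">QQI")

class Haystack:
    """Minimal sketch of one haystack volume: photos ("needles") are
    appended to a single large file, and a small in-memory index maps
    photo key -> (offset, size)."""

    def __init__(self, path):
        self.f = open(path, "ab+")
        self.index = {}                     # photo key -> (offset, size)

    def write(self, key, cookie, data):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(NEEDLE_HDR.pack(key, cookie, len(data)) + data)
        self.f.flush()
        # Last write wins: a modified photo becomes a new needle at a
        # higher offset, and the index points at the newest version.
        self.index[key] = (offset, len(data))

    def read(self, key, cookie):
        entry = self.index.get(key)
        if entry is None:
            return None                     # unknown or deleted photo
        offset, size = entry
        self.f.seek(offset)
        hdr = NEEDLE_HDR.unpack(self.f.read(NEEDLE_HDR.size))
        if hdr[1] != cookie:
            return None                     # cookie guards against URL guessing
        return self.f.read(size)            # the single disk read per photo

    def delete(self, key):
        # Delete only drops the index entry; the needle's bytes stay on
        # disk until compaction copies live needles into a new haystack.
        self.index.pop(key, None)
```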
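A sketch of the read path described under Operations, reusing the Haystack class from the previous sketch; the dict standing in for the MySQL <profileID, photo key> → pvolID table, and all names here, are illustrative assumptions.

```python
volumes = {}          # pvolID -> Haystack volume (one per volume container)
photo_to_volume = {}  # (profileID, photo key) -> pvolID; stands in for MySQL

def read_photo(profile_id, photo_key, cookie):
    # Step 1: consult the (profileID, photo key) -> pvolID mapping ("MySQL").
    pvol_id = photo_to_volume.get((profile_id, photo_key))
    if pvol_id is None:
        return None
    # Step 2: the pvolID names the volume container; its in-memory index
    # turns the photo key into a single disk read for the needle data.
    return volumes[pvol_id].read(photo_key, cookie)
```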
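Finally, a sketch of the compaction step the discussion flags as expensive, again reusing the Haystack class and assumed header format from above: every live needle is re-read and re-written into a fresh file, so the cost scales with the volume of live data, not with the amount of garbage reclaimed.

```python
def compact(old, new_path):
    """Copy only live needles (those still in the in-memory index) into a
    fresh haystack; deleted photos and superseded versions are left behind."""
    new = Haystack(new_path)
    for key, (offset, size) in old.index.items():
        old.f.seek(offset)
        _, cookie, _ = NEEDLE_HDR.unpack(old.f.read(NEEDLE_HDR.size))
        new.write(key, cookie, old.f.read(size))
    return new
```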