Content-Based Overlays
Ken Birman
Cornell University, CS5410, Fall 2008

Content filtering
- There are two kinds of publish-subscribe:
- Topic-based: a topic defines the group of receivers. Some systems let you subscribe to a pattern that matches sets of topics (via a special "topics" meta-topic), but this is still topic-oriented. For scaling, the topics typically must be mapped to a smaller set of multicast groups or overlays.
- Content-based: a query determines the messages that each receiver will accept. This can be implemented in a database or in an overlay.

Challenges…
- Each approach has substantial challenges.
- For topic-based systems, the "channelization" problem (mapping many topics to a small number of multicast channels or overlays) is very hard. In the most general cases, channelization is NP-complete! Yet some form of channelization may be critical, because few multicast mechanisms scale well when huge numbers of groups are needed.
- Today we won't look closely at the channelization problem, but may revisit it later if time permits. Under some conditions it may be solvable.

Challenges…
- What about content-based solutions?
- We need to ask how to express queries "on content". We could use XQuery, the new XML query language, or define a special-purpose packet-inspection solution, a so-called "deep packet inspector".
- Then we would ideally want to build a smart overlay: any given packet routes toward its destinations, and any given router is optimized so that its work is not proportional to the number of pending content queries.

Scenarios
- When would content routing be helpful?
- In cloud systems, we often want to route a request to some system that processed prior work of a related nature. For example, if I interact with Premier Cru to purchase 2007 Rhone red wines, as I query their
data center, it could build up a cache of data. If my queries revisit the same nodes, they perform far better.
- In (unpublished) work at Amazon.com, the company found that almost every service has "opinions" about how to route messages within service clusters!

Scenarios
- What about out in the wild? Here, imagine using content filtering as a way to query huge sets of RSS feeds. The user expresses "interests", and these map to content queries which route exactly the right stuff to him or her.
- The IBM Gryphon project used this model and assumed that clients would be corporate users (often stock traders).
- Siena: a similar model, but it assumes more of a P2P community in the Internet WAN.

Things known about settings?
- All of these settings are very different.
- Amazon's world is dominated by machine-controlled layout algorithms that selectively place services on clusters. This produces all sorts of "regularities": e.g., clones of a service often subscribe to the same data, and if A0 and B0 are collocated on node X, representatives of A and B will probably always be collocated.
- IBM's world is dominated by heavy-tailed interest behaviors: traders specialize in various ways.
- The Siena world is more like a web search stream.

Examples of issues raised
- Early work on IBM's Gryphon platform focused on in-network aggregation of the queries.
- They assumed that each message has an associated set of tags (attached by the sender for efficiency). A subscription was a predicate over these tags.
- Their focus was on combining the predicates, in the network, to avoid redundant work.
- They got good results and even sold Gryphon as a product.
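The Gryphon model just described — sender-attached tags, subscriptions as predicates over them, and in-network combining of predicates to avoid redundant work — can be sketched roughly as follows. This is an illustrative sketch only; the message format and helper names are hypothetical, not Gryphon's actual API:

```python
# Sketch of tag-based content filtering in the Gryphon style (hypothetical API).
# Each message carries a set of sender-attached tags; each subscription is a
# predicate over those tags. A router forwards a message on a link only if
# some downstream subscription matches, so it can OR-combine the predicates
# for that link instead of evaluating each one against every message.

def combine(predicates):
    """Merge several subscriptions into one per-link filter (logical OR)."""
    return lambda tags: any(p(tags) for p in predicates)

# Two hypothetical subscriptions from downstream clients.
wants_ibm = lambda tags: "stock" in tags and "IBM" in tags
wants_tech = lambda tags: "tech" in tags

link_filter = combine([wants_ibm, wants_tech])

msg_tags = {"stock", "IBM", "earnings"}
if link_filter(msg_tags):
    print("forward message downstream")
```

The point of the combining step is that the router does one test per outgoing link, not one test per pending subscription — which is exactly the per-router work bound the overlay design aims for.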
But…

Thought question
- How often would you "expect" to have an opportunity to do in-network query combinations?
- Would you prefer an in-network solution, like Gryphon, or a database solution like Cornell's Cayuga, where events can also be stored?

… and the answer is
- For IBM's corporate clients, there most often turned out to be just a single Gryphon router per data center, with WAN links between them.
- In effect: broadcast every event to all data centers, then filter at the last hop before delivery to client nodes. This happens because "most" topics were of interest to at least someone in "most" data centers.
- It turns out that the router was fast enough for this model. So all that in-network query-combination work was unneeded in most client settings!

… and the rest of the answer?
- The majority of users had some form of archival storage unit in each data center. It subscribes to everything and keeps copies. So in effect, the average user "turned Gryphon into something much like Cayuga".
- Given this insight, Cayuga assumes full broadcast for event streams and focuses on a database model with rapid update rates. A more natural solution…
- Benefit? A single integrated story, versus one network story coupled to a distinct database solution.

What about Amazon?
- Amazon has lots of packet-inspection routers that peek inside data quickly and forward it as appropriate, customized on a per-service basis.
- There are many packet formats… hence little commonality between these inspection "applets".
- This motivates Cornell's current work on "featherweight processes" that inspect packets at line speed and exploit properties of multicore machines for scalability.

Taking us to… Siena
- Relatively popular: a claimed user community of a few hundred thousand downloads, perhaps a few thousand of whom actually use the system.
- Little is known about the actual users.
- Today we'll look at a slide set generously provided by the development team.

Remainder of today's talk
- We'll dive down to look closely at Siena. Covering all three scenarios is just more than we have time to do.
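As a preview of where we are headed: Siena expresses a subscription as a filter, a conjunction of constraints over an event's attributes. A minimal, simplified sketch of that matching model (illustrative only, not Siena's actual API):

```python
# Simplified sketch of Siena-style content-based matching (hypothetical names).
# An event (notification) is a set of attribute/value pairs; a filter is a
# conjunction of (attribute, operator, value) constraints, and an event
# matches when every constraint is satisfied.

OPS = {
    "=": lambda a, b: a == b,
    "<": lambda a, b: a < b,
    ">": lambda a, b: a > b,
    "prefix": lambda a, b: str(a).startswith(str(b)),
}

def matches(event, constraints):
    """True iff the event satisfies every constraint in the filter."""
    return all(
        attr in event and OPS[op](event[attr], value)
        for attr, op, value in constraints
    )

event = {"class": "finance/quote", "symbol": "IBM", "price": 112.5}
quote_filter = [("class", "prefix", "finance"), ("price", "<", 120)]
print(matches(event, quote_filter))
```

The interesting systems question, which the Siena slides take up, is how routers can aggregate such filters so that forwarding work stays bounded as subscriptions grow — the same concern we saw with Gryphon.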