Efficient Routing for Peer-to-Peer Overlays

Anjali Gupta, Barbara Liskov, and Rodrigo Rodrigues
MIT Computer Science and Artificial Intelligence Laboratory
{anjali,liskov,rodrigo}@csail.mit.edu

Abstract

Most current peer-to-peer lookup schemes keep a small amount of routing state per node, typically logarithmic in the number of overlay nodes. This design assumes that routing information at each member node must be kept small, so that the bookkeeping required to respond to system membership changes is also small, given that aggressive membership dynamics are expected. As a consequence, lookups have high latency as each lookup requires contacting several nodes in sequence.

In this paper, we question these assumptions by presenting two peer-to-peer routing algorithms with small lookup paths. First, we present a one-hop routing scheme. We show how to disseminate information about membership changes quickly enough so that nodes maintain accurate routing tables with complete membership information. We also deduce analytic bandwidth requirements for our scheme that demonstrate its feasibility.

We also propose a two-hop routing scheme for large scale systems of more than a few million nodes, where the bandwidth requirements of one-hop routing can become too large. This scheme keeps a fixed fraction of the total routing state on each node, chosen such that the first hop has low latency, and thus the additional delay is small.

We validate our analytic model using simulation results that show that our algorithms can maintain routing information sufficiently up-to-date such that a large fraction (e.g., 99%) of the queries will succeed without being re-routed.

1 Introduction

Structured peer-to-peer overlays like Chord [15], CAN [11], Pastry [13], and Tapestry [18] provide a substrate for building large-scale distributed applications.
These overlays allow applications to locate objects stored in the system in a limited number of overlay hops.

Peer-to-peer lookup algorithms strive to maintain a small amount of per-node routing state – typically O(log N) – because their designers expect that system membership changes frequently. This expectation has been confirmed for successfully deployed systems. A recent study [14] shows that the average session time in Gnutella is only 2.9 hours. This is equivalent to saying that in a system with 100,000 nodes, there are about 19 membership change events per second.

Maintaining small tables helps keep the amount of bookkeeping required to deal with membership changes small. However, there is a price to pay for having only a small amount of routing state per node: lookups have high latency since each lookup requires contacting several nodes in sequence.

This paper questions the need to keep routing state small. We take the position that maintaining full routing state (i.e., a complete description of system membership) is viable even in a very large system, e.g., containing a million nodes. We present techniques that show that in systems of this size, nodes can maintain membership information accurately yet the communication costs are low. The results imply that a peer-to-peer system can route very efficiently even though the system is large and membership is changing rapidly.

We present a novel peer-to-peer lookup system that maintains complete membership information at each node. We show analytic results that prove that the system meets our goals of reasonable accuracy and bandwidth usage. It is, of course, easy to achieve these goals for small systems. Our algorithm is designed to scale to large systems. Our analysis shows that we can use one-hop routing for systems of up to a few million nodes.

Our analysis also shows that beyond a few million nodes, the bandwidth requirements of the one-hop scheme become too large.
We present the design of a two-hop lookup scheme that overcomes this problem, and still provides faster lookups than existing peer-to-peer routing algorithms. We also present an analytic model of the two-hop system and conclude that its bandwidth requirements are reasonable, even for systems with tens of millions of nodes.

Finally, the paper presents simulation results that corroborate what our analytic models predict. We also show that performance does not degrade significantly as the system becomes larger or smaller, or due to aggressive system dynamics.

The rest of the paper is organized as follows. Section 2 presents our system model. Sections 3 and 4 describe our one-hop and two-hop routing schemes, respectively. Section 5 evaluates our system. We conclude with a discussion of what we have accomplished.

2 System Model

We consider a system of n nodes, where n is a large number like 10^5 or 10^6. We assume dynamic membership behavior as in Gnutella, which is representative of an open Internet environment. From the study of Gnutella and Napster [14], we deduce that systems of 10^5 and 10^6 nodes would show around 20 and 200 membership changes per second, respectively. We call this rate r. We refer to membership changes as events in the rest of the paper.

Every node in the overlay is assigned a random 128-bit node identifier. Identifiers are ordered in an identifier ring modulo 2^128. We assume that identifiers are generated such that the resulting set is uniformly distributed in the identifier space, for example, by setting a node's identifier to be the cryptographic hash of its network address. Every node has a predecessor and a successor in the identifier ring, and it periodically sends keep-alive messages to these nodes.

Similarly, each item has a key, which is also an identifier in the ring. Responsibility for an item (e.g., providing storage for it) rests with its successor; this is the first node in the identifier ring clockwise from key.
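The key-to-successor mapping just described can be sketched in a few lines of Python. This is an illustrative sketch, not code from the paper: the hash function, addresses, and toy ring size are assumptions.

```python
import bisect
import hashlib

RING = 2 ** 128  # identifiers live on a ring modulo 2^128

def node_id(address):
    # Identifier = cryptographic hash of the node's network address,
    # which gives a roughly uniform spread over the ring.
    digest = hashlib.sha1(address.encode()).digest()
    return int.from_bytes(digest, "big") % RING

def successor(key, ids):
    # First node identifier clockwise from `key` on the ring;
    # `ids` is the sorted list of current member identifiers.
    i = bisect.bisect_left(ids, key)
    return ids[i % len(ids)]  # wrap past the largest id back to the smallest

# Toy ring of 8 nodes; an item's key is hashed the same way.
ids = sorted(node_id("10.0.0.%d:4000" % i) for i in range(8))
owner = successor(node_id("some-item"), ids)
assert owner in ids
```

The node responsible for an item is `successor(key, ids)`, following the clockwise rule in the text.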
This mapping from keys to nodes is based on the one used in Chord [15], but changing our system to use other mappings is straightforward.

Clients issue queries that try to reach the successor node of a particular identifier. We intend our system to satisfy a large fraction, f, of the queries correctly on the first attempt (where each attempt requires one or two hops, depending on which scheme we use). Our goal is to support high values of f, e.g., f = 0.99. A query may fail in its first attempt due to a membership change, if the notification of the change has not reached the querying node. In such a case, the query can still be rerouted and succeed in a higher number of hops. Nevertheless, we define failed queries as those that are not answered correctly in the first attempt, as our objective is to have one- or two-hop lookups, depending on which algorithm we use.

3 One Hop Lookups

This section presents the design and analysis of our one-hop scheme. In this scheme, every node maintains a full routing table containing information about every other node in the overlay. The actual query success rate depends on the accuracy of this information.

Section 3.1 describes how the algorithm handles membership changes, namely how to convey information about these changes to all the nodes in the ring. Section 3.2 explains how the algorithm reacts to node failures and presents an informal correctness argument for our approach. Section 3.3 discusses issues about asymmetry in the load of individual nodes. Section 3.4 presents an analysis of the bandwidth requirements of this scheme.

3.1 Membership Changes

Membership changes (i.e., nodes joining and leaving the ring) raise two important issues that our algorithm must address. First, we must update local information about the membership change, in order for each node in the system to determine precisely which interval in the id space it is responsible for.
The second issue is conveying information about the change to all the nodes in the ring so that these nodes will maintain correct information about the system membership and consequently manage to route in a single hop.

To maintain correct local information (i.e., information about each node's successor and predecessor node), every node n runs a stabilization routine periodically, wherein it sends keep-alive messages to its successor s and predecessor p. Node s checks if n is indeed its predecessor, and if not, it notifies n of the existence of another node between them. Similarly p checks if n is indeed its successor, and if not it notifies n. If either of s or p does not respond, n pings it repeatedly until a time-out period when it decides that the node is unreachable or dead.

A joining node contacts another system node to get its view of the current membership; this protocol is similar to the Chord protocol [15, 16]. The membership information enables it to get in touch with its predecessor and successor, thus informing them of its presence.

To maintain correct full routing tables, notifications of membership change events, i.e., joins and leaves, must reach every node in the system within a specified amount of time (depending on what fraction of failed queries, i.e., f, is deemed acceptable). Our goal is to do this in a way that has low notification delay yet reasonable bandwidth consumption, since bandwidth is likely to be the scarcest resource in the system.

Figure 1. Flow of event notifications in the system (legend: slice leader, unit leader, ordinary node)

We achieve this goal by superimposing a well-defined hierarchy on the system. This hierarchy is used to form dissemination trees, which are used to propagate event information.

We impose this hierarchy on a system with dynamic membership by dividing the 128-bit circular identifier space into k equal contiguous intervals called slices.
The i-th slice contains all nodes currently in the overlay whose node identifiers lie in the range [i · 2^128/k, (i + 1) · 2^128/k). Since nodes have uniformly distributed random identifiers, these slices will have about the same number of nodes at any time. Each slice has a slice leader, which is chosen dynamically as the node that is the successor of the mid-point of the slice identifier space. For example, the slice leader of the i-th slice is the successor node of the key (i + 1/2) · 2^128/k. When a new node joins the system it learns about the slice leader from one of its neighbors along with other information like the data it is responsible for and its routing table.

Similarly, each slice is divided into equal-sized intervals called units. Each unit has a unit leader, which is dynamically chosen as the successor of the mid-point of the unit identifier space.

Figure 1 depicts how information flows in the system. When a node (labeled X in Figure 1) detects a change in membership (its successor failed or it has a new successor), it sends an event notification message to its slice leader (1). The slice leader collects all event notifications it receives from its own slice and aggregates them for t_big seconds before sending a message to other slice leaders (2). To spread out bandwidth utilization, communication with different slice leaders is not synchronized: the slice leader ensures only that it communicates with each individual slice leader once every t_big seconds. Therefore, messages to different slice leaders are sent at different points in time and contain different sets of events.

The slice leaders aggregate messages they receive for a short time period t_wait and then dispatch the aggregate message to all unit leaders of their respective slices (3).
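Because slice leaders are defined positionally (the successor of the slice's midpoint), any node can compute them locally from its membership table. A minimal sketch under that rule; the helper names are ours, not the paper's:

```python
import bisect

RING = 2 ** 128  # size of the identifier space

def successor(key, ids):
    # First node id clockwise from `key`; `ids` is sorted ascending.
    i = bisect.bisect_left(ids, key)
    return ids[i % len(ids)]

def slice_index(node, k):
    # Index i of the slice [i * RING/k, (i+1) * RING/k) containing `node`.
    return node * k // RING

def slice_leader(i, k, ids):
    # Leader of slice i: successor of the slice midpoint (i + 1/2) * RING/k.
    midpoint = (2 * i + 1) * RING // (2 * k)
    return successor(midpoint, ids)
```

Unit leaders follow the same pattern, with the midpoint taken over the unit's sub-interval of its slice.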
A unit leader piggybacks this information on its keep-alive messages to its successor and predecessor (4).

Other nodes propagate this information in one direction: if they receive information from their predecessors, they send it to their successors and vice versa. The information is piggy-backed on keep-alive messages. In this way, all nodes in the system receive notification of all events, but within a unit information is always flowing from the unit leader to the ends of the unit. Nodes at unit boundaries do not send information to their neighboring nodes outside their unit. As a result, there is no redundancy in the communications: a node will get information only from its neighbor that is one step closer to its unit leader.

We get several benefits from choosing this design. First, it imposes a structure on the system, with well-defined event dissemination trees. This structure helps us ensure that there is no redundancy in communications, which leads to efficient bandwidth usage.

Second, aggregation of several events into one message allows us to avoid small messages. Small messages are a problem since the protocol overhead becomes significant relative to the message size, leading to higher bandwidth usage. This effect will be analyzed in more detail in Section 3.4.

Our scheme is a three-level hierarchy. The choice of the number of levels in the hierarchy involves a tradeoff: a large number of levels implies a larger delay in propagating the information, whereas a small number of levels generates a large load at the nodes in the upper levels. We chose a three-level hierarchy because it has low delay, yet bandwidth consumption at top level nodes is reasonable.

3.2 Fault Tolerance

If a query fails on its first attempt it does not return an error to an application. Instead, queries can be rerouted. If a lookup query from node n1 to node n2 fails because n2 is no longer in the system, n1 can retry the query by sending it to n2's successor.
If the query failed because a recently joined node, n3, is the new successor for the key that n1 is looking up, n2 can reply with the identity of n3 (if it knows about n3), and n1 can contact n3 in a second routing step.

Since our scheme is dependent on the correct functioning of slice leaders, we need to recover from their failure. Since there are relatively few slice leaders, their failures are infrequent. Therefore, we do not have to be very aggressive about replacing them in order to maintain our query success target. When a slice or unit leader fails, its successor soon detects the failure and becomes the new leader.

Between the time a slice or unit leader fails, and a new node takes over, some event notification messages may be lost, and the information about those membership changes will not be reflected in the system nodes' membership tables. This is not an issue for routing correctness, since each node maintains correct information about its predecessor and successor. It will, however, lead to more routing hops, and if we allowed these errors to accumulate, it would eventually lead to a degradation of the one hop lookup success rate.

To avoid this accumulation, we use the lookups themselves to detect and propagate these inaccuracies. When a node performs a lookup and detects that its routing entry is incorrect (i.e., the lookup timed out, or was re-routed to a new successor), this new information is then pushed to all the system nodes via the normal channels: it notifies its slice leader about the event.

The correctness of our protocols is based on the fact that successor and predecessor pointers are correct. This ensures that, even if the remainder of the membership information contains errors, the query will eventually succeed after re-routing.
In other words, our complete membership description can be seen as an optimization to following successor pointers, in the same way as Chord fingers are an optimization to successors (or similarly for other peer-to-peer routing schemes). Furthermore, we can argue that our successor and predecessor pointers are correct due to the fact that we essentially follow the same protocol as Chord to maintain these, and this has already been proven correct [16].

3.3 Scalability

Slice leaders have more work to do than other nodes, and this might be a problem for a poorly provisioned node with a low bandwidth connection to the Internet. To overcome this problem we can identify well connected and well provisioned nodes as "supernodes" on entry into the system (as in [17]). There can be a parallel ring of supernodes, and the successor (in the supernode ring) of the midpoint of the slice identifier space becomes the slice leader. We do require a sufficient number of supernodes to ensure that there are at least a few per slice.

As we will show in Section 3.4, bandwidth requirements are small enough to make most participants in the system potential supernodes in a 10^5 sized system (in such a system, slice leaders will require 35 kbps upstream bandwidth). In a million-node system we may require supernodes to be well-connected academic or corporate users (the bandwidth requirements increase to 350 kbps). Section 4 presents the two-hop scheme that may be required when we wish the system to accommodate even larger memberships.

3.4 Analysis

This section presents an analysis of how to parameterize the system to satisfy our goal of fast propagation. To achieve our desired success rate, we need to propagate information about events within some time period t_tot; we begin this section by showing how to compute this quantity. Yet we also require good performance, especially with respect to bandwidth utilization.
Later in the section we show how we satisfy this requirement by controlling the number of slices and units.

Our analysis considers only non-failure situations. It does not take into account overheads of slice and unit leader failure because these events are rare. It also ignores message loss and delay since this simplifies the presentation, and the overhead introduced by message delays and retransmissions is small compared to other costs in the system.

Our analysis assumes that query targets are distributed uniformly throughout the ring. It is based on a worst case pattern of events, queries, and notifications: we assume all events happen just after the last slice-leader notifications, and all queries happen immediately after that, so that none of the affected routing table entries has been corrected and all queries targeted at those nodes (i.e., the nodes causing the events) fail. In a real deployment, queries would be interleaved with events and notifications, so fewer of them would fail.

This scenario is illustrated by the timeline in Figure 2. Here t_wait is the frequency with which slice leaders communicate with their unit leaders, t_small is the time it takes to propagate information throughout a unit, and t_big is the time a slice leader waits between communications to some other slice leader.

Figure 2. Timeline of the worst case situation (1: all events; 2: all queries; 3: responsible slices know; 4: all slice leaders know; 5: all unit leaders know; 6: all nodes know)

Within t_wait + t_small seconds (point 3), slices in which the events occurred all have correct entries for nodes affected by the respective events. After t_big seconds of the events (point 4), slice leaders notify other slice leaders. Within a further t_wait + t_small seconds (point 6), all nodes in the system receive notification about all events.

Thus, t_tot = t_detect + t_wait + t_small + t_big.
The quantity t_detect represents the delay between the time an event occurs and when the leader of that slice first learns about it.

3.4.1 Configuration Parameters

The following parameters characterize a system deployment:

1. f is the acceptable fraction of queries that fail in the first routing attempt.
2. n is the expected number of nodes in the system.
3. r is the expected rate of membership changes in the system.

Given these parameters, we can compute t_tot. Our assumption that query targets are distributed uniformly around the ring implies that the fraction of failed queries is proportional to the expected number of incorrect entries in a querying node's routing table. Given our worst case assumption, all the entries concerning events that occurred in the last t_tot seconds are incorrect and therefore the fraction of failed queries is r × t_tot / n. Therefore, to ensure that no more than a fraction f of queries fail we need:

t_tot ≤ f × n / r

For a system with 10^6 nodes, with a rate of 200 events/s, and f = 1%, we get a time interval as large as 50 s to propagate all information. Note also that if r is linearly proportional to n, then t_tot is independent of n. It is only a function of the desired success rate.

3.4.2 Slices and Units

Our system performance depends on the number of slices and units:

1. k is the number of slices the ring is divided into.
2. u is the number of units in a slice.

Parameters k and u determine the expected unit size. This in turn determines t_small, the time it takes for information to propagate from a unit leader to all members of a unit, given an assumption about h, the frequency of keep-alive probes. From t_small we can determine t_big from our calculated value for t_tot, given choices of values for t_wait and t_detect. (Recall that t_tot = t_detect + t_big + t_wait + t_small.)

To simplify the analysis we will choose values for h, t_detect, and t_wait.
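The bound t_tot ≤ f × n / r is simple arithmetic; a quick check with the example numbers above (the function name is ours, not the paper's):

```python
def max_propagation_time(f, n, r):
    # Worst-case failed fraction is r * t_tot / n, so keeping it at or
    # below f requires t_tot <= f * n / r.
    return f * n / r

# 10^6 nodes, 200 events/s, 1% acceptable failures -> 50 s to propagate.
print(max_propagation_time(0.01, 10**6, 200))  # 50.0
# With churn proportional to size (10^5 nodes, 20 events/s) the bound is
# unchanged, illustrating that t_tot is independent of n.
print(max_propagation_time(0.01, 10**5, 20))   # 50.0
```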
As a result our analysis will be concerned with just two independent variables, k and u, given a particular choice of values for n, r, and f. We will use one second for both h and t_wait. This is a reasonable decision since the amount of data being sent in probes and messages to unit leaders is large enough to make the overhead in these messages small (e.g., information about 20 events will be sent in a system with 10^5 nodes). Note that with this choice of h, t_small will be half the unit size. We will use three seconds for t_detect to account for the delay in detecting a missed keep-alive message and a few probes to confirm the event.

3.4.3 Cost Analysis

Our goal is to choose values for k and u in a way that reduces bandwidth utilization. In particular we are concerned with minimizing bandwidth use at the slice leaders, since they have the most work to do in our approach.

Bandwidth is consumed both to propagate the actual data, and because of the message overhead. m bytes will be required to describe an event, and the overhead per message will be v.

There are four types of communication in our system.

1. Keep-alive messages: Keep-alive messages form the base level communication between a node and its predecessor and successor. These messages include information about recent events. As described in Section 3.1, our system avoids sending redundant information in these messages by controlling the direction of information flow (from unit leader to unit members) and by not sending information across unit boundaries.