[New Zealand Digital Library] [Computer Science Technical Reports] [Query Results] ---------------------------------------------------------------------------- Primary{Backup Protocols: Lower Bounds and Optimal Implementations Navin Budhiraja? Keith Marzullo? Fred B. Schneidery Sam Touegz Department of Computer Science Cornell University Ithaca NY 14853, USA Abstract We present a precise specification of the primary{backup approach. Then, for a variety of different failure models we prove lower bounds on the degree of replication, failover time, and worst-case blocking time for client requests. Finally, we outline primary{backup protocols and indicate which of our lower bounds are tight. Keywords: Fault-tolerance, reliability, availability, primary{backup, lower bounds, optimal protocols. 1 Introduction One way to implement a fault-tolerant service is by using multiple servers that fail independently. The state of the service is replicated and distributed among these servers, and updates are coordinated so that even when a subset of servers fail, the service remains available. Such fault-tolerant services are generally structured in one of two ways. One approach is to replicate the service state at all servers and to present all client requests, in the same order, to all non-faulty servers. This service architecture is commonly called active replication or the ?Supported by Defense Advanced Research Projects Agency (DoD) under NASA Ames grant number NAG 2{593 and by grants from IBM, Siemens, and Xerox. Budhiraja is also supported by an IBM Graduate Fellowship. The views, opinions, and findings contained in this report are those of the authors and should not be construed as an official Department of Defense position, policy, or decision. ySupported in part by the Office of Naval Research under contract N00014-91-J-1219, the National Science Foundation under Grant No. CCR-8701103, DARPA/NSF Grant No. CCR-9014363, and by a grant from IBM Endicott Programming Laboratory. zSupported in part by NSF grants CCR-8901780 and CCR-9102231 and by a grant from IBM Endicott Programming Laboratory. 1 ---------------------------------------------------------------------------- state machine approach [22] and has been widely studied from both theoretical and practical viewpoints (e.g., [9, 11, 19]). The other approach to building replicated services is to designate one server as the primary and all the others as backups. Clients make requests by sending messages only to the primary. If the primary fails, then a failover occurs and one of the backups takes over. This service architecture is commonly called the primary-backup or the primary-copy approach [1] and has been widely used in commercial fault-tolerant systems. However, the approach has not been analyzed nearly as extensively as the state machine approach. Little is known of its costs and tradeoffs, the degree of replication required, or the worst-case response time for various failure models. In this paper, we derive some of these tradeoffs. For example, in some primary{backup protocols [15] the number of servers used is more than twice the number of failures to be tolerated. We are now able to explain this phenomenon by showing that the number of servers needed depends on the failure model. With both active replication and the primary-backup approach, the goal is to provide a client with the illusion of a service that is implemented by a single server, despite failures. The key difference between active replication and the primary-backup is how each handles failures. With active replication, the effects of failures are completely masked by voting, and the service implemented is indistinguishable from a single non-faulty server. With the primary-backup approach, a request to the service can be lost if it is sent to a faulty primary.1 Thus, clients can observe the effects of failures. However, the periods during which requests can be lost are bounded by the length of time that can elapse between failure of the primary and takeover by a backup. Such behavior is an instance of what we call bofo (bounded outage finitely often). To formulate the notion of a bofo server, define a server outage to occur at time t if some client makes a server request at that time but never receives a response to that request.2 In a (k; ?){bofo server, all server outages can be grouped into at most k intervals of time, with each interval having length at most ?. Accordingly, even though some requests made to a bofo service (that is, a service that implements the abstraction of a bofo server) will be lost, this number is limited. Note that if clients of a service are restricted to send requests only to one server, then it is not possible to implement a specification that is stronger than bofo. This is because if the client sends a request to a (single) server and that server subsequently crashes, then the request can be lost and will not be processed. This paper gives lower bounds for various costs associated with implementing a bofo service by using the primary-backup approach. These lower bounds depend on message delivery delay and the class of failures to be tolerated. These bounds characterize the degree of replication, the time during which the service can be without a primary, and the amount of time it can take to respond to client requests (blocking time). In some cases, our results are surprising. For example, more than f + 1 servers are necessary to tolerate f failures of certain types (crash and link failures, receive-omission failures, or general-omission failures). Also, we have proved that if a majority of the servers can be faulty, then any primary{ backup protocol for receive{omission failures will have a run in which a non-faulty primary is is forced to let a faulty server become the primary in its place. Finally, we outline some 1Of course, the client can subsequently resend a copy of that request to the new primary. 2For simplicity, we assume in this paper that every request elicits a response. ---------------------------------------------------------------------------- primary{backup protocols. This allows us to determine which of our lower bounds are tight. The paper is organized as follows. Section 2 gives a precise specification of the primarybackup approach. Section 3 describes the system model we consider. Section 4 discusses lower bounds, and in Section 5 we outline our protocols and state which of our bounds are tight. We conclude in Section 6. 2 Specification of the Primary{Backup Approach Since we wish to derive lower bounds, we must first give a precise specification of primarybackup that is general enough to satisfy any protocol one would characterize as being primary-backup. The following four properties do this. The first property states that no more than one server can be the primary at any time. Pb1: There exists a local predicate Prmys on the state of each server s. At any time, there is at most one server s whose state satisfies Prmys.3 For brevity, whenever we say that \s is the primary (at time t)" we mean that the state of s satisfies Prmys (at time t). We define the failover time of a service to be the longest period of time during which Prmys is not true for any s. Property Pb2 distinguishes the primary-backup approach from active replication, where each client broadcasts its request to all the servers. Pb2: Each client i maintains a server identity Desti such that to make a request, client i sends a message (only) to Desti. We assume that requests sent to a server s are enqueued in a message queue at s. Pb3: If a client request arrives at a server that is not the current primary, then that request is not enqueued (and therefore is not processed). Properties Pb1{Pb3 specify a protocol for client interactions with a service, but not the obligations of the service. For example, the properties do not rule out a primary that ignores all requests. A fourth property eliminates such trivial implementations by stipulating that the service implements a single bofo server for some values of k and ?: Pb4: There exist fixed values k and ? such that the service behaves like a single (k; ?){bofo server. We believe that the above four properties characterize a primary-backup approach and have checked that many primary-backup protocols in the literature (e.g. [1, 3, 4, 7]) do satisfy this characterization. Note that Pb4 is not implementable if the number of failures (that is, the number of servers and communication components that fail) can not be bounded a priori. This is because an unbounded number of servers would be required to implement the service. In a practical system, one can implement service outages of bounded lengths by bounding the rate of failures and allowing reintegration of recovered servers and communication links. We do not address failure rates or reintegration in this paper. 3The protocol of [15] allows concurrent primaries, but only for bounded periods. If one replaces Pb1 by this weaker property, then except for the bounds on failover times, the bounds shown in Section 4 continue to hold. ---------------------------------------------------------------------------- 3 1 c p1 p2 2 Figure 1: A Simple Primary{Backup Protocol. 2.1 A Simple Primary{Backup Protocol As an example of a service based on the primary{backup approach, consider the following protocol, which tolerates crash failures of a single server. Assume that all communication is over point-to-point non-faulty links and that each link has an upper bound ffi on message delivery time.4 Refer to Figure 1. There is a primary server p1 and a backup server p2 connected by a communications link. A client C initially sends all requests to p1 (indicated by the arrow labeled 1 in the figure). Whenever p1 receives a request, it ffl processes the request and updates its state accordingly, ffl sends information about the update to p2 (message 2 in the figure), ffl without waiting for an acknowledgement from p2, sends a response to the client (message 3 in the figure). The order in which these messages are sent is important because it guarantees that, given our assumption about failures, if the client receives a response, then either p2 will eventually receive message 2 or p2 will crash. Server p2 updates its state upon receiving messages from p1. In addition, p1 sends dummy messages to p2 every o seconds. If p2 does not receive such a message for o + ffi seconds, then p2 becomes the primary. Once p2 has become the primary, it informs the clients (who update their copies of Dest) and begins processing subsequent requests from the clients. We now show how this protocol satisfies our characterization of a primary{backup protocol. Property Pb1 requires that there never be two primaries. This is satisfied by the following definitions of Prmy: Prmyp1 def= (p1 has not crashed) Prmyp2 def= (p2 has not received a message from p1 for o + ffi) Predicate Prmyp1 ^ Prmyp2 is always false in a system executing our protocol, and hence Pb1 is satisfied. The failover time for this protocol is the longest interval during which :Prmyp1 ^ :Prmyp2 can hold, and it is o + 2ffi seconds. Property Pb2 follows trivially from 4To simplify exposition, we assume that the maximum message delay between the clients and the servers is the same as the delay between the servers. However, our results can be easily extended to the case when the delays are different. ---------------------------------------------------------------------------- the description of the protocol. Property Pb3 is true because requests are not sent to p2 until after p1 has failed. Finally, Pb4 requires that the protocol implements a single bofo server for some values of k and ?. Since p1 sends message 2 before message 3, it will never be the case that p1 sends a response to the client and p2 does not get information about that response from p1. In this protocol, there is at most one switch of the primary. So there is at most one outage period i.e. k = 1. To compute ?, it suffices to compute the longest interval during which a client request may not elicit a response. Assume that p1 crashes at time tc. Thus any client request sent to p1 at tc ? ffi or later may be lost since p1 crashes at tc. Furthermore, p2 may not learn about p1's crash until tc + o + 2ffi, and clients may not learn that p2 is the primary for another ffi. So, the total period during which a request may not elicit a response is tc ? ffi through tc + o + 3ffi: the protocol implements a single (1; o + 4ffi){bofo server. 3 The Model We consider a system consisting of n servers and a set of clients. We assume that server clocks are perfectly synchronized with real time.5 Clients and servers communicate by exchanging messages through a completely connected point-to-point network.6 Each message sent is enqueued in a queue maintained by the receiving process, and a process accesses its message queue by executing a receive statement. We assume that links between processes are FIFO (i.e. if pi sends message m followed by m0 to process pj, then pj will never receive m after m0) and there is a known constant ffi such that if processes pi and pj are connected by a (non-faulty) link, then a message sent from pi to pj at time t will be enqueued in pj's queue at of before t + ffi. We are interested in identifying the costs inherent in primary{backup protocols, and so we assume that it takes no time for a server to compute a response. Our theorems characterize lower bounds; they are not invalidated by servers that require a substantial amount of time to compute a response. We model an execution of a system by a run, which is a sequence of timestamped events involving clients, servers, and message queues. These events include sending messages, enqueuing messages, receiving messages, and internal events that model computation at processes. Two runs oe1 and oe2 of the system are defined to be indistinguishable to a process p if the same sequence of events (with the same timestamps) occur at p in both oe1 and oe2. We assume that if two runs oe1 and oe2 are indistinguishable to p and p has the same initial state in both runs, then at any time t the state of p at t in oe1 is the same as the state of p at t in oe2. It is not hard to extend the definition of indistinguishability to handle nondeterministic servers. We assume that the clients can send any request at any time. If we impose restrictions on the behavior of the clients, then we can derive protocols that violate the lower bounds in this paper. 5Our protocols can be extended to the case where clocks are only approximately synchronized [14]. 6Another approach would be assume that servers are interconnected with redundant broadcast busses [2, 8]. We have not pursued this approach. ---------------------------------------------------------------------------- Define OE to be the potential causality relation [12] on server events e1 and e2. Thus OE is the transitive closure of the following relation ;: e1 ; e2 iff both e1 and e2 occur at the same server s and e1 occurs before e2, or e1 is a send event and e2 is the corresponding receive event. Informally, we say a request m is an update request if it changes the state of the service in such a way that the responses to subsequent requests depend on m. More formally, let m be a request with associated response r (and e(m) and e(r) be the events in the run associated with the receipt of m and the sending of r respectively). Then m is an update request if all request/response pairs m0=r0, where m0 was sent after r, have e(m) OE e(r0). We assume that update requests exist since otherwise the actions performed by the primary do not have to be communicated to the backups. We assume that failures occur independently from each other. We consider the following hierarchy of failure models: Crash failures: A server may fail by halting prematurely. Until it halts, it behaves correctly [13]. 7 Crash+Link failures: A server may crash or a link may lose messages (but links do not delay, duplicate or corrupt messages). Receive-Omission failures: A server may fail not only by crashing, but also by omitting to receive some of the messages directed to it over a non-faulty link. Send-Omission failures: A server may fail not only by crashing, but also by omitting to send some of the messages over a non-faulty link [10]. General-Omission failures: A server may exhibit send-omission and receive-omission failures [20]). Note that crash+link failures and the various types of omission failures are quite different. Although all of these failure models concern loss of messages, each class of failures is dealt with by a different masking technique. In particular, crash+link failures can be masked by adding redundant communication paths, while omission failures can only be masked by adding redundant servers so that faulty processes can detect their own failure and halt. We return to these masking techniques in Section 5. Failures are counted by the number of failing components (either servers or links). We say that a protocol tolerates f failures if it works correctly despite the failure of up to f faulty components (note that each faulty component may fail many times during an execution). 4 Lower Bounds For each failure model, we now give lower bounds for implementing a single (k; ?){bofo server using the primary{backup approach. 7The lower bounds we derive for crash failures also hold for fail-stop failures [21] except for the bound on failover time. The lower bound on failover time depends on the maximum duration between when a server pi fails and when failedi becomes true. ---------------------------------------------------------------------------- 4.1 Bounds on Replication The first bound is obvious. However, to introduce our notation and the proof technique that will be used later in the section, we give a formal proof of the theorem. Theorem 1 Any primary{backup protocol tolerating f crash failures requires n >= f + 1. Proof : We prove the result by contradiction. Suppose there is a protocol P for n < f + 1. Thus, P satisfies Pb4. Consider a run in which all n servers are crashed initially and clients submit R > kd?=de requests, where d is the minimum time between the sending of any two requests (d > 0). By Pb4, at least one of these requests must elicit a response. This is because the number of requests that cannot have responses must fall into at most k intervals of length at most ?, and each interval of ? can contain at most d?=de requests. However, such a response is impossible since, by assumption, all servers have crashed. 2 The following lemma is used in the rest of the theorems in this section. Lemma 4.1 Consider any protocol that satisfies Pb4. Suppose two disjoint and nonempty sets of servers A and B can be found that meet the following three properties: 1. There exists a run oea containing R > 2kd?=de requests where d is the minimum time between the sending of any two client requests (d > 0). Furthermore, in this run the servers in A do not crash and all other servers crash at time 0. 2. There exists a run oeb containing R requests. Furthermore, in this run the servers in B do not crash and all other servers crash at time 0. 3. There exists a run oeab containing R requests. Furthermore, the servers in A and B do not crash, oeab is indistinguishable from oea to all servers in A, and oeab is indistinguishable from oeb to all servers in B. At least one of the above runs violates Pb2. Proof : Suppose for contradiction that the lemma is false and runs oea, oeb and oeab all satisfy Pb2. For oea, by Pb4 at least R?kd?=de of the requests must have been received by servers in A. Similarly, for oeb, at least R? kd?=de of the requests must have been received by servers in B. Finally, since oeab is indistinguishable from oea to servers in A, they must execute the same number of receive events in both runs. The same holds for the servers in B. By Pb2, each request is sent to at most one server and so at least 2(R ? kd?=de) requests must have been sent in oeab. Since only R requests were sent, we must have R >= 2(R ? kd?=de), or R <= 2kd?=de, which contradicts the assumption that R > 2kd?=de. 2 Theorems 2 and 3 depend on two parameters of primary{backup protocols. Let ? be the maximum time that can elapse between any two successive client requests (possibly from different clients), and let D be a duration such that if some server s becomes the primary at time t0 and remains the primary through time t >= t0 +D when a client ci sends a request, ---------------------------------------------------------------------------- then Desti = s at time t. Hence, D is the minimum delay until all clients know the identity of a new primary. For simplicity of notation, we write D < ? to mean that D is bounded and ? is either unbounded or bounded and greater than D. Note that when D < ? the service must be able to detect the failure of a primary and disseminate the new primary's identity to the clients without using any messages from clients. With both send-omission failures and crash+link failures, messages may fail to reach their destinations. The following theorem shows that crash+link failures are more expensive to tolerate, as they require more replication. Theorem 2 Suppose there is at most one link between any two servers. Then any primary{ backup protocol tolerating f crash+link failures and having D < ? requires n >= f + 2. Proof : For contradiction, assume the existence of a protocol P with n < f + 2. We will show that P has three runs oea, oeb and oeab that satisfy the conditions of Lemma 4.1. From the lemma, at least one of these runs violates Pb2, which implies that P cannot be a primary{backup protocol. Let A be a set containing the one server sa and let B be the set of remaining servers. Since jAj = 1 and jBj = n ? 1 <= f , A and B can become disconnected by link failures. We first construct the run oeab in which no server crashes, postulating that the links between the servers in A and B are faulty and do not deliver any messages. As required by Lemma 4.1, clients send a total of R > 2kd?=de requests. Let < d <= ? ? D be the minimum interval between any two such requests. We postulate that a request will be sent at time t iff no request has been sent during the interval [t ? d::t) and one of the following rules hold. 1. A server s is the primary during the interval [t?D::t]. This request arrives immediately and is enqueued (at s, by Pb3 and the definition of D). 2. There is no primary at time t. This request arrives immediately and by Pb3 will never be enqueued at any server. 3. A server s is the primary at time t but another server s0 is the primary immediately after time t. If this request is sent to s, then it arrives after t, and if it is sent to any other server, then it arrives immediately. In both cases, it arrives at a server that is not the primary, and so will not be enqueued (again by Pb3). Note that, by construction, the maximum interval between any two client requests is D+ d. This interval occurs when a server s becomes the primary just before d after a client message is sent, and s remains the primary for at least D. Hence, the client will be able to send R requests within time R(D + d). This completes the construction of oeab. We now construct oea and oeb, recalling that in oea all of the servers except sa crash at time 0, and in oeb server sa crashes at time 0. The clients send the same requests and at the same times in oea and in oeb as in oeab. Furthermore, by construction these requests will arrive at the servers according to the same rules used in constructing oeab. Of course, a client request may not be delivered to the same servers in oea or oeb as in oeab, since different servers are operational in these runs. ---------------------------------------------------------------------------- Since sa does not receive any messages from servers in B in either oeab or oea, these two runs are indistinguishable to sa as long as it receives the same client requests at the same times in both runs. We show that this is the case by contradiction: let t be the earliest time that sa can distinguish between these two runs. Thus, at time t either sa received a request m in oeab but not in oea or it received a request m in oea but not in oeab. We will assume the former; the proof for the latter is similar. The request m must have been enqueued at some time t0 <= t at sa in oeab. Since m was received by sa, m must have been sent by rule 1. By rule 1, sa must have been the primary through [t0 ?D::t0] in oeab and therefore, by indistinguishability, in oea as well. By the definition of D, m would have been enqueued at sa at time t0 in oea as well. Since sa cannot distinguish between the runs before t, sa cannot receive m before t in oea, and sa must execute a receive in both oea and oeab at time t. So, it must be the case that sa receives another request m0 <> m at time t in oea. Assume that m0 was enqueued at time t00. By an indistinguishability argument similar to above, m0 must be enqueued at time t00 at sa in oeab as well. Therefore, if s received m0 in oea at time t, it must receive m0 in oeab as well, a contradiction. A similar argument can be used to show that the servers in B receive the same requests in oeb and oeab, and so these two runs are indistinguishable to the servers in B. Thus, by Lemma 4.1 P cannot be a primary{backup protocol. 2 The assumption in this theorem that D < ? is significant. As we discuss in Section 5, when D >= ? protocols that tolerate f crash+link failures can be constructed that use only f + 1 servers. The next theorem states that additional replication is required in order to tolerate receiveomission failures. The proof is similar to that of Theorem 2, and so it is omitted. Theorem 3 Any primary{backup protocol tolerating f receive-omission failures and having D < ? requires n > b3f2 c. The next lower bound holds independent of the relation between D and ?. Theorem 4 Any primary{backup protocol tolerating f general-omission failures requires n > 2f . Proof : Assume for contradiction that there is a protocol for n <= 2f . Partition the servers into two disjoint sets A and B of size at most f each. We will construct two runs oe1 and oe2. In each run, one set of servers will be faulty and the other set will be non-faulty. oe1: The servers in A are faulty and fail to communicate with all servers in B, but behave correctly otherwise. Clients send update requests until the first response is sent (this must happen, by Pb4). Assume that the first response r to an update request m is sent at time t. Say that this response is sent by server s. oe2: The same as oe1 up to time t, but if s is in B, then in oe2 it is the servers in B that are faulty and fail to communicate with all servers in A rather than the servers in A that are faulty. In either case, r is sent by a faulty server. Furthermore, no server can distinguish oe1 from oe2 through time t and therefore, the first response r is sent at time t in oe2 as well. ---------------------------------------------------------------------------- Let all of the faulty servers in oe2 crash immediately after r is sent and have clients continue to send requests until another response r0 is sent. This response must have been sent by a non-faulty server which implies that :(e(m) OE e(r0)). However this violates the fact that m is an update request. 2 4.2 Bounds on Blocking Informally, a blocking primary-backup protocol is one in which the primary must, after receiving a request m, either receive a message from another server or simply wait an interval before it can respond to m. Consider a failure-free run of a primary-backup protocol that is handling a request. Let the time that the request is received be tm and the time that the response is sent be tr. We say that this protocol is C{blocking if it is guaranteed that tr ? tm <= C holds. For example, any primary-backup protocol in which the primary sends information about a request to the backups and waits for acknowledgement before sending the response to the client will be at least 2ffi{blocking. As shown in Section 5, 0{blocking primary-backup protocols can be built for crash and crash+link failure models. For servers that take no time to compute the response to a request, the simple protocol tolerating crash failures presented in Section 2 is 0{blocking. We call such protocols nonblocking because the primary can send a reply to the client as soon as the reply has been computed. Nonblocking protocols tolerating receive-omission failures also exist as long as n > 2f , but there is can be no nonblocking primary{backup protocol tolerating send-omission or general{omission failures. Theorem 5 Any primary{backup protocol tolerating f receive-omission failures with f > 1, n <= 2f and D < ? is C{blocking for some C >= 2ffi. Proof : For contradiction, suppose there is a primary{backup protocol for n <= 2f and f > 1 that is C{blocking where C < 2ffi. Partition the servers into two sets A and B where jAj = f and jBj = n ? f <= f . We construct three runs. In all three runs, assume that all server messages take ffi to arrive. oe1: There are no failures and all client requests take ffi to arrive. Moreover, clients send update requests until some update request m evokes a response r. Let m be received at time tm by server p 2 A and r be sent at time tr by a server q 2 A (q could be the same as p). Notice that since the protocol is C{blocking where C < 2ffi, tr ? tm < 2ffi. Also since, by construction, all requests take ffi to arrive, all client requests sent after time tm + ffi will be received after time tr. oe2: Identical to oe1 until p receives m at time tm. At this point in oe2, all servers in A are assumed to crash and clients are assumed to send no request during the interval [tm + ffi::tr ]. Finally, after time tr clients are assumed to repeatedly send requests at intervals of at least d where < d <= ? ?D as follows. A request is sent at time t if no request has been sent in [t ? d::t) and one of the following rules hold. 1. A server s 2 B is the primary during the interval [t ? D::t]. This request arrives immediately and is enqueued (at s, by Pb3 and the definition of D). ---------------------------------------------------------------------------- 2. There is no primary in B at time t. This request arrives immediately by Pb3 will never be enqueued at any server. 3. A server s 2 B is the primary at time t but another server s0 2 B is the primary immediately after time t. If the request is sent to s, then it arrives after t, and if it is sent to any other server it arrives immediately. In both cases, it arrives at a server that is no the primary, and so will not be enqueued (again, by Pb3). Notice that eventually, there will be a response (say r0) in oe2 because the protocol satisfies Pb4, and by construction it must be from a request sent by rule 1. oe3: The same as oe2, except that the servers in A do not crash at time tm. Instead, the servers in B commit receive failures on all messages sent after tm by servers in A. Clients send requests at the same times as in oe2 which arrive using the same rules as oe2. Now, consider these three runs. By construction, the runs are identical up to time tm. Since all server messages take ffi to arrive, clients cannot distinguish oe1 and oe3 through tm+ ffi, and so clients send the same requests to the same servers in both oe1 and oe3. Similarly, since all server messages take ffi to arrive, the servers in B cannot distinguish between oe1 and oe3 through tm + ffi. Therefore, since tr ? tm < 2ffi, p (the server that received request m at time tm in oe1) and q (the server that sent response r at time tr in oe1) cannot distinguish between oe1 and oe3 through time tr, and so q sends response r in oe3 as well. Then, using an argument similar to the one in Theorem 2, servers in B cannot distinguish oe2 and oe3, and so response r0 also occurs in oe3. However, :(e(m) OE e(r0)) which violates the assumption that m is an update request. 2 In run oe3 of the above proof, a correct primary (p in set A) becomes the backup, while a faulty server from set B becomes the primary in p's place. It is always possible to construct such a run. This is a disconcerting property: there does not exist a primary{backup protocol that tolerates receive-omission failures with n <= 2f in which a primary cedes only when it fails. Moreover, this lower bound is tight|in [6], we give a receive-omission primary{backup protocol with n = 2f + 1 in which a primary cedes only when it fails. And, if f = 1, then the following theorem holds: its proof is similar to the proof of Theorem 5 (and is therefore omitted), except that p = q. Theorem 6 Any primary{backup protocol tolerating receive-omission failures with f = 1 and n <= 2f and having D < ? is C{blocking for some C >= ffi. Primary{backup protocols tolerating send-omission or general{omission failures exhibit the same blocking properties as those tolerating receive-omission failures, except that the restriction D < ? is no longer necessary. Here we prove just the results for send{omission failures. The results for general{omission failures then follow. Theorem 7 Any primary{backup protocol tolerating f send-omission failures with f > 1 is C{blocking for some C >= 2ffi. Proof : For contradiction, suppose there is a primary{backup protocol that is C{blocking where C < 2ffi. We consider the following two runs in which all server messages take ffi to arrive. ---------------------------------------------------------------------------- oe1: There are no failures and all client requests take ffi to arrive. Moreover, clients send update requests until some update request m evokes a response r. Let m be received at time tm by server p and r be sent at time tr by a server q (again q could be p). Notice that since the protocol is C{blocking where C < 2ffi, tr ? tm < 2ffi. Also, since by construction all requests take ffi to arrive, all client requests sent after time tm + ffi will be received after time tr. oe2: Identical to oe1 through tm. After tm, p and q fail and omit to send all messages to all servers except each other. Since, by construction, all messages take ffi to arrive, servers and clients cannot distinguish between oe1 and oe2 through tm + ffi and, as a result, p and q cannot distinguish the two runs through tm +2ffi. Therefore, since tr ? tm < 2ffi, q sends the response r at time tr in oe2 as well. Now let p and q crash at time tr and the clients send requests after time tr. By Pb4, there eventually must be some request m0 that results in a response r0. However, :(e(m) OE e(r0)), which violates the assumption that m us an update request. 2 Again, if f = 1, then the following theorem can be proved using a proof similar to Theorem 7, except that p = q. Theorem 8 Any primary{backup protocol tolerating send-omission failures with f = 1 is C{blocking for some C >= ffi. 4.3 Bounds on Failover Times Recall that the failover time is the longest interval during which Prmys is not true for any server s. In this section, we give lower bounds for failover times. In order to discuss these bounds, we postulate a fifth property of primary{backup protocols. Pb5: A correct server that is the primary remains so until there is a failure of some server or link. This is a reasonable expectation and it is valid for all protocols that we have found in the literature. Theorem 9 Any primary{backup protocol tolerating f crash failures must have a failover time of at least f ffi. Proof : The proof is by induction on f . Base case f = 0: trivially true since a failover time cannot be smaller than zero. Induction case f > 0: Suppose the theorem holds for at most f ? 1 failures, but (for a proof by contradiction) there is a protocol P for which the theorem is false when there are f failures. From the induction hypothesis, there is a run oe with at most f ? 1 failures and an interval [t0::t1] at least (f ? 1)ffi during which there is no primary. Let p1 be the server that becomes the primary at t1. Consider the two runs oe1 and oe2 that extend oe as follows: ---------------------------------------------------------------------------- oe1: Assume p1 crashes at time t1. By assumption, there exists a new primary (say p2) at time t2 < t1 + ffi. Since p1 crashes at time t1, p2 does not receive any messages from p1 that were sent after time t1. oe2: Assume that p1 is correct, there are no other crashes at or after t1 and all messages sent by p1 after time t1 take ffi to arrive. Since p2 cannot distinguish oe1 from oe2 through time t2, p2 becomes the primary at time t2 in oe2. By Pb5, however, p1 remains the primary at time t2 in oe2. This violates Pb1, and so P is not a primary{backup protocol. 2 Failover times for all other failure models have a larger lower bound: Theorem 10 Any primary{backup protocol tolerating f crash+link failures has a failover time of at least 2f ffi. Proof : The proof is by induction on f . Base case f = 0: trivially true. Induction case f > 0: Suppose the theorem holds for at most f ? 1 failures, but (for a proof by contradiction) there is a protocol P for which the theorem is false when there are f failures. From the induction hypothesis, there is a run oe with at most f ?1 failures and an interval [t0::t1] at least 2(f ? 1)ffi during which there is no primary. Let p1 be the server that becomes the primary at t1. Consider the three runs oe1, oe2 and, oe3 that extend oe as follows: oe1: Assume that p1 crashes at time t1 and all messages sent after t1 take ffi to arrive. Furthermore, the crash of p1 is the only failure at or after t1. By assumption, there exists a new primary (say p2) at time t2 < t1 + 2ffi. Since p1 crashes at time t1, p2 does not receive any messages from p1 that were sent after time t1. Furthermore, since all messages take ffi to arrive, any message that was sent after t1 + ffi can be received by p2 only after time t2. oe2: Assume that p1 is correct, there are no other failures at or after t1, and all messages sent after time t1 take ffi to arrive. Since there are no failures at or after time t1, by Pb5 p1 continues to be the primary through time t2. oe3: The same as oe2 except that the link between p1 and p2 is faulty and does not deliver any message sent by p1 to p2 after time t1. By construction, p2 cannot distinguish oe1 from oe3 through time t2, and so p2 becomes the primary at time t2 in oe3. Similarly, p1 cannot distinguish oe2 from oe3 through time t2 and so p1 remains the primary until time t2 in oe3. This violates Pb1, and so P is not a primary{backup protocol. 2 We omit the proofs of the following two theorems because they are similar to Theorem 9. Theorem 11 Any primary{backup protocol tolerating f receive-omission failures has a failover time of at least 2f ffi. Theorem 12 Any primary{backup protocol tolerating f send-omission failures has a failover time of at least 2f ffi. ---------------------------------------------------------------------------- 5 Outline of the Protocols In order to establish that the bounds given above are tight, we have developed primary{ backup protocols for the different failure models [6]. In this section, we outline these protocols and discuss which of our lower bounds are tight. Our protocol for crash failures is similar to the protocol given in Section 2. Whenever the primary receives a request from the client, it processes that request and sends information about state updates to the backups before sending a response to the client. All servers periodically send messages to each other in order to detect server failures. This protocol uses (f + 1) servers and so the lower bound in Theorem 1 is tight. Furthermore, it is nonblocking and so incurs no additional delay. It has the failover time f ffi + o for arbitrarily small and positive o , and so the lower bound in Theorem 9 is tight. In order for the protocol to tolerate crash+link failures, we add an additional server. By Theorem 2, this server is necessary. The additional server ensures that there is always at least one non-faulty path between any two correct servers, where a path contains zero or more intermediate servers. The protocol for crash failures outlined above is now modified so that a primary ensures any message sent to a backup is sent across at least one non-faulty path. This protocol uses (f + 2) servers and so Theorem 2 is tight. Furthermore, it is nonblocking and so incurs no additional delay. It has the failover time 2f ffi + o for arbitrarily small and positive o , and so Theorem 10 is tight. Most of our protocols for the various kinds of omission failures apply translation techniques [17] to the protocol for crash failures outlined above. These techniques ensure that a faulty server detects its own failure and halts, thereby translating a more severe failure to a crash failure. The translations of [17] assume a round-based protocol. Since our crash failure protocol is not round-based, we must modify the translations so that a server can send and receive messages at any time rather than just at the beginning or the end of a round. All of these resulting omission{failure protocols have failover time 2f ffi + o , and thus Theorems 11 and 12 are tight. The protocol for send-omission failures uses f +1 servers and is 2ffi + o{blocking. Furthermore, we also have a send-omission protocol for f = 1 that is ffi{blocking. Thus, Theorems 7, 8 and 12 are tight. The protocol for general-omission failures also uses 2f + 1 servers and is 2ffi + o{blocking, and so Theorem 4 is tight, and Theorems 7 and 12 are tight for general-omission failures as well. We have not been able to determine whether Theorems 3 and 5 are tight. Our protocol for receive-omission failures uses 2f + 1 servers whereas the lower bound in Theorem 3 only requires n > b3f2 c. We have constructed protocols for n = 2, f = 1 and n = 4, f = 2 but are unable to generalize these protocols. We can also show that any protocol for n <= 2f has the following odd property: there is at least one run of the protocol in which a non-faulty primary is forced to relinquish control to a backup that is faulty. However, the protocol for n = 2, f = 1 is ffi{blocking and so Theorem 6 is tight. Table 1 summarizes all of our results. ---------------------------------------------------------------------------- failure degree of amount of failover model replication blocking time crash n > f f ffi crash+link n > f + 1 y 2f ffi receive omission n > b3f2 c ? y ffi if n <= 2f and f = 1 y 2ffi if n <= 2f and f > 1 ? y if n > 2f 2f ffi send omission n > f ffi if f = 1 2ffi if f > 1 2f ffi general omission n > 2f ffi if f = 1 2ffi iff > 1 2f ffi ? Bound not known to be tight. y D < ?. Table 1: Lower Bounds. 6 Discussion We give a precise characterization for primary{backup protocols in a system with synchronized clocks and bounded message delays. We then present lower bounds on the degree of replication, the blocking time, and the failover time under various kinds of server and link failures. We finally outline a set of primary{backup protocols that show which of our lower bounds are tight. We now briefly compare our results to existing primary{backup protocols. The protocol presented in [3] tolerates one crash+link failure by using only two servers. This appears to contradict Theorem 2 which states that at least three servers are required to tolerate one failure. However, the protocol in [3] assumes that there are two links between the two servers, effectively masking a single link failure. Hence, only crash failures need to be tolerated, and this can be accomplished using only two servers (Theorem 1). A more ambitious primary{backup protocol is presented in [15]. This protocol works for the following failure model (quoted from [15]): The network may lose or duplicate messages, or deliver them late or out of order; in addition it may partition so that some nodes are temporarily unable to send messages to some other nodes. As is usual in distributed systems, we assume the nodes are fail-stop processors and the network delivers only uncorrupted messages. This failure model is incomparable with those in the hierarchy we presented. However, the protocol does tolerate general-omission failures and has optimal degree of replication for general-omission failures as it uses 2f + 1 servers. In Theorem 2, we assumed that D < ?. This assumption is crucial: we are able to construct a two-server primary{backup protocol tolerating one crash+link failure for which D >= ?. Recall that link failures are masked by adding redundant paths between the servers. ---------------------------------------------------------------------------- Our two-server crash+link protocol essentially uses the path from the primary to the backup through the client as the redundant path. Thus, there appears to be a tradeoff between the degree of replication and the time it takes for a client to learn that there is a new primary. The lower bounds on failover times given in Section 4.3 assume Pb5. This is necessary as we have constructed protocols that have failover times smaller than the lower bounds given in Section 4.3 and these protocols do not satisfy Pb5. This smaller failover time is achieved at a cost of an increased variance in service response time. Finally, in this paper we have attempted to give a characterization of primary{backup that is broad enough to include most synchronous protocols that are considered to be instances of the approach. There are protocols, however, that are incomparable to the class of protocols we analyze [5, 16, 18] since they were developed for an asynchronous setting. Such protocols cannot be cast in terms of implementing a (k; ?){bofo server for finite values of k and ?. We are currently studying possible characterizations for a primary{backup protocol in an asynchronous system and hope to extend our results to this setting. Acknowledgements We would like to thank Lorenzo Alvisi, Mike Reiter and the anonymous conference referees for their helpful comments on earlier drafts of this paper. References [1] P.A. Alsberg and J.D. Day. A principle for resilient sharing of distributed resources. In Proceedings of the Second International Conference on Software Engineering, pages 627{644, October 1976. [2] ?Ozalp Babao>=glu and Rog?erio Drummond. Streets of Byzantium: Network architectures for fast reliable broadcasts. IEEE Transactions on Software Engineering, 11(6):546{554, June 1985. [3] J.F. Barlett. A nonstop kernel. In Proceedings of the Eighth ACM Symposium on Operating System Principles, SIGOPS Operating System Review, volume 15, pages 22{ 29, December 1981. [4] Anupam Bhide, E.N. Elnozahy, and Stephen P. Morgan. A highly available network file server. In USENIX, pages 199{205, 1991. [5] Kenneth P. Birman and Thomas A. Joseph. Exploiting virtual synchrony in distributed systems. In Eleventh ACM Symposium on Operating System Principles, pages 123{138, November 1987. [6] Navin Budhiraja, Keith Marzullo, Fred Schneider, and Sam Toueg. Optimal primary{ backup protocols. In Proceedings of the Sixth International Workshop on Distributed Algorithms, Haifa, Israel, November 1992. To Appear. [7] IBM International Technical Support Centers. IBM/VS extended recovery facility (XRF) technical reference. Technical Report GG24-3153-0, IBM, 1987. ---------------------------------------------------------------------------- [8] Flaviu Cristian. Synchronous atomic broadcast for redundant broadcast channels. Journal of Real-Time Systems, 2:195{212, September 1990. [9] Flaviu Cristian, Houtan Aghili, H. Ray Strong, and Danny Dolev. Atomic broadcast: From simple message diffusion to Byzantine agreement. In Proceedings of the Fifteenth International Symposium on Fault-Tolerant Computing, pages 200{206, Ann Arbor, Michigan, June 1985. A revised version appears as IBM Technical Report RJ5244. [10] Vassos Hadzilacos. Issues of Fault Tolerance in Concurrent Computations. PhD thesis, Harvard University, June 1984. Department of Computer Science Technical Report 11-84. [11] Thomas Joseph and Kenneth Birman. Reliable Broadcast Protocols, pages 294{318. ACM Press, New York, 1989. [12] Leslie Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7):558{565, July 1978. [13] Leslie Lamport and Michael Fischer. Byzantine generals and transaction commit protocols. Op. 62, SRI International, April 1982. [14] Leslie Lamport and P. M. Melliar-Smith. Synchronizing clocks in the presence of faults. Journal of the ACM, 32(1):52{78, January 1985. [15] Barbara Liskov, Sanjay Ghemawat, Robert Gruber, Paul Johnson, and Michael Williams. Replication in the Harp file system. In Proceedings of the 13th Symposium on Operating System Principles, pages 226{238, 1991. [16] Timothy Mann, Andy Hisgen, and Garret Swart. An algorithm for data replication. Technical Report 46, Digital Systems Research Center, 1989. [17] Gil Neiger and Sam Toueg. Automatically increasing the fault-tolerance of distributed systems. In Proceedings of the Seventh ACM Symposium on Principles of Distributed Computing, pages 248{262, Toronto, Ontario, August 1988. ACM SIGOPS-SIGACT. [18] B. Oki and Barbara Liskov. Viewstamped replication: A new primary copy method to support highly available distributed systems. In Seventh ACM Symposium on Principles of Distributed Computing, pages 8{17, Toronto, Ontario, August 1988. ACM SIGOPS- SIGACT. [19] M. Pease, R. Shostak, and Leslie Lamport. Reaching agreement in the presence of faults. Journal of the ACM, 27(2):228{234, April 1980. [20] Kenneth J. Perry and Sam Toueg. Distributed agreement in the presence of processor and communication faults. IEEE Transactions on Software Engineering, 12(3):477{482, March 1986. ---------------------------------------------------------------------------- [21] Richard D. Schlichting and Fred B. Schneider. Fail-stop processors: an approach to designing fault-tolerant computing systems. ACM Transactions on Computer Systems, 1(3):222{238, August 1983. [22] Fred B. Schneider. Implementing fault tolerant services using the state machine approach: A tutorial. Computing Surveys, 22(4):299{319, December 1990. ----------------------------------------------------------------------------