
Professor Gregory Chockler
Academic and research departments
Distributed and Networked Systems Group, Surrey Centre for Cyber Security, Computer Science Research Centre, School of Computer Science and Electronic Engineering.About
Biography
I am a Professor at the Department of Computer Science, University of Surrey, UK, where I am Joint Head of the Distributed and Networked Systems group and a member of the Surrey Centre for Cyber Security.
I joined University of Surrey in January 2020 as a Full Professor. From 2012 and until January 2020, I served as a tenured faculty at Royal Holloway, University of London starting as an Associate Professor (Reader), and being promoted to Full Professor in 2017. Between 2005 and 2012, I was a Research Staff Member at IBM Research. From 2003-2005, I was a Postdoctoral Associate with the MIT Computer Science and Artificial Intelligence Laboratory. I have held visiting scientist positions at EPFL, KTH, SICS, Emory University, IBM, and IMDEA Software Institute.
Areas of specialism
ResearchResearch interests
My research interests span both fundamental and applied aspects of fault-tolerance and trustworthiness of distributed systems. While at IBM, I co-invented and participated in the development of a new event-monitoring technology, which incorporated concepts from random graph theory and weak consistency, to boost scalability of the control planes of IBM WebSphere line of products (notably WebSphere Virtual Enterprise), by several orders of magnitude. I also co-invented Speculative Paxos, a new reconfigurable replication protocol, which was integrated into the management tiers of several IBM cloud offerings (notably, IBM WebSphere Liberty and PureApp) to improve their availability and failure resilience. For this work, I received several corporate-level awards including two Outstanding Technical Achievement Awards and a Scientific Accomplishment Award. My recent work focuses on blockchain and scalable information diffusion and is funded by the Stellar Development Foundation Academic Research Grant (October 2020), IBM Shared University Research Award (2017) and Facebook Faculty Award (2015).
Research interests
My research interests span both fundamental and applied aspects of fault-tolerance and trustworthiness of distributed systems. While at IBM, I co-invented and participated in the development of a new event-monitoring technology, which incorporated concepts from random graph theory and weak consistency, to boost scalability of the control planes of IBM WebSphere line of products (notably WebSphere Virtual Enterprise), by several orders of magnitude. I also co-invented Speculative Paxos, a new reconfigurable replication protocol, which was integrated into the management tiers of several IBM cloud offerings (notably, IBM WebSphere Liberty and PureApp) to improve their availability and failure resilience. For this work, I received several corporate-level awards including two Outstanding Technical Achievement Awards and a Scientific Accomplishment Award. My recent work focuses on blockchain and scalable information diffusion and is funded by the Stellar Development Foundation Academic Research Grant (October 2020), IBM Shared University Research Award (2017) and Facebook Faculty Award (2015).
Supervision
Postgraduate research supervision
Publications
We study implementations of basic fault-tolerant primitives, such as consensus and registers, in message-passing systems subject to process crashes and a broad range of communication failures. Our results characterize the necessary and sufficient conditions for implementing these primitives as a function of the connectivity constraints and synchrony assumptions. Our main contribution is a new algorithm for partially synchronous consensus that is resilient to process crashes and channel failures and is optimal in its connectivity requirements. In contrast to prior work, our algorithm assumes the most general model of message loss where faulty channels are flaky, i.e., can lose messages without any guarantee of fairness. This failure model is particularly challenging for consensus algorithms, as it rules out standard solutions based on leader oracles and failure detectors. To circumvent this limitation, we construct our solution using a new variant of the recently proposed view synchronizer abstraction, which we adapt to the crash-prone setting with flaky channels.
Byzantine state-machine replication (SMR) ensures the consistency of replicated state in the presence of malicious replicas and lies at the heart of the modern blockchain technology. Byzantine SMR protocols often guarantee safety under all circumstances and liveness only under synchrony. However, guaranteeing liveness even under this assumption is nontrivial. So far we have lacked systematic ways of incorporating liveness mechanisms into Byzantine SMR protocols, which often led to subtle bugs. To close this gap, we introduce a modular framework to facilitate the design of provably live and efficient Byzantine SMR protocols. Our framework relies on a view abstraction generated by a special SMR synchronizer primitive to drive the agreement on command ordering. We present a simple formal specification of an SMR synchronizer and its bounded-space implementation under partial synchrony. We also apply our specification to prove liveness and analyze the latency of three Byzantine SMR protocols via a uniform methodology. In particular, one of these results yields what we believe is the first rigorous liveness proof for the algorithmic core of the seminal PBFT protocol.
Atomic broadcast is a reliable communication abstraction ensuring that all processes deliver the same set of messages in a common global order. It is a fundamental building block for implementing fault-tolerant services using either active (aka state-machine) or passive (aka primary-backup) replication. We consider the problem of implementing reconfigurable atomic broadcast, which further allows users to dynamically alter the set of participating processes, e.g., in response to failures or changes in the load. We give a complete safety and liveness specification of this communication abstraction and propose a new protocol implementing it, called Vertical Atomic Broadcast, which uses an auxiliary service to facilitate reconfiguration. In contrast to prior proposals, our protocol significantly reduces system downtime when reconfiguring from a functional configuration by allowing it to continue processing messages while agreement on the next configuration is in progress. Furthermore, we show that this advantage can be maintained even when our protocol is modified to support a stronger variant of atomic broadcast required for passive replication.
Atomic Commit Problem (ACP) is a single-shot agreement problem similar to consensus, meant to model the properties of transaction commit protocols in fault-prone distributed systems. We argue that ACP is too restrictive to capture the complexities of modern transactional data stores, where commit protocols are integrated with concurrency control, and their executions for different transactions are interdependent. As an alternative, we introduce Transaction Certification Service (TCS), a new formal problem that captures safety guarantees of multi-shot transaction commit protocols with integrated concurrency control. TCS is parameterized by a certification function that can be instantiated to support common isolation levels, such as serializability and snapshot isolation. We then derive a provably correct crash-resilient protocol for implementing TCS through successive refinement. Our protocol achieves a better time complexity than mainstream approaches that layer two-phase commit on top of Paxos-style replication.
We present a general methodology for establishing the impossibility of implementing certain concurrent objects on different (weak) memory models. The key idea behind our approach lies in characterizing memory models by their mergeability properties, identifying restrictions under which independent memory traces can be merged into a single valid memory trace. In turn, we show that the mergeability properties of the underlying memory model entail similar mergeability requirements on the specifications of objects that can be implemented on that memory model. We demonstrate the applicability of our approach to establish the impossibility of implementing standard distributed objects with different restrictions on memory traces on three memory models: strictly consistent memory, total store order, and release-acquire. These impossibility results allow us to identify tight and almost tight bounds for some objects, as well as new separation results between weak memory models, and between well-studied objects based on their implementability on weak memory models.
This paper presents a formal framework, which is based on the notion of a serialization set, that enables to compose a set of consistency conditions into a more restrictive one. To exemplify the utility of this framework, a list of very basic consistency conditions is identified, and it is shown that various compositions of the basic conditions yield some of the most commonly used consistency conditions, such as sequential consistency, causal memory, and Pipelined RAM. The paper also lists several applications that can benefit from even weaker semantics than Pipelined RAM that can be expressed as a composition of a small subset of the basic conditions.
Abstract We present an improvement to the Disk Paxos protocol by Gafni and Lamport which utilizes extended functionality and flexibility provided by Active Disks and supports unmediated concurrent data access by an unlimited number of processes. The solution facilitates coordination by an infinite number of clients using finite shared memory. It is based on a collection of read-modify-write objects with faults, that emulate a new, reliable shared memory abstraction called a ranked register. The required read-modify-write objects are readily available in Active Disks and in Object Storage Device controllers, making our solution suitable for state-of-the-art Storage Area Network (SAN) environments.
In-memory object caches, such as memcached, are critical to the success of popular web sites, such as Facebook [3], by reducing database load and improving scalability [2]. The prominence of caches implies that configuring their ideal memory size has the potential for significant savings on computation resources and energy costs, but unfortunately cache configuration is poorly understood. The modern practice of manually tweaking live caching systems takes significant effort and may both increase the variance for client request latencies and impose high load on the database backend.
We consider the fault-tolerant consensus problem in radio networks with crash-prone nodes. Specifically, we develop lower bounds and matching upper bounds for this problem in single-hop radios networks, where all nodes are located within broadcast range of each other. In a novel break from existing work, we introduce a collision-prone communication model in which each node may lose an arbitrary subset of the messages sent by its neighbors during each round. This model is motivated by behavior observed in empirical studies of these networks. To cope with this communication unreliability we augment nodes with receiver-side collision detectors and present a new classification of these detectors in terms of accuracy and completeness. This classification is motivated by practical realities and allows us to determine, roughly speaking, how much collision detection capability is enough to solve the consensus problem efficiently in this setting. We consider nine different combinations of completeness and accuracy properties in total, determining for each whether consensus is solvable, and, if it is, a lower bound on the number of rounds required. Furthermore, we distinguish anonymous and non-anonymous protocols-where "anonymous" implies that devices do not have unique identifiers-determining what effect (if any) this extra information has on the complexity of the problem. In all relevant cases, we provide matching upper bounds.
This paper presents a new algorithm, RDS (Reconfigurable Distributed Storage), for implementing a reconfigurable distributed shared memory in an asynchronous dynamic network. The algorithm guarantees atomic consistency (linearizability) in all executions in the presence of arbitrary crash failures of processors and message loss and delays. the algorithm incorporates a quorum-based read/write algorithm and an optimized consensus protocol, based on Paxos. RDS achieves the design goals of: (i) allowing read and write operations to complete rapidly, and (ii) providing long-term fault tolerance through reconfiguration, a process that evolves the quorum configurations used by the read and write operations. The new algorithm improves on previously developed alternatives by using a more efficient reconfiguration protocol, thus guaranteeing better fault tolerance and faster recovery from network instability. This paper presents RDS, a formal proof of correctness, conditional performance analysis, and experimental results.
We consider the problem of implementing a wait-free regular register from storage components prone to Byzantine faults. We present a simple, efficient, and self-contained construction of such a register. Our construction utilizes a novel building block, called a 1-regular register, which can be efficiently implemented from Byzantine fault-prone components.
We study the inherent space requirements of reliable storage algorithms in asynchronous distributed systems. A number of recent works have used codes in order to achieve a better storage cost than the well-known replication approach. However, a closer look reveals that they incur extra costs in certain scenarios. Specifically, if multiple clients access the storage concurrently, then existing asynchronous code based algorithms may store a number of copies of the data that grows linearly with the number of concurrent clients. We prove here that this is inherent. Given three parameters, (1) the data size - D bits, (2) the concurrency level c, and (3) the number of storage node failures that need to be tolerated f, we show a lower bound of Omega(min(f, c) " D) bits on the space complexity of asynchronous distributed storage algorithms. Intuitively, this implies that the asymptotic storage cost is either as high as with replication, namely O(f D), or as high under concurrency as with the aforementioned code-based algorithms, i.e., O(cD). We further present a technique for combining erasure codes with replication so as to obtain the best of both. We present an adaptive f - tolerant storage algorithm whose storage cost is O (min( f, c) " D). Together, our results show that the space complexity of providing reliable storage in asynchronous distributed systems is Theta(min( f, c) . D).
We introduce Agilis—a lightweight collaborative event processing platform that can be deployed in a Semantic Room to facilitate sharing and correlating event data generated in real time by multiple widely distributed sources. Agilis aims to balance simplicity of use and robustness on the one hand, and scalable performance in large-scale settings on the other. To this end, Agilis is built upon the open source Hadoop’s MapReduce infrastructure augmented with a RAM-based data store and several locality-oriented optimizations to improve responsiveness and reduce overhead. The processing logic is specified in a flexible high-level language, called Jaql, which supports data flows and SQL-like query constructs. We demonstrate the versatility of the Agilis framework as well as its utility for collaborative attack detection by showing how it can be leveraged in the following two attack scenarios: stealthy inter-domain port scanning, and a botnet-driven HTTP session hijacking attack. We evaluate the performance of Agilis in both these scenarios and, in the case of inter-domain port scanning, compare it to Semantic Room, which deploys the centralized high-end event processing system called Esper. Our results show that while Agilis is slower than Esper in a local area network, its relative performance improves substantially as we move toward larger scale distributed deployments.
Large-scale in-memory object caches such as memcached are widely used to accelerate popular web sites and to reduce burden on backend databases. Yet current cache systems give cache operators limited information on what resources are required to optimally accommodate the present workload. This paper focuses on a key question for cache operators: how much total memory should be allocated to the in-memory cache tier to achieve desired performance? We present our Mimir system: a lightweight online profiler that hooks into the replacement policy of each cache server and produces graphs of the overall cache hit rate as a function of memory size. The profiler enables cache operators to dynamically project the cost and performance impact from adding or removing memory resources within a distributed in-memory cache, allowing "what-if" questions about cache performance to be answered without laborious offline tuning. Internally, Mimir uses a novel lock-free algorithm and lookup filters for quickly and dynamically estimating hit rate of LRU caches. Running Mimir as a profiler requires minimal changes to the cache server, thanks to a lean API. Our experiments show that Mimir produces dynamic hit rate curves with over 98% accuracy and 2--5% overhead on request latency and throughput when Mimir is run in tandem with memcached, suggesting online cache profiling can be a practical tool for improving provisioning of large caches.
Reconfigurable state machine replication is an important enabler of elasticity for replicated cloud services, which must be able to dynamically adjust their size as a function of changing load and resource availability. We introduce a new generic framework to allow the reconfigurable state machine implementation to be derived from a collection of arbitrary non-reconfigurable state machines. Our reduction framework follows the black box approach, and does not make any assumptions with respect to its execution environment apart from reliable channels. It allows higher-level services to leverage speculative command execution to ensure uninterrupted progress during the reconfiguration periods as well as in situations where failures prevent the reconfiguration agreement from being reached in a timely fashion. We apply our framework to obtain a reconfigurable speculative state machine from the non-reconfigurable Paxos implementation, and analyze its performance on a realistic distributed testbed. Our results show that our framework incurs negligible overheads in the absence of reconfiguration, and allows steady throughput to be maintained throughout the reconfiguration periods.
This paper presents a new algorithm for implementing a reconfigurable distributed shared memory in an asynchronous dynamic network. The algorithm guarantees atomic consistency (linearizability) in all executions in the presence of arbitrary crash failures of the processing nodes, message delays, and message loss. The algorithm incorporates a classic quorum-based algorithm for read/write operations, and an optimized consensus protocol, based on Fast Paxos for reconfiguration, and achieves the design goals of: (i) allowing read and write operations to complete rapidly and (ii) providing long-term fault-tolerance through reconfiguration, a process that evolves the quorum configurations used by the read and write operations. The resulting algorithm tolerates dynamism. We formally prove our algorithm to be correct, we present its performance and compare it to existing reconfigurable memories, and we evaluate experimentally the cost of its reconfiguration mechanism.
We present Byzantine Disk Paxos, an asynchronous shared-memory consensus algorithm that uses a collection of n < 3t disks, t of which may fail by becoming non-responsive or arbitrarily corrupted. We give two constructions of this algorithm; that is, we construct two different t-tolerant (i.e., tolerating up to t disk failures) building blocks, each of which can be used, along with a leader oracle, to solve consensus. One building block is a t-tolerant wait-free shared safe register. The second building block is a t-tolerant regular register that satisfies a weaker termination (liveness) condition than wait freedom: its write operations are wait-free, whereas its read operations are guaranteed to return only in executions with a finite number of writes. We call this termination condition finite writes (FW), and show that wait-free consensus is solvable with FW-terminating registers and a leader oracle. We construct each of these t-tolerant registers from n < 3t base registers, t of which can be non-responsive or Byzantine. All the previous t-tolerant wait-free constructions in this model used at least 4t + 1 fault-prone registers, and we are not familiar with any prior FW-terminating constructions in this model. We further show tight lower bounds on the number of invocation rounds required for optimal resilience reliable register constructions, or more generally, constructions that use less than 4t + 1 fault-prone registers. Our lower bounds show that such constructions are inherently more costly than constructions that use 4t + 1 registers, and that our constructions have optimal round complexity. Furthermore, our wait-free construction is early-stopping, and it achieves the optimal round complexity with any number of actual failures.
Reaching agreement among processes sharing read/write memory is possible only in the presence of an eventual unique leader. A leader that fails must be recoverable, but on the other hand, a live and well-performing leader should never be decrowned. This paper presents the first leader algorithm in shared memory environments that guarantees an eventual leader following global stabilization time. The construction is built using light-weight lease and renew primitives. The implementation is simple, yet efficient. It is uniform, in the sense that the number of potentially contending processes for leadership is not a priori known.
IP Multicast (IPMC) in data centers becomes disruptive when the technology is used by a large number of groups, a capability desired by event notification systems. We trace the problem to root causes, and introduce Dr. Multicast (MCMD), a system that eliminates the issue by mapping IPMC operations to a combination of point-to-point unicast and traditional IPMC transmissions guaranteed to be safe. MCMD optimizes the use of IPMC addresses within a data center by merging similar multicast groups in a principled fashion, while simultaneously respecting hardware limits expressed through administrator-controlled policies. The system is fully transparent, making it backward-compatible with commodity hardware and software found in modern data centers. Experimental evaluation shows that MCMD allows a large number of IPMC groups to be used without disruption, restoring a powerful group communication primitive to its traditional role.
We discuss the challenges of devising a useful shared data cache service as a part of the cloud platform. Outside the cloud, such a service appeals to developers for two main reasons. Most importantly, data caches reduce the response latency experienced by users. For example, rendering a content page with various personalized boxes as part of a user web session often involves numerous database lookups. Therefore, if the content generation involves cheap memory accesses to a data cache instead of actual database queries, the user experiences will improve. Moreover, data caches are simple to use: they normally expose a simple get/set interface akin to key/value stores and a rudimentary mechanism to expire values [2], thus allowing result-based caching to be seamlessly integrated with existing database-driven code.
We present the design and early experience with a completely new implementation of the Bulletin Board, a topicbased distributed shared memory service employed by commercial-grade application middleware, to achieve robustness and administrative simplicity with adequate latency and costs at the required throughput and scale. To facilitate scalability, only weak consistency is provided. For robustness and ease of use, the implementation is designed in a fully peer-to-peer fashion leveraging the weakly consistent group communication services provided by a semi-structured overlay network. We discuss issues in providing good (while not perfect) stability and reliability at tolerable cost. We address scalability issues, such as supporting large numbers of processes, large subscription spaces, and complex interest patterns. We also consider comprehensive API instrumentation.
We present the architecture of a middleware platform, called Collaborative Middleware for Monitoring Financial Critical Infrastructure (CoMiFin), that facilitates collaborative protection of the financial critical infrastructure (CI). At the core of CoMiFin is a new abstraction of Semantic Room (SR), allowing the interested participants to share information and combine their computing powers to collectively resist massive scale attacks against their IT and business assets. We describe a full stack of software components aiming at realizing SR in a distributed setting. At the lowest level, the SR functionality relies on a customizable event processing platform, which can be supported through a variety of event processing and analytics containers. The containers abstract away the intricacies of the distributed environment, and allow the application developers to focus on implementing the processing logic at hand. The higher level aspects of the SR abstraction are supported by the SR Management layer, which includes components to control the SR lifecycle and deployment, inter-SR connectivity, and the contract compliance monitoring. The proposed architecture is modular and flexible, allowing the developers to easily create and customize the processing logic according to the SR business goals, plug in different types of processing platforms, and deploy the implementation in a variety of realistic settings.
We present a comprehensive methodology for proving correctness of concurrent data structures. We exemplify our methodology by using it to give a roadmap for proving linearizability of the popular Lazy List implementation of the concurrent set abstraction. Correctness is based on our key theorem, which captures sufficient conditions for linearizability. In contrast to prior work, our conditions are derived directly from the properties of the data structure in sequential runs, without requiring the linearization points to be explicitly identified.
Atomicity (or linearizability) is a commonly used consistency criterion for distributed services and objects. Although atomic object implementations are abundant, proving that algorithms achieve atomicity has turned out to be a challenging problem. In this paper, we initiate the study of systematic ways of verifying distributed implementations of atomic objects, beginning with read/write objects (registers). Our general approach is to replace the existing operational reasoning about events and partial orders with assertional reasoning about invariants and simulation relations. To this end, we define an abstract state machine that captures the atomicity property and prove correctness of the object implementations by establishing a simulation mapping between the implementation and the specification automata. We demonstrate the generality of our specification by showing that it is implemented by three different read/write register constructions: the message-passing register emulation of Attiya, Bar-Noy and Dolev, its optimized version based on real time, and the shared memory register construction of Vitanyi and Awerbuch. In addition, we show that a simplified version of our specification is implemented by a general atomic object construction based on the Lamport’s replicated state machine algorithm.
We investigate the problem of designing a scalable overlay network to support decentralized topic-based pub/sub communication. We introduce a new optimization problem, called Minimum Topic-Connected Overlay (Min-TCO), that captures the tradeoff between the scalability of the overlay (in terms of the nodes' fanout) and the message forwarding overhead incurred by the communicating parties. Roughly, the Min-TCO problem is as follows: Given a collection of nodes and their subscriptions, connect the nodes using the minimum possible number of edges so that for each topic t, a message published on t could reach all the nodes interested in t by being forwarded by onlythe nodes interested in t. We show that the decision version of Min-TCO is NP-complete, and present a polynomial algorithm that approximates the optimal solution within a logarithmic factor with respect to the number of edges in theconstructed overlay. We further prove that this approximation ratio is almost tight by showing that no polynomial algorithm can approximate Min-TCO within a constant factor (unless P=NP). We show experimentally that on typical inputs, the fanout of the overlay constructed by our approximation algorithm is significantly lower thanthat of the overlays built by the existing algorithms, and that its running time is just a small fraction of the analytical worst case bound. As Min-TCO can be shown to capture several important aspects of most known overlay-based pub/sub implementations, our study sheds light on the inherent limitations of the existing systems as well asprovides an insight into the best possible feasible solution. Finally, we introduce a flexible framework that generalizes Min-TCO and formalizes most similar overlay design problems that occur in scalable pub/sub systems. We also briefly discuss several examples of such problems, and show some results with respect to their complexity.
The Critical Infrastructure Protection Survey recently released by Symantec found that 53% of interviewed IT security experts from international companies experienced at least ten cyber attacks in the last five years, and financial institutions were often subject to some of the most sophisticated and large-scale cyber attacks and frauds. The book by Baldoni and Chockler analyzes the structure of software infrastructures found in the financial domain, their vulnerabilities to cyber attacks and the existing protection mechanisms. It then shows the advantages of sharing information among financial players in order to detect and quickly react to cyber attacks. Various aspects associated with information sharing are investigated from the organizational, cultural and legislative perspectives. The presentation is organized in two parts: Part I explores general issues associated with information sharing in the financial sector and is intended to set the stage for the vertical IT middleware solution proposed in Part II. Nonetheless, it is self-contained and details a survey of various types of critical infrastructure along with their vulnerability analysis, which has not yet appeared in a textbook-style publication elsewhere. Part II then presents the CoMiFin middleware for collaborative protection of the financial infrastructure.The material is presented in an accessible style and does not require specific prerequisites. It appeals to both researchers in the areas of security, distributed systems, and event processing working on new protection mechanisms, and practitioners looking for a state-of-the-art middleware technology to enhance the security of their critical infrastructures in e.g. banking, military, and other highly sensitive applications. The latter group will especially appreciate the concrete usage scenarios included
We introduce SpiderCast, a distributed protocol for constructing scalable churn-resistant overlay topologies for supporting decentralized topic-based pub/sub communication. SpiderCast is designed to effectively tread the balance between average overlay degree and communication cost of event dissemination. It employs a novel coverage-optimizing heuristic in which the nodes utilize partial subscription views (provided by a decentralized membership service) to reduce the average node degree while guaranteeing (with high probability) that the events posted on each topic can be routed solely through the nodes interested in this topic (in other words, the overlay is topic-connected). SpiderCast is unique in maintaining an overlay topology that scales well with the average number of topics a node is subscribed to, assuming the subscriptions are correlated insofar as found in most typical workloads. Furthermore, the degree grows logarithmically in the total number of topics, and slowly decreases as the number of nodes increases. We show experimentally that, for many practical work-loads, the SpiderCast overlays are both topic-connected and have a low per-topic diameter while requiring each node to maintain a low average number of connections. These properties are satisfied even in very large settings involving up to 10,000 nodes, 1,000 topics, and 70 subscriptions per-node, and under high churn rates. In addition, our results demonstrate that, in a large setting, the average node degree in SpiderCast is at least 45% smaller than in other overlays typically used to support decentralized pub/sub communication (such as e.g., similarity-based, rings-based, and random overlays).
A distributed storage service lets clients abstract a single reliable shared storage device using a collection of possibly unreliable computing units. Algorithms that implement this abstraction offer certain tradeoffs, and vary according to dimensions such as complexity, the consistency semantics provided, and the types of failures tolerated.
We study the inherent space requirements of shared storage algorithms in asynchronous fault-prone systems. Previous works use codes to achieve a better storage cost than the well-known replication approach. However, a closer look reveals that they incur extra costs somewhere else: Some use unbounded storage in communication links, while others assume bounded concurrency or synchronous periods. We prove here that this is inherent, and indeed, if there is no bound on the concurrency level, then the storage cost of any reliable storage algorithm is at least f+1 times the data size, where f is the number of tolerated failures. We further present a technique for combining erasure-codes with full replication so as to obtain the best of both. We present a storage algorithm whose storage cost is close to the lower bound in the worst case, and adapts to the concurrency level.
Presents the introductory welcome message from the conference proceedings. May include the conference officers' congratulations to all involved with the conference event and publication of the proceedings record.
An effective means for building Internet-scale distributed applications, and in particular those involving group-based information sharing, is to deploy peer-to-peer overlay networks. The key pre-requisite for supporting these types of applications on top of the overlays is efficient distribution of messages to multiple subscribers dispersed across numerous multicast groups. In this paper, we introduce Magnet: a peer-to-peer publish/subscribe system which achieves efficient message distribution by dynamically organizing peers with similar subscriptions into dissemination structures which preserve locality in the subscription space. Magnet is able to significantly reduce the message propagation costs by taking advantage of subscription correlations present in many large-scale group-based applications. We evaluate Magnet by comparing its performance against a strawman pub/sub system which does not cluster similar subscriptions by simulation. We find that Magnet outperforms the strawman by a substantial margin on clustered subscription workloads produced using both generative models and real application traces.
We consider the fault-tolerant consensus problem in wireless ad hoc networks with crash-prone nodes. We develop consensus algorithms for single-hop environments where the nodes are located within broadcast range of each other. Our algorithms tolerate highly unpredictable wireless communication, in which messages may be lost due to collisions, electromagnetic interference, or other anomalies. Accordingly, each node may receive a different set of messages in the same round. In order to minimize collisions, we design adaptive algorithms that attempt to minimize the broadcast contention. To cope with unreliable communication, we augment the nodes with collision detectors and present a new classification of collision detectors in terms of accuracy and completeness, based on practical realities. We show exactly in which cases consensus can be solved, and thus determine the requirements for a useful collision detector.We validate the feasibility of our algorithms, and the underlying wireless model, with simulations based on a realistic 802.11 MAC layer implementation and a detailed radio propagation model. We analyze the performance of our algorithms under varying sizes and densities of deployment and varying MAC layer parameters. We use our single-hop consensus algorithms as the basis for solving consensus in a multi-hop network, demonstrating the resilience of our algorithms to a challenging and noisy environment.
Recent evidence of successful Internet-based attacks and frauds involving financial institutions highlights the inadequacy of the existing protection mechanisms, in which each instutition implements its own isolated monitoring and reaction strategy. Analyzing on-line activity and detecting attacks on a large scale is an open issue due to the huge amounts of events that should be collected and processed. In this paper, we propose a large-scale distributed event processing system, called intelligence cloud, allowing the financial entities to participate in a widely distributed monitoring and detection effort through the exchange and processing of information locally available at each participating site. We expect this approach to be able to handle large amounts of events arriving at high rates from multiple domains of the financial scenario. We describe a framework based on the intelligence cloud where each participant can receive early alerts enabling them to deploy proactive countermeasures and mitigation strategies.
Communication-efficiency is of key importance when constructing robust services in limited bandwidth environments, such as sensor networks. We focus on communication-efficiency in the context of quorum systems, which are useful primitives for building reliable distributed systems. To this end, we exhibit a new probabilistic quorum construction in which every node transmits at most O(log(2) n) bits per quorum access, where n is the number of nodes in the system. Our implementation, in addition to being communication efficient, is also robust in the face of communication failures. In particular it guarantees consistency (with high probability) in the face of network, partitions. To the best of our knowledge, no existing probabilistic quorum systems achieve polylogarithmic communication complexity and are resilient to network partitions.
Byzantine state-machine replication (SMR) ensures the consistency of replicated state in the presence of malicious replicas and lies at the heart of the modern blockchain technology. Byzantine SMR protocols often guarantee safety under all circumstances and liveness only under synchrony. However, guaranteeing liveness even under this assumption is nontrivial. So far we have lacked systematic ways of incorporating liveness mechanisms into Byzantine SMR protocols, which often led to subtle bugs. To close this gap, we introduce a modular framework to facilitate the design of provably live and efficient Byzantine SMR protocols. Our framework relies on a view abstraction generated by a special SMR synchronizer primitive to drive the agreement on command ordering. We present a simple formal specification of an SMR synchronizer and its bounded-space implementation under partial synchrony. We also apply our specification to prove liveness and analyze the latency of three Byzantine SMR protocols via a uniform methodology. In particular, one of these results yields what we believe is the first rigorous liveness proof for the algo-rithmic core of the seminal PBFT protocol.
Atomic Commit Problem (ACP) is a single-shot agreement problem similar to consensus, meant to model the properties of transaction commit protocols in fault-prone distributed systems. We argue that ACP is too restrictive to capture the complexities of modern transactional data stores, where commit protocols are integrated with concurrency control, and their executions for different transactions are interdependent. As an alternative, we introduce Transaction Certification Service (TCS), a new formal problem that captures safety guarantees of multi-shot transaction commit protocols with integrated concurrency control. TCS is parameterized by a certification function that can be instantiated to support common isolation levels, such as serializability and snapshot isolation. We then derive a provably correct crash-resilient protocol for implementing TCS through successive refinement. Our protocol achieves a better time complexity than mainstream approaches that layer two-phase commit on top of Paxos-style replication. 2012 ACM Subject Classification Theory of computation → Distributed computing models
Atomic multicast is a communication primitive that delivers messages to multiple groups of processes according to some total order, with each group receiving the projection of the total order onto messages addressed to it. To be scalable, atomic multicast needs to be genuine, meaning that only the destination processes of a message should participate in ordering it. In this paper we propose a novel genuine atomic multicast protocol that in the absence of failures takes as low as 3 message delays to deliver a message when no other messages are multicast concurrently to its destination groups, and 5 message delays in the presence of concurrency. This improves the latencies of both the fault-tolerant version of classical Skeen's multicast protocol (6 or 12 message delays, depending on concurrency) and its recent improvement by Coelho et al. (4 or 8 message delays). To achieve such low latencies, we depart from the typical way of guaranteeing fault-tolerance by replicating each group with Paxos. Instead, we weave Paxos and Skeen's protocol together into a single coherent protocol, exploiting opportunities for white-box optimisations. We experimentally demonstrate that the superior theoretical characteristics of our protocol are reflected in practical performance pay-offs.
Partially synchronous Byzantine consensus protocols typically structure their execution into a sequence of views, each with a designated leader process. The key to guaranteeing liveness in these protocols is to ensure that all correct processes eventually overlap in a view with a correct leader for long enough to reach a decision. We propose a simple view synchronizer abstraction that encapsulates the corresponding functionality for Byzantine consensus protocols, thus simplifying their design. We present a formal specification of a view synchronizer and its implementation under partial synchrony, which runs in bounded space despite tolerating message loss during asynchronous periods. We show that our synchronizer specification is strong enough to guarantee liveness for single-shot versions of several well-known Byzantine consensus protocols, including HotStuff, Tendermint, PBFT and SBFT. We furthermore give precise latency bounds for these protocols when using our synchronizer. By factoring out the functionality of view synchronization we are able to specify and analyze the protocols in a uniform framework, which allows comparing them and highlights trade-offs. 2012 ACM Subject Classification Theory of computation → Distributed computing models
The advent of byte-addressable non-volatile memory (NVM) technologies has enabled the development of low-latency high-throughput durable applications, i.e., applications that are capable of recovering from full-system crashes. However, programming such applications is error-prone as efficiency gains often require fine-grained (programmer-controlled) management of low-level persistence instructions. We propose Mangosteen, a high-level programming framework that allows developers to transform an existing lineariz-able in-memory application to a corresponding durably lin-earizable version using NVM. Our framework's API consists of a set of callback hooks that interpose on an application's request processing flow with minimal developer effort. Man-gosteen executes client operations on DRAM and persists their effects using binary instrumentation and redo logging. Mangosteen's concurrency control facilitates batching of read-write requests to minimize the cost of persistence, while allowing read-only requests to execute concurrently. A novel intra-batch deduplication mechanism further reduces persistence overheads for common OLTP workloads. Our empirical evaluation results show that Mangosteen-enabled applications outperform state-of-the-art solutions across the entire spectrum of read-write ratios. In particular, the Mangosteen-based version of Redis demonstrates throughput gains of between 2×–5× in comparison to prior work.
Atomic broadcast is a reliable communication abstraction ensuring that all processes deliver the same set of messages in a common global order. It is a fundamental building block for implementing fault-tolerant services using either active (aka state-machine) or passive (aka primary-backup) replication. We consider the problem of implementing reconfigurable atomic broadcast, which further allows users to dynamically alter the set of participating processes, e.g., in response to failures or changes in the load. We give a complete safety and liveness specification of this communication abstraction and propose a new protocol implementing it, called Vertical Atomic Broadcast, which uses an auxiliary service to facilitate reconfiguration. In contrast to prior proposals, our protocol significantly reduces system downtime when reconfiguring from a functional configuration by allowing it to continue processing messages while agreement on the next configuration is in progress. Furthermore, we show that this advantage can be maintained even when our protocol is modified to support a stronger variant of atomic broadcast required for passive replication. 2012 ACM Subject Classification Theory of computation → Distributed computing models
We present a general methodology for establishing the impossibility of implementing certain concurrent objects on different (weak) memory models. The key idea behind our approach lies in characterizing memory models by their mergeability properties, identifying restrictions under which independent memory traces can be merged into a single valid memory trace. In turn, we show that the mergeability properties of the underlying memory model entail similar mergeability requirements on the specifications of objects that can be implemented on that memory model. We demonstrate the applicability of our approach to establish the impossibility of implementing standard distributed objects with different restrictions on memory traces on three memory models: strictly consistent memory, total store order, and release-acquire. These impossibility results allow us to identify tight and almost tight bounds for some objects, as well as new separation results between weak memory models, and between well-studied objects based on their implementability on weak memory models.
We study implementations of basic fault-tolerant primitives, such as consensus and registers, in message-passing systems subject to process crashes and a broad range of communication failures. Our results characterize the necessary and sufficient conditions for implementing these primitives as a function of the connectivity constraints and synchrony assumptions. Our main contribution is a new algorithm for partially synchronous consensus that is resilient to process crashes and channel failures and is optimal in its connectivity requirements. In contrast to prior work, our algorithm assumes the most general model of message loss where faulty channels are flaky, i.e., can lose messages without any guarantee of fairness. This failure model is particularly challenging for consensus algorithms, as it rules out standard solutions based on leader oracles and failure detectors. To circumvent this limitation, we construct our solution using a new variant of the recently proposed view synchronizer abstraction, which we adapt to the crash-prone setting with flaky channels. 2012 ACM Subject Classification Theory of computation → Distributed computing models
We study implementations of basic fault-tolerant primitives, such as consensus and registers, in message-passing systems subject to process crashes and a broad range of communication failures. Our results characterize the necessary and sufficient conditions for implementing these primitives as a function of the connectivity constraints and synchrony assumptions. Our main contribution is a new algorithm for partially synchronous consensus that is resilient to process crashes and channel failures and is optimal in its connectivity requirements. In contrast to prior work, our algorithm assumes the most general model of message loss where faulty channels are flaky, i.e., can lose messages without any guarantee of fairness. This failure model is particularly challenging for consensus algorithms, as it rules out standard solutions based on leader oracles and failure detectors. To circumvent this limitation, we construct our solution using a new variant of the recently proposed view synchronizer abstraction, which we adapt to the crash-prone setting with flaky channels.
Byzantine state-machine replication (SMR) ensures the consistency of replicated state in the presence of malicious replicas and lies at the heart of the modern blockchain technology. Byzantine SMR protocols often guarantee safety under all circumstances and liveness only under synchrony. However, guaranteeing liveness even under this assumption is nontrivial. So far we have lacked systematic ways of incorporating liveness mechanisms into Byzantine SMR protocols, which often led to subtle bugs. To close this gap, we introduce a modular framework to facilitate the design of provably live and efficient Byzantine SMR protocols. Our framework relies on a view abstraction generated by a special SMR synchronizer primitive to drive the agreement on command ordering. We present a simple formal specification of an SMR synchronizer and its bounded-space implementation under partial synchrony. We also apply our specification to prove liveness and analyze the latency of three Byzantine SMR protocols via a uniform methodology. In particular, one of these results yields what we believe is the first rigorous liveness proof for the algorithmic core of the seminal PBFT protocol.
Atomic broadcast protocols ensure that messages are delivered to a group of machines in some total order, even when some of these machines can fail. These protocols are key to making distributed services fault-tolerant, as their total order guarantee allows keeping multiple service replicas in sync. But, unfortunately, atomic broadcast protocols are also notoriously expensive. We present a new protocol, called Acuerdo, that improves atomic broadcast performance by using remote direct memory addressing (RDMA). Acuerdo is built from the ground up to perform communication using one-side RDMA writes, which do not use the CPU of the remote machine, and is explicitly designed to minimize waiting on the critical path. Our experimental results demonstrate that Acuerdo provides raw throughput comparable to or exceeding other RDMA atomic broadcast protocols, while improving latency by almost 2𝑥.
Partially synchronous models are often assumed for designing distributed protocols because they capture realistic timing assumptions, such as the asynchronous and synchronous periods that the system can experience. In some of these models, protocols need to estimate network delays. Some protocols fix the global message delay bound for all executions, which leads to sub-optimal solutions in terms of latency, because this bound must be chosen conservatively. And other protocols employ delay estimation mechanisms that only give an upper bound on the delay without quantifying the estimation error. The performance of these protocols depends on how close their estimations are in relation to the actual network delay. For instance, some Byzantine consensus protocols use timeouts based on this estimation. We formalize this problem as the Global Delay Bound Estimation (GDBE) and address it by introducing a distributed oracle that enriches partial synchronous models. This oracle produces estimates of the channel delays that allow processes to derive an efficient global bounded estimate. Oracles and global bounded estimates, provide a framework that facilitates the design of protocols for partially synchronous models and the analysis of their time complexity. We formalize the properties of the oracle and the proposed framework and show that it can be implemented in the presence of crash failures. In contrast, we prove that GDBE cannot be solved in the Byzantine failure model, and show how to circumvent this impossibility using an extra assumption. Finally, we show how to use our framework to implement a view synchronizer thus obtaining an efficient solution for Byzantine consensus.
Byzantine state-machine replication (SMR) ensures the consistency of replicated state in the presence of malicious replicas and lies at the heart of the modern blockchain technology. Byzantine SMR protocols often guarantee safety under all circumstances and liveness only under synchrony. However, guaranteeing liveness even under this assumption is nontrivial. So far we have lacked systematic ways of incorporating liveness mechanisms into Byzantine SMR protocols, which often led to subtle bugs. To close this gap, we introduce a modular framework to facilitate the design of provably live and efficient Byzantine SMR protocols. Our framework relies on a view abstraction generated by a special SMR synchronizer primitive to drive the agreement on command ordering. We present a simple formal specification of an SMR synchronizer and its bounded-space implementation under partial synchrony. We also apply our specification to prove liveness and analyze the latency of three Byzantine SMR protocols via a uniform methodology. In particular, one of these results yields what we believe is the first rigorous liveness proof for the algorithmic core of the seminal PBFT protocol.
Partially synchronous Byzantine consensus protocols typically structure their execution into a sequence of views, each with a designated leader process. The key to guaranteeing liveness in these protocols is to ensure that all correct processes eventually overlap in a view with a correct leader for long enough to reach a decision. We propose a simple view synchronizer abstraction that encapsulates the corresponding functionality for Byzantine consensus protocols, thus simplifying their design. We present a formal specification of a view synchronizer and its implementation under partial synchrony, which runs in bounded space despite tolerating message loss during asynchronous periods. We show that our synchronizer specification is strong enough to guarantee liveness for single-shot versions of several well-known Byzantine consensus protocols, including HotStuff, Tendermint, PBFT and SBFT. We furthermore give precise latency bounds for these protocols when using our synchronizer. By factoring out the functionality of view synchronization we are able to specify and analyze the protocols in a uniform framework, which allows comparing them and highlights trade-offs.
Partially synchronous Byzantine consensus protocols typically structure their execution into a sequence of views, each with a designated leader process. The key to guaranteeing liveness in these protocols is to ensure that all correct processes eventually overlap in a view with a correct leader for long enough to reach a decision. We propose a simple view synchronizer abstraction that encapsulates the corresponding functionality for Byzantine consensus protocols, thus simplifying their design. We present a formal specification of a view synchronizer and its implementation under partial synchrony, which runs in bounded space despite tolerating message loss during asynchronous periods. We show that our synchronizer specification is strong enough to guarantee liveness for single-shot versions of several well-known Byzantine consensus protocols, including PBFT and HotStuff. We furthermore give precise latency bounds for these protocols when using our synchronizer. By factoring out the functionality of view synchronization we are able to specify and analyze the protocols in a uniform framework, which allows comparing them and highlights trade-offs.
Atomic Commit Problem (ACP) is a single-shot agreement problem similar to consensus, meant to model the properties of transaction commit protocols in fault-prone distributed systems. We argue that ACP is too restrictive to capture the complexities of modern transactional data stores, where commit protocols are integrated with concurrency control, and their executions for different transactions are interdependent. As an alternative, we introduce Transaction Certification Service (TCS), a new formal problem that captures safety guarantees of multi-shot transaction commit protocols with integrated concurrency control. TCS is parameterized by a certification function that can be instantiated to support common isolation levels, such as serializability and snapshot isolation. We then derive a provably correct crash-resilient protocol for implementing TCS through successive refinement. Our protocol achieves a better time complexity than mainstream approaches that layer two-phase commit on top of Paxos-style replication.