Foundational distributed systems papers
I talked about the importance of reading foundational papers last week. To followup, here is my compilation of foundational papers in the distributed systems area. (I focused on the core distributed systems area, and did not cover networking, security, distributed ledgers, verification work etc. I even left out distributed transactions, I hope to cover them at a later date.)
I classified the papers by subject, and listed them in chronological order. I also listed expository papers and blog posts at the end of each section.
Time and State in Distributed Systems
Time, Clocks, and the Ordering of Events in a Distributed System. Leslie Lamport, Commn. of the ACM, 1978.
Distributed Snapshots: Determining Global States of a Distributed System. K. Mani Chandy Leslie Lamport, ACM Transactions on Computer Systems, 1985.
Virtual Time and Global States of Distributed Systems. Mattern, F. 1988.
Expository papers and blog posts
There is No Now. Justin Sheehy, ACM Queue 2015
Why Logical Clocks are Easy. Carlos Baquero and Nuno Preguiça, ACM Queue 2016.
Logical clocks and Vector clocks modeling in TLA+/PlusCal
Impossibility results
Coordinated Attack or Two Generals problem is a fundamental impossibility in distributed systems. It started more of folks theorem, so instead of pointing to a paper, I provide a link to the wikipedia page.
Impossibility of Distributed Consensus with One Faulty Process, Fischer, Lynch and Patterson, JACM, 1985
Unreliable Failure Detectors for Reliable Distributed Systems, Tushar Deepak Chandra and Sam Toueg, Journal of the ACM, 1996.
Harvest, Yield, and Scalable Tolerant Systems, Armando Fox, Eric A. Brewer, 1999
CAP Twelve years later: How the rules have changed, Eric Brewer, 2012.
Expository papers and blog posts
A Brief Tour of FLP Impossibility
Paper summary: Perspectives on the CAP theorem
Consensus and state machine replication
Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems. B. Oki and B. Liskov. SOSP 1988
Implementing Fault-Tolerant Services Using the State Machine Approach: a Tutorial, Fred Schneider, 1990
How to build a highly available system with consensus, Butler Lampson, 1996
The Part-Time Parliament (Paxos) Leslie Lamport, 1998. (See context)
Practical Byzantine Fault Tolerance. Miguel Castro, Barbara Liskov. OSDI 1999.
Chain Replication for Supporting High Throughput and Availability. Robbert van Renesse and Fred B. Schneider, OSDI 2004.
ZooKeeper: Wait-free coordination for Internet-scale systems. Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, Benjamin Reed, Usenix ATC 2010.
Tango: Distributed Data Structures over a Shared Log, Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, Aviad Zuckk. SOSP 2013.
There is more consensus in Egalitarian parliaments. Iulian Moraru, David G. Andersen, Michael Kaminsky, SOSP 2013.
Flexible Paxos: Quorum intersection revisited. Heidi Howard, Dahlia Malkhi, Alexander Spiegelman. 2016.
WormSpace: A modular foundation for simple, verifiable distributed systems. Ji-Yong Shin, Jieung Kim, Wolf Honore, Hernán Vanzetto, Srihari Radhakrishnan, Mahesh Balakrishnan, Zhong Shao, SOCC'19.
Expository papers and blog posts
In Search of an Understandable Consensus Algorithm. Diego Ongaro, John Ousterhout, Usenix ATC, 2014.
Paxos Made Moderately Complex. Robbert Van Renesse and Deniz Altinbuken, ACM Computing Surveys, 2015.
Consensus in the Cloud: Paxos Systems Demystified. Ailidani Ailijiang, Aleksey Charapko, Murat Demirbas, 2016.
Modeling Paxos and Flexible Paxos in Pluscal and TLA+
Dissecting performance bottlenecks of Paxos protocols.
Distributed algorithms
Self-stabilizing systems in spite of distributed control, Edsgar W. Dijkstra, CACM 1974.
The Drinking Philosophers Problem, K. M. Chandy, J. Misra, ACM TOPLAS 1984
Sparse partitions, Baruch Awerbuch, David Peleg, FOCS 1990.
Distributed reset, Anish Arora, Mohamed Gouda, 1994
The Arrow Distributed Directory Protocol, Michael J. Demmer, M. Herlihy, DISC 1998.
Expository papers and blog posts
Dijkstra's stabilizing token ring algorithm
Modeling the hygienic dining philosophers algorithm in TLA+
Miscellaneous
Hints for computer system design, Butler Lampson, 1983
The role of distributed state, John Ousterhout, 1990
SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. Matt Welsh, David Culler, and Eric Brewer, SOSP 2001
Crash only software, George Candea, Armando Fox, HotOS 2003
Expository papers and blog posts
Learning about distributed systems: where to start?
Cloud computing, big data storage/processing
Lessons from Giant-Scale Services. Eric A. Brewer, IEEE Internet Computing, 2001.
MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, OSDI 2004.
Optimistic Replication, Yasushi Saito and Marc Shapiro, 2005.
Dynamo: Amazon’s Highly Available Key-Value Store. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels, ACM SIGOPS 2007.
On designing and deploying Internet scale services, James Hamilton, LISA 2007
Life beyond Distributed Transactions: an Apostate's Opinion, Pat Helland, CIDR 2007.
Conflict-free Replicated Data Types. Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski, 2011.
Consistency Analysis in Bloom: a CALM and Collected Approach, Peter Alvaro, Neil Conway, Joseph M. Hellerstein, William R. Marczak, CIDR 2011.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012.
Tail at scale. Jeff Dean, Luiz Andre Barroso, Commn of the ACM, 2013.
Spanner: Google’s Globally Distributed Database, ACM, 2013.
TensorFlow: A System for Large-Scale Machine Learning, OSDI 2016.
Expository papers
Cloud Programming Simplified: A Berkeley View on Serverless Computing, 2019.
Comments
Post a Comment