Readings in Databases
A list of papers essential to understanding databases and building new data systems. The list is curated and maintained by Reynold Xin (@rxin). If you think a paper should be part of this list, please submit a pull request. It might take a while since I need to go over the paper.
If you are reading this and taking the effort to understand these papers, we would love to talk to you about opportunities at Databricks.
Table of Contents
- Basics and Algorithms
- Essentials of Relational Databases
- Classic System Design
- Columnar Databases
- Data-parallel computation
- Consensus and consistency
- Trends (Cloud Computing, Warehouse-scale Computing, New Hardware)
- Miscellaneous
- External Reading Lists
Basics and Algorithms
The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb (1997): This paper (and the original one proposed ten years earlier) illustrates a quantitative formula to calculate whether a data page should be cached in memory or not. It is a delightful read of Jim Gray's approach to an array of related problems, e.g. how big a page size should be.
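The break-even rule above can be sketched as a one-line formula. This is a minimal illustration; the hardware prices and access rates below are made-up placeholders, not figures from the paper.

```python
# Gray's break-even interval: a page is worth keeping in memory if it is
# re-accessed at least once every `interval` seconds. Below that interval,
# RAM is cheaper than the disk arm time; above it, disk wins.

def break_even_interval_seconds(pages_per_mb_ram: float,
                                accesses_per_sec_per_disk: float,
                                price_per_disk: float,
                                price_per_mb_ram: float) -> float:
    """(PagesPerMBofRAM / AccessesPerSecPerDisk) * (PricePerDisk / PricePerMBofRAM)"""
    return (pages_per_mb_ram / accesses_per_sec_per_disk) * \
           (price_per_disk / price_per_mb_ram)

# Illustrative 1997-era-style numbers (assumptions, not from the paper):
# 128 pages per MB (8 KB pages), 64 random I/Os per second per disk,
# $2000 per disk drive, $15 per MB of RAM.
interval = break_even_interval_seconds(128, 64, 2000, 15)
print(f"cache pages re-used within ~{interval / 60:.1f} minutes")
```

Plugging in different price ratios is exactly how the paper re-derives the "five minutes" number a decade later.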
AlphaSort: A Cache-Sensitive Parallel External Sort (1995): Sorting is one of the most essential algorithms in databases, as it is used in joins, aggregations, and sorts. In an Algorithms 101 class, CS students are asked to reason about big-O complexity and ignore the constant factor. In practice, however, the change in the constant from L2 cache effects can be as big as two or three orders of magnitude. This is a good paper to learn about all the tricks fast sorting implementations use.
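One of those tricks, sketched in Python with invented records: instead of sorting whole records (which drags every byte through the cache), sort small (key-prefix, index) pairs and permute the records afterwards. Python can only illustrate the shape of the idea, not the cache behavior the paper measures.

```python
# Key-prefix sorting sketch: only the compact entries move during the sort.
# Real implementations pack these into cache-line-sized structs and fall back
# to the full key when two prefixes tie; both details are elided here.

records = [("delta", "payload-3"), ("alpha", "payload-1"),
           ("charlie", "payload-2"), ("bravo", "payload-0")]

# Compact sort entries: a fixed-size key prefix plus the record's index.
entries = [(key[:4], i) for i, (key, _) in enumerate(records)]
entries.sort()

# Permute the full records once, after the sort is done.
sorted_records = [records[i] for _, i in entries]
print([key for key, _ in sorted_records])
```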
Patience is a Virtue: Revisiting Merge and Sort on Modern Processors: Sorting revisited. Also a good survey of sorting algorithms used in practice and their trade-offs.
Essentials of Relational Databases
Architecture of a Database System: Joe Hellerstein's great overview of relational database systems. This essay walks readers through all components essential to relational database systems.
A Relational Model of Data for Large Shared Data Banks (1970): Codd's argument for data independence, a fundamental concept in relational databases. Despite the current NoSQL trend, I believe ideas from this paper are becoming increasingly important in massively parallel data systems.
ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging (1992): The first algorithm that actually works: it supports concurrent execution of transactions without losing data even in the presence of failures. This paper is very hard to read because it mixes a lot of low-level details into the explanation of the high-level algorithm. Perhaps try understanding ARIES (log recovery) from a database textbook before attempting to read this paper.
Efficient Locking for Concurrent Operations on B-Trees (1981) and The R*-tree: An Efficient and Robust Access Method for Points and Rectangles (1990): The B-tree is a core data structure in databases (not just relational). It is optimized to have a low read amplification factor for random lookups of on-disk data. The R-tree is an extension of the B-tree to support lookups of multi-dimensional data, e.g. geodata.
Improved Query Performance with Variant Indexes (1997): Analytical databases and OLTP databases require different trade-offs. These are reflected in the choices of indexing data structures. This paper talks about a number of index data structures more suitable for analytical databases.
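One of the analytical index styles the paper covers is the bitmap index: one bit-vector per distinct value, so predicates become bitwise operations. A minimal sketch, with an invented column:

```python
# Bitmap index sketch: a Python int serves as the bit-vector for each value.
# Real systems compress these bitmaps (e.g. run-length encoding), which is
# omitted here.

region = ["EU", "US", "EU", "ASIA", "US", "EU"]

# Build one bitmap per distinct value; bit i is set if row i has that value.
bitmaps = {}
for row, value in enumerate(region):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << row)

# WHERE region = 'EU' OR region = 'US'  ->  a single bitwise OR
matches = bitmaps["EU"] | bitmaps["US"]
rows = [row for row in range(len(region)) if matches >> row & 1]
print(rows)  # every row except the ASIA one
```

Multi-predicate queries combine bitmaps with AND/OR before touching any row data, which is why these indexes suit scan-heavy analytical workloads.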
On Optimistic Methods for Concurrency Control (1981): There are two ways to support concurrency. The first is pessimistic, i.e. to lock shared data preemptively. This paper explains an alternative to locking called optimistic concurrency control. Optimistic approaches assume conflicts are rare and execute transactions without acquiring locks. Before committing a transaction, the database system checks for conflicts and aborts/restarts transactions if conflicts arise.
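The read/validate/write pattern can be sketched in a few lines. This is a toy validation scheme using per-record version counters; the class and method names are illustrative, not taken from the paper.

```python
# Toy optimistic concurrency control: transactions record the version of
# everything they read, buffer their writes, and validate at commit time.

class Store:
    def __init__(self):
        self.data = {}      # key -> value
        self.version = {}   # key -> version counter

    def begin(self):
        return {"reads": {}, "writes": {}}

    def read(self, txn, key):
        txn["reads"][key] = self.version.get(key, 0)  # remember version seen
        return self.data.get(key)

    def write(self, txn, key, value):
        txn["writes"][key] = value  # buffered until commit

    def commit(self, txn):
        # Validation phase: abort if any record we read changed underneath us.
        for key, seen in txn["reads"].items():
            if self.version.get(key, 0) != seen:
                return False  # conflict -> caller restarts the transaction
        # Write phase: install buffered writes and bump versions.
        for key, value in txn["writes"].items():
            self.data[key] = value
            self.version[key] = self.version.get(key, 0) + 1
        return True

store = Store()
t1, t2 = store.begin(), store.begin()
store.read(t1, "x"); store.read(t2, "x")
store.write(t1, "x", 1); store.write(t2, "x", 2)
print(store.commit(t1), store.commit(t2))  # prints: True False
```

No locks are held while the transactions run; the loser simply restarts, which is cheap exactly when conflicts are rare.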
Access Path Selection in a Relational Database Management System (1979): The basics of query optimization. SQL is declarative, i.e. you specify using a query language what data you want, not how you want it computed. There are usually multiple ways (query plans) of executing a query. The database system examines multiple plans and decides on an optimal one (best-effort). This process is called query optimization. The traditional way of doing query optimization is to have a cost model for different access methods and query plans. This paper explains the cost model and a dynamic programming algorithm to pick the best plan.
Eddies: Continuously Adaptive Query Processing (2000): Traditional query optimization (and the cost model it uses) is static. There are problems with the traditional model. First, it is hard to build the cost model absent data statistics. Second, the query execution environment might change during long-running queries, and a static approach cannot capture the change. Analogous to fluid dynamics, this paper proposes a set of techniques that optimize query execution dynamically. I don't think ideas in Eddies have made their way into commercial systems yet, but the paper is very refreshing to read and might become more important now.
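The dynamic programming idea can be shown at toy scale: find the cheapest plan for every subset of tables, building larger subsets from smaller ones. The cost model below (cost = size of each materialized intermediate, with a fixed join selectivity) and the table cardinalities are invented; the paper's real model accounts for access paths, per-predicate selectivities, and interesting sort orders.

```python
# Toy Selinger-style DP over left-deep join orders.
from itertools import combinations

sizes = {"A": 1000, "B": 10, "C": 100}  # hypothetical table cardinalities
SELECTIVITY = 0.01                       # assume every join keeps 1% of the cross product

def join_size(left_size, right_size):
    return left_size * right_size * SELECTIVITY

tables = sorted(sizes)
# subset of tables -> (result size, total cost, plan string)
best = {frozenset([t]): (sizes[t], 0, t) for t in tables}

for k in range(2, len(tables) + 1):
    for subset in map(frozenset, combinations(tables, k)):
        for table in subset:
            rest = subset - {table}
            rest_size, rest_cost, rest_plan = best[rest]
            size = join_size(rest_size, sizes[table])
            cost = rest_cost + size  # pay for each intermediate we materialize
            if subset not in best or cost < best[subset][1]:
                best[subset] = (size, cost, f"({rest_plan} JOIN {table})")

plan_size, plan_cost, plan = best[frozenset(tables)]
print(plan, "cost =", plan_cost)  # joining B and C first is cheapest here
```

Joining the two small tables first keeps the intermediates tiny, which is exactly the kind of ordering decision the cost model exists to make.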
Classic System Design
A History and Evaluation of System R (1981): There were System R from IBM and Ingres from Berkeley, two systems that showed relational databases were feasible. This paper describes System R. It is impressive and scary to note that the internals of relational database systems today still look a lot like System R in 1981.
The Google File System (2003) and Bigtable: A Distributed Storage System for Structured Data (2006): Both are core components of Google's data infrastructure. GFS is an append-only distributed file system for large sequential reads (data-intensive applications). BigTable is a high-performance distributed data store built on GFS. One way to think about it is that GFS is optimized for high throughput, and BigTable explains how to build a low-latency data store on top of GFS. Some of these might have been replaced by newer proprietary technologies internal to Google, but the ideas stand.
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications (2001) and Dynamo: Amazon's Highly Available Key-value Store (2007): Chord was born in the era when distributed hash tables were a hot research topic. It does one thing, and does it really well: how to look up the location of a key in a completely distributed setting (peer-to-peer) using consistent hashing. The Dynamo paper explains how to build a distributed key-value store using Chord. Note that some design decisions changed from Chord to Dynamo, e.g. finger table O(log N) vs. O(N) routing state, because in Dynamo's case, Amazon has more control over nodes in a data center, while Chord assumes peer-to-peer nodes in wide area networks.
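The consistent-hashing lookup at the heart of both systems fits in a few lines: hash nodes and keys onto one ring, and a key lives on the first node clockwise from its hash. Node names below are made up, and the linear ring scan stands in for Chord's finger table, which finds the successor in O(log N) hops.

```python
# Minimal consistent-hashing ring. Adding or removing one node only remaps
# the keys between that node and its predecessor, not the whole keyspace.
import bisect
import hashlib

def ring_position(name: str, ring_bits: int = 16) -> int:
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** ring_bits)

nodes = ["node-a", "node-b", "node-c"]
ring = sorted((ring_position(n), n) for n in nodes)
positions = [p for p, _ in ring]

def lookup(key: str) -> str:
    pos = ring_position(key)
    idx = bisect.bisect_right(positions, pos)  # first node clockwise
    return ring[idx % len(ring)][1]            # wrap past the highest position

print(lookup("user:42"), lookup("user:43"))
```

Real deployments also place several virtual nodes per physical node to even out the load, a refinement both papers discuss.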
Columnar Databases
Columnar storage and column-oriented query engines are critical for analytical workloads, e.g. OLAP. It has been well over a decade since the idea first came out (the MonetDB paper in 1999), and almost every commercial warehouse database has a columnar engine by now.
C-Store: A Column-oriented DBMS (2005) and The Vertica Analytic Database: C-Store 7 Years Later (2012): C-Store is an influential academic system done by the folks in New England. Vertica is the commercial incarnation of C-Store.
Column-Stores vs. Row-Stores: How Different Are They Really? (2008): Discusses the importance of both columnar storage and the query engine.
Dremel: Interactive Analysis of Web-Scale Datasets (2010): A jaw-dropping paper when Google published it. Dremel is a massively parallel analytical database used at Google for ad-hoc queries. The system runs on thousands of nodes to process terabytes of data in seconds. It applies columnar storage to complex, nested data structures. The paper talks a lot about the nested data structure support, and is a bit light on the details of the query execution. Note that a number of open source projects claim they are building "Dremel". The Dremel system achieves low latency through massive parallelism and columnar storage, so the model doesn't necessarily make sense outside Google, since very few companies in the world can afford thousands of nodes for ad-hoc queries.
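The difference between the two layouts can be shown with plain Python lists (the table contents are invented). Scanning one column in a columnar layout touches only that column's values, which is why analytical scans and aggregations benefit.

```python
# Row vs. column layout for the same three-row table.

rows = [  # row-oriented: each record's fields stored together
    {"id": 1, "price": 10.0, "qty": 3},
    {"id": 2, "price": 4.5,  "qty": 7},
    {"id": 3, "price": 8.0,  "qty": 1},
]

columns = {  # column-oriented: each attribute stored contiguously
    "id":    [1, 2, 3],
    "price": [10.0, 4.5, 8.0],
    "qty":   [3, 7, 1],
}

# SELECT SUM(price): the row layout walks every field of every record...
row_sum = sum(r["price"] for r in rows)
# ...while the column layout reads just the price array, which also
# compresses far better since the values share one type and distribution.
col_sum = sum(columns["price"])
print(row_sum, col_sum)
```

On disk the gap is I/O, not just iteration: a wide table's unused columns are never read at all.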
Data-parallel computation
MapReduce: Simplified Data Processing on Large Clusters (2004): MapReduce is both a programming model (borrowed from an old concept in functional programming) and a system at Google for distributed data-intensive computation. The programming model is so simple, yet expressive enough to capture a wide range of programming needs. The system, coupled with the model, is fault-tolerant and scalable. It is probably fair to say that half of academia is now working on problems heavily influenced by MapReduce.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (2012): This is the paper behind the Spark cluster computing project at Berkeley. Spark exposes a distributed memory abstraction called the RDD, which is an immutable collection of records distributed across a cluster's memory. RDDs can be transformed using MapReduce-style computations. The RDD abstraction can be orders of magnitude more efficient for workloads that exhibit strong temporal locality, e.g. query processing and iterative machine learning. Spark is an example of why it is important to separate the MapReduce programming model from its execution engine.
Shark: SQL and Rich Analytics at Scale: Describes the Shark system, the SQL engine built on top of Spark. More importantly, the paper discusses why previous SQL-on-Hadoop/MapReduce query engines were slow.
Spanner (2012): Spanner is "a scalable, multi-version, globally distributed, and synchronously replicated database". The linchpin that allows all this functionality is the TrueTime API, which lets Spanner order events between nodes without having them communicate. There is some speculation that TrueTime is similar to a vector clock, with each node having to store less data. Sadly, a paper on TrueTime has been promised but hasn't yet been released.
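The model is easy to show with the canonical word count, simulated in a single process. Only the programming model is reproduced here; the real system shards map tasks across machines, shuffles intermediate pairs by key, and handles worker failures.

```python
# Word count in the MapReduce programming model.
from collections import defaultdict

def map_fn(document: str):
    for word in document.split():
        yield (word, 1)            # emit an intermediate (key, value) pair

def reduce_fn(word: str, counts):
    return (word, sum(counts))     # combine all values for one key

def run_mapreduce(documents):
    shuffle = defaultdict(list)    # stands in for the shuffle/sort phase
    for doc in documents:
        for word, one in map_fn(doc):
            shuffle[word].append(one)
    return dict(reduce_fn(w, c) for w, c in shuffle.items())

print(run_mapreduce(["the quick fox", "the lazy dog", "the fox"]))
```

Because `map_fn` and `reduce_fn` are pure functions over key-value pairs, the runtime is free to re-execute them anywhere after a failure, which is where the fault tolerance comes from.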
Consensus and consistency
Paxos Made Simple (2001): Paxos is a fault-tolerant distributed consensus protocol. It forms the basis of a wide variety of distributed systems. The idea is simple, yet notoriously difficult to understand (perhaps due to the way the original Paxos paper was written).
The Raft Consensus Algorithm: Raft is a consensus algorithm designed as an alternative to Paxos. It was meant to be more understandable than Paxos by means of separation of logic, but it is also formally proven safe and offers some new features. Raft offers a generic way to distribute a state machine across a cluster of computing systems, ensuring that each node in the cluster agrees upon the same series of state transitions.
CAP Twelve Years Later: How the "Rules" Have Changed (2012): The CAP theorem, proposed by Eric Brewer, asserts that any networked shared-data system can have only two of three desirable properties: consistency, availability, and partition tolerance. A number of NoSQL stores reference CAP to justify their decision to sacrifice consistency. This is Eric Brewer's writeup on CAP in retrospect, explaining that the "2 of 3" formulation was always misleading because it tended to oversimplify the tensions among properties.
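The state-machine-replication idea behind Raft can be shown in miniature: if every replica applies the same command log in the same order to a deterministic state machine, all replicas end in the same state. Reaching agreement on that log under failures is the hard part Raft solves; it is simply assumed below, and the commands are invented for illustration.

```python
# Deterministic state machine: same log + same order => same final state.

def apply(state: dict, command: tuple) -> dict:
    op, key, value = command
    new_state = dict(state)        # pure function: never mutate the input
    if op == "set":
        new_state[key] = value
    elif op == "add":
        new_state[key] = new_state.get(key, 0) + value
    return new_state

# Pretend the consensus layer has already replicated this log to every node.
agreed_log = [("set", "x", 1), ("add", "x", 4), ("set", "y", 7)]

replicas = [{} for _ in range(3)]
for i in range(len(replicas)):
    state = replicas[i]
    for command in agreed_log:     # same log, same order, on every node
        state = apply(state, command)
    replicas[i] = state

print(replicas[0])  # prints: {'x': 5, 'y': 7}
```

Everything Raft adds (leader election, log matching, commit rules) exists to guarantee that `agreed_log` really is identical on every node despite crashes and partitions.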
Trends (Cloud Computing, Warehouse-scale Computing, New Hardware)
A View of Cloud Computing: This is the paper on cloud computing. It discusses the economics and obstacles of cloud computing (referring to the elasticity of resources, not the consumer-facing "cloud") from a technical perspective. The obstacles presented in this paper will impact design decisions for systems running in the cloud.
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines: Google's Luiz André Barroso and Urs Hölzle explain the basics of data center hardware and software for warehouse-scale computing. There is an accompanying video. The video talks about the importance of cutting long-tail latency in massively parallel systems. The other key idea is the disaggregation of resources. Technologies such as GFS/HDFS already disaggregate disks because of high network bandwidth, but we have yet to see the same trend apply to DRAM, because that would require low-latency networking.
Miscellaneous
Reflections on Trusting Trust (1984): Ken Thompson's Turing Award acceptance speech in 1984, describing black-box backdoor issues and pointing out that trust is not absolute.
What Goes Around Comes Around: Michael Stonebraker and Joseph M. Hellerstein provide a summary of 35 years of data model proposals, grouped into 9 different eras. The paper discusses the proposals of each era and shows that there are only a few basic data modeling ideas, most of which have been around a long time. Later proposals inevitably bear a strong resemblance to certain earlier proposals.
External Reading Lists
A number of schools have their own reading lists for graduate students in databases.
- Berkeley PhD prelim Exam reading list and CS286 Grad database class reading list
- Brown CSCI 2270 Advanced Topics in Database Management
- Stanford PhD Qualifying Exam
- MIT: Database Systems 6.830 reading lists (including the year 2010 list)
- Wisconsin Database Qualifying Exam Reading List (2014)
- CMU 15-721 Database Systems Reading List (Spring 2016)