Big Data technology has been extremely disruptive, with open source playing a dominant role in shaping its evolution. While on one hand it has been disruptive, on the other it has led to a complex ecosystem where new frameworks, libraries and tools are being released pretty much every day, creating confusion as technologists struggle and grapple with the deluge.
If you are a Big Data enthusiast or a technologist ramping up (or scratching your head), it is important to spend some serious time deeply understanding the architecture of key systems to appreciate their evolution. Understanding the architectural components and subtleties will also help you choose and apply the appropriate technology for your use case. In my journey over the last few years, some literature has helped me become a better educated data professional. My goal here is not only to share the literature but also to use the opportunity to put some sanity into the labyrinth of open source systems.
One caution: most of the reference literature included is heavily skewed towards deep architecture overviews (in most cases the original papers) rather than basic overviews. I firmly believe that a deep dive will fundamentally help you understand the nuances, though it will not provide you with any shortcuts if you want a quick basic overview.
Key Architecture Layers
- File Systems–distributed file systems which provide storage, fault tolerance, scalability, reliability, and availability.
- Data Stores–evolution of application databases into polyglot storage with application-specific databases instead of one size fits all. Common ones are Key-value, Document, Column and Graph.
- Resource Managers–provide resource management capabilities and support schedulers for high utilization and throughput.
- Coordination–systems that manage state, distributed coordination, consensus and lock management.
- Computational Frameworks–a lot of work is happening at this layer, with highly specialized compute frameworks for Streaming, Interactive, Real time, Batch and Iterative Graph (BSP) processing. Powering these are complete computation runtimes like BDAS (Spark) & Flink.
- Data Analytics–analytical (consumption) tools and libraries, which support exploratory, descriptive, predictive and statistical analysis and machine learning.
- Data Integration–these include not only the orchestration tools for managing pipelines but also metadata management.
- Operational Frameworks–these provide scalable frameworks for monitoring & benchmarking.
The modern data architecture is evolving with a goal of reduced latency between data producers and consumers. This is consequently leading to real time and low latency processing, bridging the traditional batch and interactive layers into hybrid architectures like Lambda and Kappa.
- Lambda–established architecture for a typical data pipeline. More details.
- Kappa–an alternative architecture which moves the processing upstream to the stream layer.
- Summingbird–a reference model on bridging the online and traditional processing models.
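To make the hybrid idea concrete, here is a minimal sketch of how a Lambda style query could merge a batch view with a speed view. It is my own illustration and not taken from any of the referenced papers; the function names (`batch_view`, `speed_view`, `query`) and the sample events are hypothetical.

```python
from collections import Counter

def batch_view(events):
    """Recompute page-view counts from the full (already landed) history."""
    return Counter(page for page, _ in events)

def speed_view(recent_events):
    """Incrementally count events that arrived after the last batch run."""
    counts = Counter()
    for page, _ in recent_events:
        counts[page] += 1
    return counts

def query(page, batch, speed):
    """Serve a low latency answer by merging the batch and speed views."""
    return batch[page] + speed[page]

# Hypothetical (page, timestamp) events; the split point is the last batch run.
history = [("home", 1), ("home", 2), ("cart", 3)]
recent = [("home", 4), ("cart", 5), ("cart", 6)]

batch = batch_view(history)
speed = speed_view(recent)
print(query("home", batch, speed))  # 3
print(query("cart", batch, speed))  # 3
```

In a Kappa style design the separate batch path would go away, and the same counts would be rebuilt by replaying the event log through the stream path.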
Before diving into the actual layers, here are some general documents which can provide you with a great background on NoSQL, warehouse scale computing and distributed systems.
- Data Center as a computer–provides a great background on warehouse scale computing.
- NoSQL Data Stores–background on a diverse set of key-value, document and column oriented stores.
- NoSQL thesis–great background on distributed systems, first generation NoSQL systems.
- Large scale Data Management–covers the data model, the system architecture and the consistency model, ranging from traditional database vendors to new emerging internet-based enterprises.
- Eventual consistency–background on the different consistency models for distributed systems.
- CAP Theorem–a nice background on CAP and its evolution.
There has also been in the past a fierce debate between traditional Parallel DBMS and the MapReduce paradigm of processing. The pro Parallel DBMS (another) paper(s) is rebutted by the pro MapReduce one. Ironically, the Hadoop community has since come full circle with the introduction of MPP style shared nothing based processing on Hadoop–SQL on Hadoop.
As the focus shifts to low latency processing, there is a shift from traditional disk based storage file systems to an emergence of in memory file systems, which drastically reduces the I/O & disk serialization cost. Tachyon and Spark RDD are examples of that evolution.
- Google File System–the seminal work on distributed file systems which shaped the Hadoop File System.
- Hadoop File System–historical context/architecture on evolution of HDFS.
- Ceph File System–an alternative to HDFS.
- Tachyon–an in memory storage system to handle the modern day low latency data processing.
File systems have also seen an evolution in file formats and compression techniques. The following references give you a great background on the merits of row and column formats and the shift towards newer nested column oriented formats, which are highly efficient for Big Data processing. Erasure codes use some innovative techniques to reduce the storage cost of the triplication (3 replicas) scheme without compromising data recoverability and availability.
- Column oriented vs Row-stores–good overview of data layout, compression and materialization.
- RCFile–hybrid PAX structure which takes the best of both the column and row oriented stores.
- Parquet–column oriented format first covered in Google's Dremel paper.
- ORCFile–an improved column oriented format used by Hive.
- Compression–compression techniques and their comparison on the Hadoop ecosystem.
- Erasure Codes–background on erasure codes and techniques; improvement on the default triplication in Hadoop to reduce storage cost.
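The trade-off behind erasure codes can be illustrated with the simplest possible code, a single XOR parity block. This is a deliberately simplified stand-in of my own, not the scheme from the referenced paper: a single parity block survives only one lost block, whereas production systems typically use codes such as Reed-Solomon that tolerate several. The helper names and the (10 data, 4 parity) layout are assumptions chosen for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def storage_overhead(data_blocks, parity_blocks):
    """Extra storage as a fraction of the raw data size."""
    return parity_blocks / data_blocks

data = [b"big ", b"data", b"rock"]   # three equal-length data blocks
parity = xor_blocks(data)            # one parity block

# Simulate losing block 1 and rebuilding it from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # b'data'

# Triplication keeps 2 extra copies per block: 200% overhead.
print(storage_overhead(1, 2))    # 2.0
# A (10 data, 4 parity) layout, used here purely as an example: 40% overhead.
print(storage_overhead(10, 4))   # 0.4
```

The point of the arithmetic is that parity based layouts can stay recoverable while storing far fewer extra bytes than three full replicas, at the cost of extra computation when a block has to be rebuilt.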
Broadly, the distributed data stores are classified into ACID & BASE stores, depending on the continuum of strong to weak consistency respectively. BASE is further classified into Key-value, Document, Column and Graph, depending on the underlying schema & supported data structure. While there is a multitude of systems and offerings in this space, I have covered a few of the more prominent ones. I apologize if I have missed a significant one...
Key Value Stores
- Dynamo–key-value distributed storage system.
- Cassandra–inspired by Dynamo; a multi-dimensional key-value/column oriented data store.
- Voldemort–another one inspired by Dynamo, developed at LinkedIn.
Column Oriented Stores
- Bigtable–seminal paper from Google on distributed column oriented data stores.
- HBase–while there is no definitive paper, this provides a good overview of the technology.
- Hypertable–provides a good overview of the architecture.
Document Oriented Stores
- CouchDB–a popular document oriented data store.
- MongoDB–a good introduction to MongoDB architecture.
Graph
- Neo4j–most popular graph database.
- Titan–open source graph database under the Apache license.
ACID
I see a lot of evolution happening in the open source community, which will try to catch up with what Google has done–3 of the prominent papers below are from Google, which has solved the globally distributed consistent data store problem.
- Megastore–a highly available distributed consistent database. Uses Bigtable as its storage subsystem.
- Spanner–globally distributed synchronously replicated linearizable database which supports SQL access.
- Mesa–provides consistency, high availability, reliability, fault tolerance and scalability for large data and query volumes.
- CockroachDB–an open source version of Spanner (led by former engineers) in active development.
While the first generation of the Hadoop ecosystem started with monolithic schedulers like YARN, the evolution is towards hierarchical schedulers (Mesos), which can manage distinct workloads, across different kinds of compute workloads, to achieve higher utilization and efficiency.
- YARN–the next generation Hadoop compute framework.
- Mesos–scheduling between multiple diverse cluster computing frameworks.
These are loosely coupled with schedulers whose primary function is to schedule jobs based on scheduling policies/configuration.
Schedulers
- Capacity Scheduler–introduction to different features of the capacity scheduler.
- FairShare Scheduler–introduction to different features of the fair scheduler.
- Delayed Scheduling–introduction to delayed scheduling for the FairShare scheduler.
- Fair & Capacity Schedulers–a survey of Hadoop schedulers.
These are systems that are used for coordination and state management across distributed data systems.
- Paxos–a simple version of the classical paper; used for distributed systems consensus and coordination.
- Chubby–Google's distributed locking service that implements Paxos.
- ZooKeeper–open source version inspired by Chubby, though it is a general coordination service rather than simply a locking service.
The execution runtimes provide an environment for running distinct kinds of compute. The most common runtimes are:
- Spark–its popularity and adoption is challenging the traditional Hadoop ecosystem.
- Flink–very similar to the Spark ecosystem; its strength over Spark is in iterative processing.
The frameworks can broadly be classified based on the model and latency of processing.
Batch
- MapReduce–the seminal paper from Google on MapReduce.
- MapReduce Survey–a dated, yet good paper; survey of MapReduce frameworks.
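For readers new to the model, the map, shuffle and reduce steps can be sketched in plain Python with the classic word count. This is a single process illustration of the programming model only, not of the distributed runtime described in the paper; the function names and input strings are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

documents = ["big data papers", "big data systems", "data stores"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["data"])  # 3
```

In a real cluster the map and reduce calls run in parallel on many machines and the shuffle moves data across the network, which is where most of the engineering in the paper goes.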
Iterative (BSP)
- Pregel–Google's paper on large scale graph processing.
- Giraph–large-scale distributed graph processing system modelled around Pregel.
- GraphX–graph computation framework that unifies graph-parallel and data parallel computation.
- Hama–general BSP computing engine on top of Hadoop.
- Open source graph processing–survey of open source systems modelled around Pregel BSP.
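The BSP style is easier to grasp with a toy example. The sketch below propagates the maximum vertex value through a small graph in supersteps, with messages sent in one superstep becoming visible only in the next. It is a single machine illustration under my own assumptions (the `pregel_max` name and the sample graph are hypothetical), not the API of Pregel, Giraph or any other system listed here.

```python
def pregel_max(graph, values):
    """Propagate the maximum vertex value through the graph, superstep by superstep.

    graph: dict mapping each vertex to a list of its out-neighbours.
    values: dict mapping each vertex to its initial numeric value.
    Returns the final values and the number of supersteps executed.
    """
    values = dict(values)
    # Superstep 0: every vertex is active and sends its value to its neighbours.
    inbox = {v: [] for v in graph}
    for v, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[v])
    supersteps = 1

    # Later supersteps: a vertex sends messages only if a message raised its value.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in graph[v]:
                    outbox[n].append(values[v])
        inbox = outbox  # barrier: messages become visible in the next superstep
        supersteps += 1
    return values, supersteps

# Hypothetical four-vertex chain with bidirectional edges.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
initial = {"a": 3, "b": 6, "c": 2, "d": 1}
final, steps = pregel_max(graph, initial)
print(final)  # {'a': 6, 'b': 6, 'c': 6, 'd': 6}
```

The computation stops once no messages are in flight, which mirrors the idea of vertices voting to halt when they have nothing new to report.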
Streaming
- Stream Processing–a great overview of the distinct real time processing systems.
- Storm–real time Big Data processing system.
- Samza–stream processing framework from LinkedIn.
- Spark Streaming–introduced the micro batch architecture bridging the traditional batch and interactive processing.
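The micro batch idea can be sketched as cutting a stream into small fixed-width time windows and running an ordinary batch computation on each one. The code below is a minimal illustration over a finite list, whereas a real engine discretizes events as they arrive; the function names, the two second interval and the sample events are my own assumptions, not Spark Streaming's API.

```python
from collections import defaultdict

def micro_batches(events, interval):
    """Group (timestamp, value) events into fixed-width time windows."""
    batches = defaultdict(list)
    for timestamp, value in events:
        window_start = (timestamp // interval) * interval
        batches[window_start].append(value)
    return dict(sorted(batches.items()))

def process_batch(values):
    """Any ordinary batch computation; here, a simple sum."""
    return sum(values)

# Hypothetical stream of (timestamp in seconds, value) events.
stream = [(0, 5), (1, 3), (2, 7), (4, 1), (5, 2)]
for start, values in micro_batches(stream, interval=2).items():
    print(start, process_batch(values))
# 0 8
# 2 7
# 4 3
```

The interval is the main tuning knob: a smaller window lowers latency but leaves less work to amortize the per-batch overhead.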
Interactive
- Dremel–Google's paper on how it processes interactive big data workloads, which laid the groundwork for multiple open source SQL systems on Hadoop.
- Impala–MPP style processing to make Hadoop performant for interactive workloads.
- Drill–an open source implementation of Dremel.
- Shark–provides a good introduction to the data analysis capabilities on the Spark ecosystem.
- Shark–another great paper which goes deeper into SQL access.
- Dryad–configuring & executing parallel data pipelines using DAG.
- Tez–open source implementation of Dryad using YARN.
- BlinkDB–enabling interactive queries over data samples and presenting results annotated with meaningful error bars.
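The idea of answering from a sample and attaching error bars can be sketched with a plain uniform random sample and a normal approximation. This is a simplified stand-in of my own and not BlinkDB's actual sampling or error estimation strategy; the `approximate_mean` helper, the synthetic data and the z value of 1.96 (roughly a 95% interval) are all assumptions.

```python
import math
import random
import statistics

def approximate_mean(population, sample_size, seed=0, z=1.96):
    """Estimate the mean from a uniform random sample, with an error margin.

    z=1.96 corresponds to a roughly 95% confidence interval under a
    normal approximation; it is an assumption of this sketch.
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_error

latencies = [float(x % 100) for x in range(10_000)]  # synthetic data
estimate, margin = approximate_mean(latencies, sample_size=500)
print(f"mean ~ {estimate:.2f} +/- {margin:.2f}")
```

Reading 500 rows instead of 10,000 is what buys the interactivity, and the margin tells the user how much precision was traded away for it.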
RealTime
- Druid–a real time OLAP data store. Operationalized time series analytics databases.
- Pinot–LinkedIn OLAP data store very similar to Druid.
The analysis tools range from declarative languages like SQL to procedural languages like Pig. Libraries, on the other hand, support out of the box implementations of the most common data mining and machine learning algorithms.
Tools
- Pig–provides a good overview of Pig Latin.
- Pig–provides an introduction on how to build data pipelines using Pig.
- Hive–provides an introduction to Hive.
- Hive–another good paper to understand the motivations behind Hive at Facebook.
- Phoenix–SQL on HBase.
- Join Algorithms for MapReduce–provides a great introduction to different join algorithms on Hadoop.
- Join Algorithms for MapReduce–another great paper on the different join techniques.
Libraries
- MLlib–machine learning framework on Spark.
- SparkR–distributed R on the Spark framework.
- Mahout–machine learning framework on traditional MapReduce.
Data Integration
Data integration frameworks provide good mechanisms to ingest and outgest data between Big Data systems. They range from orchestration pipelines to metadata frameworks with support for lifecycle management and governance.
Ingest/Messaging
- Flume–a framework for collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
- Sqoop–a tool to move data between Hadoop and relational data stores.
- Kafka–distributed messaging system for data processing.
ETL/Workflow
- Crunch–library for writing, testing, and running MapReduce pipelines.
- Falcon–data management framework that helps automate movement and processing of Big Data.
- Cascading–data manipulation through scripting.
- Oozie–a workflow scheduler system to manage Hadoop jobs.
Metadata
- HCatalog–a table and storage management layer for Hadoop.
Serialization
- Protocol Buffers–language neutral serialization format popularized by Google.
- Avro–modeled around Protocol Buffers for the Hadoop ecosystem.
Finally, the operational frameworks provide capabilities for metrics, benchmarking and performance optimization to manage workloads.
Monitoring Frameworks
- OpenTSDB–a time series metrics system built on top of HBase.
- Ambari–system for collecting, aggregating and serving Hadoop and system metrics.
Benchmarking
- YCSB–performance evaluation of NoSQL systems.
- GridMix–provides a benchmark for Hadoop workloads by running a mix of synthetic jobs.
- Background on Big Data benchmarking with the key challenges associated.
I hope that the papers are useful as you embark on or strengthen your journey. I am sure there are a few hundred more papers that I might have inadvertently missed, and a whole bunch of systems that I might be unfamiliar with. Apologies in advance, as I don't mean to offend anyone, though I am happy to be educated...
Open source Big Data architecture papers for data professionals