Big Data Architecture in the Post-Hadoop Era (Reprint)

Source: Internet
Author: User
Tags: cassandra, hortonworks, hadoop, ecosystem, sqoop

Original: Fei

When discussing big data analytics platforms, we have to start with Hadoop. Hadoop is now more than 10 years old and much has changed; its version has evolved from 0.x to the current 2.6. I define the period after 2012 as the post-Hadoop era: not an era without Hadoop, but one with additional choices alongside it, such as NoSQL ("not only SQL"). I have also written an introductory piece on how to learn Hadoop (Fei's answer); to lay some groundwork, let me briefly cover the relevant open-source components.
Background
  • Hadoop: an open-source data analytics platform that addresses the reliable storage and processing of big data (data too large for one computer to store, or too large for one computer to process within the required time). It is well suited to processing unstructured data and includes the basic components HDFS and MapReduce.
  • HDFS: provides an elastic data storage system across servers.
  • MapReduce: provides a standardized, data-locality-aware processing flow: read the data, map it (Map), redistribute it by key (shuffle), and then reduce it (Reduce) to produce the final output.
  • Amazon Elastic MapReduce (EMR): a managed solution running on Amazon Elastic Compute Cloud (EC2) and Simple Storage Service (S3), built on web-scale infrastructure. If you only need one-off or infrequent big data processing, EMR may save you money. However, EMR is highly optimized for data in S3 and comes with higher latency.
  • Hadoop also includes a range of ecosystem extensions, including Sqoop, Flume, Hive, Pig, Mahout, DataFu, and Hue.
      • Pig: A platform for analyzing large datasets, which consists of a high-level language for expressing data analysis programs and an infrastructure for evaluating these programs.
      • Hive: a data warehouse system for Hadoop that provides a SQL-like query language, making it easy to summarize data and run ad-hoc queries and analysis.
      • Hbase: A distributed, scalable, big data repository that supports random, real-time read/write access.
      • Sqoop: a tool designed to efficiently transfer bulk data between Apache Hadoop and structured data stores such as relational databases.
      • Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large volumes of log data.
      • ZooKeeper: a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services.
  • Cloudera: the most mature Hadoop distribution, with the most deployments. Provides powerful deployment, management, and monitoring tools. Developed and contributed the Impala project for real-time big data processing.
  • Hortonworks: uses a 100% open-source Apache Hadoop distribution. Has developed and committed many enhancements to the core trunk, enabling Hadoop to run natively on platforms including Windows Server and Azure.
  • MapR: delivers better performance and ease of use by supporting a native UNIX file system instead of HDFS. Provides high-availability features such as snapshots, mirroring, and stateful failover. Leads the Apache Drill project, an open-source implementation of Google's Dremel designed to execute SQL-like queries for real-time processing.
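The Map → shuffle → Reduce flow described in the MapReduce bullet above can be sketched as a single-process Python simulation (an illustrative toy, not the Hadoop API):

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input record.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as the framework does
    # between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: fold each key's list of values into a final result.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "data moves fast"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'moves': 1, 'fast': 1}
```

In real Hadoop, the map and reduce functions run in parallel across machines and the shuffle moves data over the network, but the contract between the three stages is exactly this.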
Principles

Data storage

Our goal is a reliable system that supports large-scale expansion and is easy to maintain. Inside a computer there is a locality principle: moving up the storage hierarchy, access gets faster, but storage cost rises as well.

Compared with memory, disk and SSD require careful thought about data placement, because performance varies greatly. Disks have the advantages of persistence, low cost per unit, and easy backup. But with memory getting cheap, many datasets can be kept directly in memory and distributed across machines, some organized as key-value stores, with memcached used for caching. Memory can be made persistent through battery-backed RAM, write-ahead logging, periodic snapshots, or replication to the memory of other machines. On restart, the previous state has to be reloaded from disk or over the network. In practice, writes go to an append-only log on disk while reads are served directly from memory. In-memory relational databases such as VoltDB, MemSQL, and RAMCloud deliver high performance and avoid the problems of earlier disk-based management.
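The write path just described (append to a durable log, serve reads from memory, replay the log on restart) can be sketched as a toy key-value store. The `MemTable` class and tab-separated log format here are illustrative choices, not any particular product's API:

```python
import os
import tempfile

class MemTable:
    """Toy key-value store: every write is appended to a durable log on
    disk, reads are served entirely from an in-memory dict, and a restart
    rebuilds the in-memory state by replaying the log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        if os.path.exists(log_path):            # recovery: replay the log
            with open(log_path) as f:
                for line in f:
                    key, _, value = line.rstrip("\n").partition("\t")
                    self.data[key] = value

    def put(self, key, value):
        # Durable append first, then update memory. (Keys and values must
        # not contain tabs or newlines in this toy format.)
        with open(self.log_path, "a") as f:
            f.write(f"{key}\t{value}\n")
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)               # memory-speed read

path = os.path.join(tempfile.mkdtemp(), "store.log")
store = MemTable(path)
store.put("user:1", "alice")
store.put("user:1", "amy")     # later write supersedes the earlier one

recovered = MemTable(path)     # simulated restart: state comes back from the log
```

Because the log is append-only, writes are sequential I/O (fast on disk), while the random-access work all happens in memory.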

HyperLogLog, Bloom Filter & Count-Min Sketch

These are all algorithms designed for big data, and the common idea is to process the input with a set of independent hash functions. HyperLogLog estimates the cardinality of a very large set (how many distinct elements it contains in total): it splits each hash value, using the low-order bits to select a bucket and tracking, per bucket, the maximum run of leading zeros seen in the high-order bits. A Bloom filter computes all the hash values for each input during preprocessing and sets the corresponding bits. To check whether a particular input has occurred, simply look for any unset bit among its hash positions. A Bloom filter may return false positives, but never false negatives; it can be viewed as a data structure that answers only presence or absence (i.e., whether an element's frequency is at least 1). A Count-Min sketch builds on the Bloom filter idea and can estimate the frequency of an input (not limited to frequency ≥ 1).
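As a concrete example, here is a minimal Bloom filter in Python. Deriving the k "independent" hash functions by salting a single SHA-256 digest is an illustrative shortcut, not how production filters are necessarily built:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k salted hashes set bits on add; a lookup
    answers 'possibly present' (false positives allowed) or 'definitely
    absent' (false negatives impossible)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k hash functions by salting one cryptographic digest.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # If any of the item's positions is unset, it was never added.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("cassandra")
# An added item is always reported present: no false negatives.
```

The whole set membership structure costs only `num_bits` bits regardless of how large the inserted elements are, which is exactly why these sketches matter at big data scale.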

CAP theorem

In short, there are three properties: consistency, availability, and partition tolerance, and a system can provide at most two of them at once. Different types of systems are designed by weighing these differently. Distributed systems also involve many algorithms and deep theory, for example: the Paxos algorithm, the gossip protocol (used by Cassandra), quorums, logical time and vector clocks, the Byzantine generals problem, two-phase commit, and so on; they all require patient study.
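One of these building blocks, the quorum, is easy to verify exhaustively: with N replicas, a write acknowledged by W nodes and a read contacting R nodes are guaranteed to overlap in at least one node exactly when R + W > N. A brute-force check in Python:

```python
from itertools import combinations

def quorums_always_overlap(n, r, w):
    """Brute force: with n replicas, does every possible write quorum of
    size w share at least one node with every possible read quorum of
    size r? This is equivalent to the condition r + w > n."""
    nodes = range(n)
    return all(set(write_set) & set(read_set)
               for write_set in combinations(nodes, w)
               for read_set in combinations(nodes, r))

# Dynamo-style N=3, W=2, R=2: every read quorum touches a node that saw
# the latest write.
print(quorums_always_overlap(3, 2, 2))  # True
# W=1, R=1 on three replicas: a read can entirely miss the latest write.
print(quorums_always_overlap(3, 1, 1))  # False
```

This overlap guarantee is what lets quorum systems serve consistent reads without contacting every replica.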

Technologies


Depending on latency requirements (SLAs), the size of the data to store, the volume of updates, and the analysis requirements, a big data processing architecture must be designed flexibly. The following sections describe the big data components used in these different areas.

Speaking of big data technology, we still have to mention Google, and Google's new troika: Spanner, F1, and Dremel.

Spanner: a highly scalable, multi-version, globally distributed internal Google database with synchronous replication, supporting externally consistent distributed transactions. It is designed to span hundreds of thousands of servers across the world, holding trillions of rows of records! (Google is that dominant ^_^)

F1: built on Spanner and leveraging its rich features, F1 adds distributed SQL and transactionally consistent secondary indexes. It has successfully replaced the old manually sharded MySQL solution in the AdWords advertising business.

Dremel: a system for analyzing data that can run on thousands of servers, using a SQL-like language, and can process web-scale datasets (petabytes) at very high speed, in just seconds.


Spark

The hottest big data technology of 2014 was Spark (introduced in "What books on Spark would you recommend? - Fei's answer"). Its main aim is faster data analysis based on in-memory computation, and it also supports graph computation, stream computation, and batch processing. Core members of Berkeley's AMPLab founded the company Databricks to develop cloud products around it.


Flink

Uses query optimization techniques similar to those of a SQL database, which is its main difference from the current version of Apache Spark: it can apply a global optimization plan to a query for better performance.


Kafka

Announcing the Confluent Platform 1.0: Kafka has been described as LinkedIn's "central nervous system", managing the streams of information gathered from the various applications, which are processed and then distributed everywhere. Unlike traditional enterprise message queuing systems, Kafka processes all the data flowing through a company in near real time, and has become the real-time data pipeline for LinkedIn, Netflix, Uber, and Verizon. Kafka's advantage is near-real-time processing.
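Kafka's core abstraction is a partitioned, append-only log that each consumer reads at its own offset. A toy in-memory model of a single partition (purely illustrative, not the Kafka client API):

```python
class Partition:
    """Toy model of one Kafka topic partition: an append-only list of
    records addressed by offset. Consumers track their own offsets, so
    any number of them can replay the same stream independently."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1            # offset of the new record

    def read(self, offset, max_records=10):
        # Reading never removes data: the log is retained, not dequeued.
        return self.records[offset:offset + max_records]

log = Partition()
log.append({"user": "u1", "event": "click"})    # offset 0
log.append({"user": "u2", "event": "view"})     # offset 1
```

Because consuming does not delete records, a new system (say, a freshly deployed analytics job) can attach later and replay the stream from offset 0, which is what makes Kafka work as a company-wide data backbone rather than a conventional queue.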


Storm

Handles five billion sessions a day in real time: Twitter's real-time computing framework. A stream processing framework is a distributed, highly fault-tolerant real-time computing system. Storm makes continuous stream computation easy, and is often used for real-time analytics, online machine learning, continuous computation, distributed RPC, and ETL.


Samza

The streaming framework LinkedIn mainly promotes. It has been compared with similar systems such as Spark and Storm, and it integrates well with Kafka as its primary storage and message broker.

Lambda Architecture

Nathan Marz wrote the article "How to beat the CAP theorem", proposing the Lambda Architecture. The main idea is to handle high-latency but high-volume data with a batch architecture, while handling instant, real-time data with a streaming framework, and then to build a serving layer on top that merges the data from both sides. This design balances real-time responsiveness with batch-scale throughput; it looks like a wild hack, but it really works and is used in production systems by many companies.
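The serving-layer merge at the heart of the Lambda Architecture can be sketched in a few lines; the view names and counts here are illustrative:

```python
def serve_query(batch_view, realtime_view, key):
    """Serving-layer merge: the batch view holds counts precomputed up to
    the last batch run; the realtime (speed) view holds counts for events
    that arrived since. A query sees the sum of both."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

batch_view = {"page_a": 10_000}   # e.g. recomputed nightly by the batch layer
realtime_view = {"page_a": 42}    # e.g. incremented live by the streaming layer
print(serve_query(batch_view, realtime_view, "page_a"))  # 10042
```

When the next batch run finishes, its output absorbs the events the speed layer was covering, and the corresponding realtime entries are discarded; streaming errors therefore never accumulate beyond one batch interval.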


Summingbird

The Lambda Architecture's drawback is having to maintain two sets of systems, so Twitter developed Summingbird: write the logic once, run it in multiple places. It connects batch and stream processing seamlessly, merging the two modes to reduce the conversion overhead between them.


NoSQL

Data was traditionally stored in tree-like (hierarchical) structures, but those struggle to express many-to-many relationships; relational databases were invented to solve that problem. In recent years it has become clear that relational databases do not fit every need either, so new NoSQL systems appeared, such as Cassandra, MongoDB, and Couchbase. NoSQL divides into several categories: document stores, graph databases, column stores, and key-value stores; different systems solve different problems. There is no one-size-fits-all solution.


Cassandra

In big data architectures, Cassandra's primary role is to store structured data. DataStax's Cassandra is a column-oriented database that provides high availability and durability through a distributed architecture. It supports hyper-scale clusters and offers a consistency model called "eventual consistency", which means that at any given point in time, the same database entry may hold different values on different servers.
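Eventual consistency means divergent replicas must eventually be reconciled; a common rule, and Cassandra's default per cell, is last-write-wins by timestamp. A toy sketch of the idea (the data shapes here are illustrative):

```python
def reconcile(replica_values):
    """Last-write-wins reconciliation: each replica reports a
    (timestamp, value) pair for the same cell; the value carrying the
    highest timestamp is kept."""
    return max(replica_values)[1]   # tuples compare by timestamp first

# Three replicas diverged during a network partition; once they compare
# notes, the newest write prevails and all replicas converge to it.
replicas = [(100, "v1"), (105, "v2"), (103, "v1")]
print(reconcile(replicas))  # v2
```

The trade-off is visible in the example: the write at timestamp 103 is silently discarded, which is why last-write-wins is only acceptable for data where overwriting is the intended semantics.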

SQL on Hadoop

There are many SQL-on-Hadoop projects in the open source community, focused on competing with commercial data warehouse systems. They include Apache Hive, Spark SQL, Cloudera Impala, Hortonworks Stinger, Facebook Presto, Apache Tajo, and Apache Drill. Some are based on Google's Dremel design.


Impala

Cloudera's new query system provides SQL semantics for querying petabytes of big data stored in Hadoop's HDFS and HBase. It claims to be 5-10x faster than Hive, though it has recently been overshadowed by Spark's thunder, with momentum shifting toward the latter.


Drill

The Apache community's open-source counterpart to Dremel: Drill, a distributed system designed for interactive analysis of large datasets.


Druid

An open-source data store designed for real-time statistical analysis on top of large datasets. The system combines a column-oriented storage layer, a distributed shared-nothing architecture, and an advanced indexing structure to allow arbitrary exploratory analysis of billion-row tables within seconds.

Berkeley Data Analytics Stack

Spark was covered above, but Berkeley's AMPLab has an even bigger blueprint: BDAS, the Berkeley Data Analytics Stack, with many star projects besides Spark, including:

Mesos: a resource management platform for distributed environments that lets Hadoop, MPI, and Spark jobs execute under unified resource management. It has good support for Hadoop 2.0, and Twitter and Coursera are using it.

Tachyon: a highly fault-tolerant distributed file system that lets cluster frameworks such as Spark and MapReduce share files reliably at memory speed. Project founder Haoyuan Li says development is moving very fast, its growth even more impressive than Spark's at the same stage, and he has founded the startup Tachyon Nexus.

BlinkDB: also very interesting: a massively parallel query engine that runs interactive SQL queries over massive volumes of data. It lets users trade result accuracy for faster query response time, keeping the error within an allowed range.
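BlinkDB's accuracy-for-latency trade-off can be illustrated with plain uniform sampling: answer from a small random sample of the table and report a standard error alongside the estimate. This is a sketch of the idea only, not BlinkDB's actual stratified-sampling implementation:

```python
import random
import statistics

def approximate_mean(data, sample_fraction=0.01, seed=0):
    """Estimate the mean from a small random sample instead of scanning
    the full table, returning (estimate, standard_error). A fixed seed
    makes the illustration reproducible."""
    rng = random.Random(seed)
    k = max(2, int(len(data) * sample_fraction))
    sample = rng.sample(data, k)
    est = statistics.mean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return est, se

data = list(range(100_000))        # true mean is 49999.5
est, se = approximate_mean(data)   # scans only ~1% of the rows
```

Scanning 1% of the rows gives an answer roughly 100x faster, and the standard error tells the user how far the estimate can plausibly be from the exact value, which is precisely the contract a bounded-error query engine offers.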


CDH

The classic distribution offered by Cloudera, the big brother of the Hadoop world.

HDP (Hortonworks Data Platform)

The architecture and component selection proposed by Hortonworks.


Amazon Redshift is a version of ParAccel. It uses an MPP (massively parallel processing) architecture and is a very convenient data warehousing solution with a SQL interface that connects seamlessly with the other cloud services. Its biggest feature is speed, with very good performance from TB up to PB scale; I also use it directly in my own work. It supports different hardware platforms, and if you want it faster still, you can use SSDs.


A fully AWS-based data processing solution.


Reference Links

The Hadoop Ecosystem Table

How to beat the CAP theorem

Lambda Architecture

Questioning the Lambda Architecture

