Let's start with the era before big data arrived, told through the analogy of a city; a "landscape", after all, is just a piece of scenery.
Since the advent of the big data concept, there has been a wealth of study, discussion, and practice around mathematics, statistics, algorithms, and programming languages. In the earlier era, algorithms and mathematical knowledge were the building materials (the rebar and bricks), a programming language was the binder (the cement), and together they formed a small house (an application), one of many in a small village (a server). Between villages there were no highways (GFS, HDFS, Flume, Kafka, and the like), only muddy roads (such as RPC), and the economy was a cottage industry. When the internet was young and slow, this old-fashioned way was perfectly manageable, but the rise of social networks and smartphones changed everything: website traffic grew a hundredfold, data became far more diverse, and hardware performance could no longer be counted on to improve steadily along Moore's law. The small-village, small-workshop production model was doomed to hit its limits. People needed a more powerful model... Big data applications divide broadly into offline computation and real-time computation. As data volumes grew, the OLTP model became hard to sustain and OLAP became the mainstream. Whether real-time or offline, the basic idea is the same: divide and conquer.
At first, people thought a strong central database was enough: build high-throughput roads between all the villages and one all-inclusive (non-relational, NoSQL) warehouse to move the great variety of heterogeneous goods each village produced, and the economy would grow. But it didn't take long to realize this idea was too young, too simple, because the warehouse's capacity was always capped.
Then Google proposed the concept of MapReduce to solve the problem of coordinating computation across large clusters: since a single computer's performance is limited, why not unite many of them? The ambition was to build a road connecting every village, namely GFS, the Google File System, which joins the hard disks of many servers so that from the outside they look like one enormous disk. On top of it sits MapReduce, a factory that dispatches the labor and supplies of every village so that the villages operate as a single economy. The inhabitants grew wealthy.
However, only "Google Town" got rich; the rest of the world's villages and towns still lived a primitive life. Then a group of people from Yahoo and Apache, in the open-source spirit of sharing, imitated Google's ideas and created HDFS (the Hadoop Distributed File System, corresponding to GFS) and Hadoop (corresponding to Google's MapReduce), and published all the blueprints for free use worldwide. Factories were built around the whole world, and people prospered. In this era, Hadoop became known as the big data infrastructure.
As the saying goes, once fed and clothed, people want more: the factory leadership was no longer satisfied with the village factories' extensive, labor-intensive production, nor did they want to hire so many workers. So Mahout, HBase, Hive, and Pig came into being. They are the CNC machine tools and machining centers: only a few operators are needed to keep the whole factory running. Since then, people have been well fed and clothed.
Of course, a few more ambitious capitalists, unsatisfied with current productivity and in pursuit of higher profits (the essence of capitalism), developed a more efficient system, Spark, which can produce goods at ten times the speed of Hadoop. The new era has only just begun...
HBase: A highly reliable, high-performance, column-oriented, scalable distributed storage system; HBase can be used to build large structured-data clusters on inexpensive PC servers. Facebook, for example, uses it for large real-time applications; see "Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day".
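To make the column-oriented model concrete, here is a minimal sketch of writing and reading one cell through HBase's standard Java client API from Scala; the table name "events", column family "cf", qualifier "count", and row key are all hypothetical.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseSketch extends App {
  // Reads cluster settings (e.g. the ZooKeeper quorum) from hbase-site.xml.
  val conf = HBaseConfiguration.create()
  val connection = ConnectionFactory.createConnection(conf)
  // Hypothetical table "events" with column family "cf".
  val table = connection.getTable(TableName.valueOf("events"))

  // Write one cell: row key -> column family:qualifier -> value.
  val put = new Put(Bytes.toBytes("user42#2015-01-01"))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(1L))
  table.put(put)

  // Read the same cell back.
  val result = table.get(new Get(Bytes.toBytes("user42#2015-01-01")))
  val count = Bytes.toLong(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("count")))
  println(s"count = $count")

  table.close()
  connection.close()
}
```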
Pig: Developed by Yahoo, a parallel data-flow processing engine that includes a scripting language, Pig Latin, for describing these data flows. Pig Latin provides many traditional data operations out of the box while allowing users to write their own functions to read, process, and write data. It is also heavily used at LinkedIn.
Hive: A data warehouse tool led by Facebook that maps structured data files to database tables and provides full SQL query functionality by translating SQL statements into MapReduce jobs. Its advantage is a low learning curve: simple MapReduce statistics can be implemented quickly with SQL-like statements, so a data scientist can query directly without learning another programming interface.
Cascading/Scalding: Cascading is technology from a company acquired by Twitter; it mainly provides abstract interfaces over data pipelines. Twitter then released a Cascading-based Scala version called Scalding. Coursera uses Scalding as its MapReduce programming interface, running on Amazon's EMR.
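To show what the Scalding style looks like, here is the canonical word-count job in Scalding's fields-based Scala API, as a minimal sketch; the input and output paths are hypothetical command-line arguments.

```scala
import com.twitter.scalding._

// A minimal Scalding job: read lines, split into words, count per word.
// Run with e.g.: ... WordCountJob --local --input in.txt --output out.tsv
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                                            // one tuple per line
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }  // map phase: emit words
    .groupBy('word) { _.size }                                       // reduce phase: count per word
    .write(Tsv(args("output")))                                      // write (word, count) pairs
}
```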
ZooKeeper: A distributed, open-source coordination service for distributed applications; an open-source implementation of Google's Chubby.
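As a small sketch of what such a coordination service offers, the following uses ZooKeeper's standard Java client from Scala to publish and read back a piece of shared configuration; the connection string and znode path are hypothetical.

```scala
import org.apache.zookeeper.{CreateMode, WatchedEvent, Watcher, ZooDefs, ZooKeeper}

object ZkSketch extends App {
  // Connect to a (hypothetical) local ZooKeeper ensemble.
  val zk = new ZooKeeper("localhost:2181", 5000, new Watcher {
    override def process(event: WatchedEvent): Unit =
      println(s"event: ${event.getType} on ${event.getPath}")
  })

  // Publish a piece of shared configuration as a persistent znode.
  zk.create("/app-config", "v1".getBytes("UTF-8"),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)

  // Any process in the cluster can now read (and watch) the same value.
  val value = new String(zk.getData("/app-config", false, null), "UTF-8")
  println(s"config = $value")

  zk.close()
}
```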
Oozie: An open-source framework built on a workflow engine, contributed by Cloudera to Apache; it provides task scheduling and coordination for Hadoop MapReduce and Pig jobs.
Azkaban: Similar to the above; LinkedIn's open-source workflow system for Hadoop, providing cron-like task management.
Tez: Hortonworks' optimized MapReduce execution engine; compared with plain MapReduce, Tez delivers better performance.
Hadoop vs Spark

Hadoop: Distributed batch processing, with the emphasis on batch; often used for data mining and analysis.

Spark: An open-source cluster computing system based on in-memory computing, designed to make data analysis faster. Spark is a cluster computing environment similar to Hadoop, but with some useful differences that give it the advantage in certain workloads: besides providing interactive queries, Spark's in-memory distributed datasets optimize iterative workloads. Spark is implemented in Scala, which it uses as its application framework.
Unlike Hadoop, Spark and Scala are tightly integrated, and Scala can manipulate distributed datasets as easily as local collection objects. Although Spark was created to support iterative jobs on distributed datasets, it is really a complement to Hadoop: it can run in parallel over the Hadoop file system, a deployment supported through the third-party cluster framework Mesos.
Developed by the AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, Spark can be used to build large-scale, low-latency data analytics applications.
While Spark and Hadoop have similarities, Spark provides a new cluster computing framework with useful differences. First, Spark is designed for a specific type of cluster workload: workloads that reuse a working dataset across parallel operations, such as machine learning algorithms. To optimize them, Spark introduces the concept of in-memory cluster computing, caching the dataset in memory to cut access latency.
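A minimal sketch of that idea in Spark's Scala API: cache the dataset once, then iterate over it in memory. The input path, learning rate, and the toy one-dimensional regression are hypothetical illustrations, not a canonical example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("iterative-sketch").setMaster("local[*]"))

  // Hypothetical input: one "x,y" pair per line.
  val points = sc.textFile("hdfs:///data/points.csv")
    .map { line => val Array(x, y) = line.split(","); (x.toDouble, y.toDouble) }
    .cache() // keep the working set in memory across iterations

  // Gradient descent on a 1-D linear model y ~ w * x; every pass reuses the cached RDD.
  var w = 0.0
  for (_ <- 1 to 10) {
    val gradient = points.map { case (x, y) => x * (w * x - y) }.reduce(_ + _)
    w -= 0.01 * gradient
  }
  println(s"w = $w")

  sc.stop()
}
```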
When it comes to big data processing, Hadoop is the familiar name. Built on Google's MapReduce, Hadoop gives developers the map and reduce primitives, which make parallel batch programming simple and elegant.
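The paradigm itself fits in a few lines. Below is a conceptual sketch using plain Scala collections, not the Hadoop API: the map phase emits (key, value) pairs and the reduce phase aggregates all values sharing a key, which is exactly the word-count shape of a MapReduce job.

```scala
object MapReduceModel extends App {
  val lines = Seq("big data", "big clusters")

  // Map phase: each line is processed independently into (word, 1) pairs,
  // so this step parallelizes trivially across machines.
  val mapped: Seq[(String, Int)] =
    lines.flatMap(_.split("\\s+")).map(word => (word, 1))

  // Shuffle + reduce phase: group pairs by key, then fold each group's values.
  val reduced: Map[String, Int] =
    mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  println(reduced) // Map(big -> 2, data -> 1, clusters -> 1)
}
```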
Spark offers many more kinds of dataset operations than Hadoop, which provides only map and reduce. Operations such as map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, and partitionBy are collectively called transformations, and Spark also provides actions such as count, collect, reduce, lookup, and save.
These varied dataset operations are a convenience for upper-layer applications, and the communication model between processing nodes is no longer restricted to Hadoop's single data-shuffle pattern: users can name, materialize, and control the partitioning of intermediate results.
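Here is a short sketch contrasting lazy transformations with actions in the Spark Scala API; the numbers and the grouping key are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OpsSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("ops-sketch").setMaster("local[*]"))

  val nums  = sc.parallelize(1 to 100)
  val evens = nums.filter(_ % 2 == 0)      // transformation: lazy, nothing runs yet
  val pairs = evens.map(n => (n % 10, n))  // transformation: (last digit, value)
  val sums  = pairs.reduceByKey(_ + _)     // transformation: aggregate per key

  println(sums.count())                    // action: triggers the whole pipeline
  sums.collect().sorted.foreach(println)   // action: brings results to the driver

  sc.stop()
}
```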
Quoted from Zhihu:
Talking about Big Data and the Hadoop Family