The Hadoop system runs on a compute cluster of commodity servers, which provides both large-scale parallel computing resources and large-scale distributed data storage.
In terms of big data processing software, with the open-source development of the Apache Hadoop system, the Hadoop platform has grown from its original core subsystems, HDFS, MapReduce, and HBase, into a complete large-scale data processing ecosystem. Figure 1-15 shows the basic components and ecosystem of the Hadoop platform.
1. MapReduce parallel computing framework
The MapReduce parallel computing framework is a parallel program execution system. It provides a parallel processing model and procedure consisting of a Map stage and a Reduce stage, together with a parallel programming model and interface that allows programmers to write big data parallel programs quickly and easily. MapReduce processes input data as key-value pairs and automatically handles data partitioning and scheduling management. During program execution, the MapReduce framework is responsible for scheduling and allocating computing resources, partitioning input and output data, scheduling program execution, monitoring execution status, synchronizing the computing nodes, and collecting intermediate results. The MapReduce framework provides a complete set of programming interfaces for programmers to develop MapReduce applications.
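For illustration, the following is a minimal word-count sketch written against the standard org.apache.hadoop.mapreduce API; the class name and the input/output paths passed on the command line are hypothetical examples, not part of the text above.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map stage: emit (word, 1) for every word in the input line.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce stage: sum the counts collected for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count"); // the framework handles scheduling,
        job.setJarByClass(WordCount.class);            // data partitioning and monitoring
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```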
2. Distributed File System HDFS
HDFS (Hadoop Distributed File System) is an open-source distributed file system similar to Google GFS. It provides a scalable, reliable, and highly available storage management system for large-scale data. Built on top of the local Linux file systems that are physically distributed across the data storage nodes, it presents upper-layer applications with a logically unified large-scale file system. Similar to GFS, HDFS adopts a multi-replica (three replicas by default) redundant storage mechanism and provides effective data error detection and recovery mechanisms, which greatly improves the reliability of data storage.
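To show how an upper-layer application sees this logically unified file system, the short sketch below writes and reads a file through the standard org.apache.hadoop.fs.FileSystem API; the NameNode address and file path are hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; block replication defaults to 3 copies.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/hello.txt");

        // Write: the client sees one logical file; HDFS stores replicated blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
        fs.close();
    }
}
```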
3. Distributed database management system HBase
To overcome HDFS's difficulty in managing structured and semi-structured mass data, Hadoop provides HBase, a large-scale distributed database management and query system. HBase is a distributed database built on HDFS; it is a distributed, scalable NoSQL database that provides real-time read/write and random access to structured, semi-structured, and even unstructured big data. HBase provides a three-dimensional data management model based on rows, columns, and timestamps. Each HBase table can hold up to several billion rows, and each row can contain up to several million fields (columns).
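The sketch below illustrates the row/column-family/timestamp model and the random read/write access described above using the standard HBase client API; the table name "user" and column family "info" are hypothetical and assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {

            // Write one cell: row key + column family + column qualifier (+ implicit timestamp).
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Random read of the same cell by row key.
            Get get = new Get(Bytes.toBytes("row-001"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```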
4. Public Service Module Common
Common is a set of class libraries and API programming interfaces that provide underlying support services and common utilities for the entire Hadoop system. These underlying services include the Hadoop abstract file system FileSystem, remote procedure call (RPC), the system configuration tool Configuration, and the serialization mechanism. In versions 0.20 and earlier, Common contained HDFS, MapReduce, and other common project content; since version 0.21, HDFS and MapReduce have been separated into independent subprojects, and the remaining parts constitute Hadoop Common.
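As a small illustration of the serialization mechanism mentioned above, the sketch below defines a custom record type that implements the org.apache.hadoop.io.Writable interface provided by Common; the record and field names are hypothetical.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A hypothetical record type that Hadoop can serialize between nodes
// (for example as a MapReduce value) via Common's Writable mechanism.
public class PageVisit implements Writable {
    private String url;
    private long count;

    public PageVisit() { }                  // no-arg constructor required by the framework

    public PageVisit(String url, long count) {
        this.url = url;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);                  // serialize fields in a fixed order
        out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();                 // deserialize in the same order
        count = in.readLong();
    }
}
```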
5. Data Serialization System Avro
Avro is a data serialization system used to transform data structures or data objects into a format that facilitates data storage and network transmission. Avro offers rich data structure types, a fast and compressible binary data format, container files for storing persistent data, remote procedure call (RPC), and simple dynamic language integration.
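For illustration, the sketch below serializes one record into an Avro container file using the Avro Java library's generic API; the schema and file name are made up for the example.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema describing a simple user record.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build a record and append it to a compact binary container file.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
            fileWriter.create(schema, new File("users.avro"));
            fileWriter.append(user);
        }
    }
}
```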
6. Distributed Coordination Services Framework Zookeeper
Zookeeper is a distributed coordination service framework primarily designed to solve consistency problems in distributed environments. It provides functions frequently needed by distributed applications, such as system reliability maintenance, data state synchronization, unified naming service, and management of distributed application configuration items. Zookeeper can maintain important operational and management data in a distributed environment and provide mechanisms for monitoring changes in data state; working together with other Hadoop subsystems (such as HBase, Hama, etc.) or user-developed applications, it solves the problems of system reliability management and data state maintenance in distributed environments.
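The short sketch below shows this kind of usage with the standard ZooKeeper Java client: storing a configuration item in a znode and registering a watch to be notified when it changes. The connection string, znode path, and value are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZookeeperExample {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) ZooKeeper ensemble.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000,
                event -> System.out.println("Session event: " + event.getState()));

        // Publish a shared configuration item as a persistent znode.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "128".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read it back and register a watch so we are notified when it changes.
        byte[] data = zk.getData(path,
                event -> System.out.println("Config changed: " + event.getPath()), null);
        System.out.println("Current value: " + new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}
```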
7. Distributed Data Warehouse Processing Tool Hive
Hive is a data warehouse built on Hadoop for managing structured and semi-structured data stored in HDFS or HBase. It was originally developed by Facebook to process and analyze large amounts of user and log data, and Facebook contributed it to the Apache Hadoop open-source project in 2008. To make it easy for traditional database users familiar with SQL to use the Hadoop system for data query and analysis, Hive allows the SQL-like HiveQL query language to be used directly as a programming interface for writing data query and analysis programs, and it provides data extraction and transformation, storage management, and query analysis; the HiveQL statements are converted at the underlying level into corresponding MapReduce programs for execution.
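To illustrate the HiveQL interface, the sketch below submits a SQL-like query through Hive's JDBC driver (HiveServer2); the connection URL, table, and columns are hypothetical, and Hive translates the statement into MapReduce jobs underneath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // JDBC driver shipped with Hive for connecting to HiveServer2.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hadoop", "");
             Statement stmt = conn.createStatement()) {

            // A SQL-like HiveQL query over a (hypothetical) log table stored in HDFS;
            // Hive compiles it into one or more MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                    "SELECT user_id, COUNT(*) AS visits "
                    + "FROM web_logs GROUP BY user_id ORDER BY visits DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```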
8. Data Flow Processing Tool Pig
Pig is a platform for processing large data sets, contributed by Yahoo! to Apache as an open-source project. It simplifies data analysis on Hadoop by offering a domain-oriented, high-level abstraction language called Pig Latin, which allows programmers to implement complex data analysis tasks as data-flow scripts; the Pig operations are automatically converted by the system into a chain of MapReduce tasks and executed on Hadoop. Yahoo! runs a large number of its MapReduce jobs through Pig.
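As an illustration of the Pig Latin data-flow style, the sketch below runs a small script through Pig's embedded Java interface (PigServer); the input file, field names, and output path are hypothetical, and Pig turns the script into a MapReduce task chain.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLatinExample {
    public static void main(String[] args) throws Exception {
        // Run on the cluster (ExecType.LOCAL would run the same script locally).
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // A small Pig Latin data-flow script over a hypothetical access log.
        pig.registerQuery("logs = LOAD '/data/access_log' USING PigStorage('\\t') "
                + "AS (user:chararray, url:chararray);");
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("counts = FOREACH by_user GENERATE group AS user, COUNT(logs) AS hits;");

        // Pig compiles the data flow into a chain of MapReduce jobs and runs it.
        pig.store("counts", "/data/user_hit_counts");
        pig.shutdown();
    }
}
```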
9. Key-value database system Cassandra
Cassandra is a distributed key-value database system originally developed by Facebook for storing relatively simple formatted data such as inbox data; Facebook later contributed it to Apache as an open-source project. Built on Amazon's fully distributed Dynamo architecture and combined with Google BigTable's column-family data model, Cassandra is a highly scalable, eventually consistent, distributed, structured key-value storage system. By combining Dynamo's distribution techniques with BigTable's data model, it better meets the needs of massive data storage. At the same time, Cassandra replaces vertical scaling with horizontal scaling and offers richer features than other typical key-value storage models.
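As one common way to access Cassandra from Java, the sketch below uses the DataStax Java driver and CQL to model the column-family idea described above; the keyspace, table, and contact point are hypothetical, and the driver is a separate project rather than part of Hadoop itself.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
    public static void main(String[] args) {
        // Connect to a (hypothetical) Cassandra node.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // Rows are grouped into column families (tables) inside a keyspace.
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                    + "WITH replication = {'class':'SimpleStrategy', 'replication_factor':1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.inbox "
                    + "(user_id text, msg_id timeuuid, body text, PRIMARY KEY (user_id, msg_id))");

            // Write and read a simple mailbox-style record.
            session.execute("INSERT INTO demo.inbox (user_id, msg_id, body) "
                    + "VALUES ('alice', now(), 'hello')");
            ResultSet rs = session.execute("SELECT body FROM demo.inbox WHERE user_id = 'alice'");
            for (Row row : rs) {
                System.out.println(row.getString("body"));
            }
        }
    }
}
```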
10. Log Data Processing System Chukwa
Chukwa is an open-source data collection system for monitoring large distributed systems, with good adaptability and scalability. It uses HDFS for data storage and MapReduce for data processing, and provides flexible yet powerful auxiliary tools to analyze, display, and monitor the collected data.
11. Scientific Computing Fundamentals Tool Library Hama
Hama is a computing framework based on the BSP (Bulk Synchronous Parallel) model. It mainly provides a set of supporting frameworks and tools for large-scale scientific computing and graph computing with complex data dependencies. Hama is similar to Pregel, which Google developed and uses for computations such as BFS and PageRank. Hama works seamlessly with Hadoop's HDFS, using HDFS to persist the tasks and data that need to be run. Owing to the flexibility of the BSP parallel computing model, the Hama framework can be widely applied to large-scale scientific computing and graph computing, performing big data computing and processing tasks such as matrix computation, ranking, PageRank, and BFS.
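The following is a minimal, framework-independent sketch of the BSP superstep pattern described above (local computation, message exchange, barrier synchronization). It is written in plain Java purely for illustration and does not use the actual Hama API; all names and values are made up.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;

// Illustrative only: each worker repeatedly (1) computes on messages delivered
// by the previous superstep, (2) sends messages for the next superstep,
// (3) waits at a barrier before continuing.
public class BspSketch {
    static final int WORKERS = 4;
    static final int SUPERSTEPS = 3;
    // Double-buffered inboxes: messages sent in superstep s are read in superstep s + 1.
    @SuppressWarnings("unchecked")
    static final ConcurrentLinkedQueue<Integer>[][] INBOX = new ConcurrentLinkedQueue[2][WORKERS];
    static final CyclicBarrier BARRIER = new CyclicBarrier(WORKERS);

    public static void main(String[] args) {
        for (int buf = 0; buf < 2; buf++) {
            for (int w = 0; w < WORKERS; w++) {
                INBOX[buf][w] = new ConcurrentLinkedQueue<>();
            }
        }
        for (int w = 0; w < WORKERS; w++) {
            final int id = w;
            new Thread(() -> run(id)).start();
        }
    }

    static void run(int id) {
        try {
            int value = id;
            for (int step = 0; step < SUPERSTEPS; step++) {
                // 1) Local computation: consume messages from the previous superstep.
                Integer msg;
                while ((msg = INBOX[step % 2][id].poll()) != null) {
                    value += msg;
                }
                // 2) Communication: send to the next worker, delivered next superstep.
                INBOX[(step + 1) % 2][(id + 1) % WORKERS].add(value);
                // 3) Barrier synchronization ends the superstep.
                BARRIER.await();
            }
            System.out.println("worker " + id + " final value = " + value);
        } catch (InterruptedException | BrokenBarrierException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```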
12. Data Analysis and Mining Tool Library Mahout
Mahout originated as a subproject of Apache Lucene. Its primary goal is to create and provide a library of classic machine learning and data mining algorithms in parallelized form, relieving programmers who need these algorithms for data mining from having to implement them themselves. Mahout now includes widely used machine learning and data mining algorithms such as clustering, classification, recommendation engines, and frequent itemset mining. In addition, it provides tools and frameworks for data input and output and for integration with other data storage and management systems.
13. Relational data exchange tool Sqoop
Sqoop is an abbreviation of SQL-to-Hadoop; it is a tool for fast, batch data exchange between relational databases and the Hadoop platform. It can import data in bulk from a relational database into Hadoop's HDFS, HBase, or Hive, and conversely export data from the Hadoop platform into a relational database. Sqoop takes full advantage of the parallelism of Hadoop MapReduce: the entire data exchange process is executed as parallel MapReduce jobs for fast processing.
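As an illustration, the sketch below launches a Sqoop import from Java by invoking the sqoop command-line tool; the JDBC URL, credentials, table, and target directory are made-up examples.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class SqoopImportExample {
    public static void main(String[] args) throws Exception {
        // Equivalent command line:
        //   sqoop import --connect jdbc:mysql://dbhost/sales --username etl --password secret \
        //         --table orders --target-dir /data/orders -m 4
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost/sales",  // hypothetical source database
                "--username", "etl", "--password", "secret", // for illustration only
                "--table", "orders",                       // table to import in parallel
                "--target-dir", "/data/orders",            // HDFS destination directory
                "-m", "4");                                // number of parallel map tasks
        pb.redirectErrorStream(true);
        Process process = pb.start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        System.exit(process.waitFor());
    }
}
```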
14. Log Data Collection Tool Flume
Flume is a distributed, highly reliable, and highly available system developed and maintained by Cloudera for collecting, aggregating, and transmitting large amounts of log data in complex environments. It abstracts the process of generating, transmitting, processing, and finally exporting data into a data stream: data senders can be defined at the data source, a variety of transport protocols are supported, and simple processing of the log data such as filtering and format conversion is provided. On the output side, Flume supports writing log data to user-customized output destinations.