Apache Hadoop Introductory Tutorial, Chapter I

Source: Internet
Author: User
Tags: benchmark, zookeeper, hadoop, mapreduce

Apache Hadoop is a distributed system infrastructure developed by the Apache Software Foundation. It enables users to develop reliable, scalable, distributed computing applications without needing to know the underlying details of the distributed environment.

The Apache Hadoop framework allows users to apply a simple programming model to the distributed processing of large datasets across computer clusters. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage, rather than relying on hardware to deliver high availability. The framework itself is designed to detect and handle failures at the application layer, so it can deliver a highly available service on top of a cluster of computers, each of which may be prone to failure.

The core design of the Apache Hadoop framework consists of HDFS and MapReduce: HDFS provides storage for massive amounts of data, while MapReduce provides computation over that data.
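To make this division of labor concrete, here is a minimal sketch of the classic word-count job, following the standard Hadoop 2.x MapReduce Java API: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts for each word. The input and output locations are whatever HDFS paths you pass in on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a JAR, a job like this is typically submitted with bin/hadoop jar wordcount.jar WordCount /input /output, where /input and /output are HDFS paths (the JAR name here is a placeholder).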

About Apache Hadoop

As mentioned in the previous section on MapReduce, Apache Hadoop was inspired by Google's GFS and MapReduce papers. The former gave rise to NDFS (the Nutch Distributed File System), the forerunner of the Hadoop Distributed File System, while the latter was implemented as well and is included in Apache Hadoop as one of its core components.

The prototype of Apache Hadoop began in 2002 with Apache Nutch, an open-source search engine implemented in Java. Nutch provides all the tools needed to run your own search engine, including full-text search and a Web crawler.

Then, in 2003, Google published a technical paper on the Google File System (GFS), a proprietary filesystem Google designed to store the massive amounts of data produced by its search operations.

In 2004, Nutch founder Doug Cutting (also the creator of Apache Lucene) implemented a distributed file storage system named NDFS, based on Google's GFS paper.

Also in 2004, Google published a technical paper that introduced MapReduce to the world. In 2005, Doug Cutting implemented MapReduce in the Nutch search engine as well, based on that paper.

In 2006, Yahoo! hired Doug Cutting, who upgraded NDFS and MapReduce and renamed the project Hadoop. Yahoo! set up an independent team, led by Doug Cutting, dedicated to developing Hadoop.

In January 2008, Hadoop became a top-level Apache project. Hadoop was subsequently adopted successfully by other companies, including Last.fm, Facebook, and the New York Times.

In February 2008, Yahoo! announced that its search engine products were deployed on a Hadoop cluster with 10,000 cores.

In April 2008, Hadoop broke a world record, becoming the fastest system to sort 1 TB of data. For details, see the report Apache Hadoop Wins Terabyte Sort Benchmark (https://developer.yahoo.com/blogs/hadoop/apache-hadoop-wins-terabyte-sort-benchmark-408.html).

As of this writing, the latest version of Apache Hadoop is 2.7.3.

Apache Hadoop mainly has the following advantages:

High reliability. Hadoop's ability to store and process data, bit by bit, has proven trustworthy.
High scalability. Hadoop distributes data and computation across clusters of available computers, and these clusters can easily be extended to thousands of nodes.
Efficiency. Hadoop can move data dynamically between nodes and keep the load on individual nodes balanced, so processing is very fast.
High level of fault tolerance. Hadoop has the ability to automatically save multiple copies of data and automatically reassign failed tasks.
Low cost. Hadoop is open source, and the cost of software for a project is thus greatly reduced.
Apache Hadoop Core Components

Apache Hadoop contains the following modules:

Hadoop Common: Common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed filesystem that provides high-throughput access to application data (see the brief API sketch after this list).
Hadoop YARN: A job scheduling and cluster resource management framework.
Hadoop MapReduce: A parallel processing system for large data sets based on YARN.
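
To show how an application actually reads and writes HDFS data, the short sketch below uses the Hadoop FileSystem Java API. It is only an illustration: the NameNode address hdfs://namenode-host:8020 and the path /tmp/hello.txt are hypothetical placeholders, and in practice fs.defaultFS usually comes from core-site.xml rather than being set in code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
  public static void main(String[] args) throws Exception {
    // Normally picked up from core-site.xml on the classpath;
    // the NameNode address below is an assumption for illustration.
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hello.txt"); // hypothetical HDFS path

    // Write a small file; HDFS replicates its blocks automatically.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}

The equivalent operations are available from the command line via hdfs dfs -put and hdfs dfs -cat.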
Other projects related to Apache Hadoop include:

Ambari: A Web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides dashboards for viewing cluster health, such as heat maps, and lets you inspect MapReduce, Pig, and Hive applications in a user-friendly way so that their performance is easy to diagnose.
Avro: A data serialization system.
Cassandra: A scalable, multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables. (HBase is covered in later chapters.)
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
Spark: A fast and general computing engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. (Spark is covered in later chapters.)
Tez: A generalized data-flow programming framework built on Hadoop YARN. It provides a powerful, flexible engine that can execute an arbitrary DAG of tasks to process data for both batch and interactive use cases. Tez is being adopted by Hive, Pig, and other frameworks in the Hadoop ecosystem, and it can also serve as the underlying execution engine for other commercial software, such as ETL tools, replacing Hadoop MapReduce.
ZooKeeper: A high-performance coordination service for distributed applications. (ZooKeeper is covered in later chapters.)
