Hadoop Learning Notes 2: Hadoop Origins and System Overview [continued]

Last Update:2014-08-01 Source: Internet

Author: User


Hadoop origin and System Overview

Hadoop's origin -- Lucene

Lucene is an open-source project developed by Doug Cutting. It is written in Java and implements full-text search similar to Google's. It provides a full-text search engine architecture, including a complete query engine and index engine.

It was first released on the author's personal website and on SourceForge, and in 2001 it became a subproject of the Apache Software Foundation's Jakarta project.

Lucene aims to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text retrieval engine on top of it.

In big data scenarios, Lucene faced the same difficulties as Google. This pushed Doug Cutting to study and imitate Google's solutions to these problems [it can be said that Doug built his own version of Google's solution].

A miniature version built on Lucene: Nutch

From Lucene to Nutch, from Nutch to Hadoop

In 2003-2004, Google made public some details of its GFS and MapReduce ideas. Building on these, Doug Cutting and others spent two years implementing a DFS and a MapReduce mechanism, which dramatically improved Nutch's performance.

Later, Yahoo! hired Doug Cutting and took on his project.

Hadoop was officially introduced to the Apache Foundation in the fall of 2005 as part of Nutch, a Lucene subproject. In February-March 2006, MapReduce and the Nutch Distributed File System (NDFS) were both moved into a project called Hadoop.

Hadoop's name comes from a toy elephant that belonged to Doug Cutting's son.

Hadoop's current standing

1. Hadoop has become the de facto standard open-source software for implementing cloud computing.

2. It contains dozens of actively developed subprojects (such as HBase, Hive, and Pig).

3. It already runs on clusters of thousands of nodes and keeps breaking world records for data processing and sorting times.

Hadoop subproject family

MapReduce and HDFS: open-source implementations of Google's MapReduce and GFS. They are the two pillars of Hadoop technology.

HBase: an open-source implementation of Bigtable. It is a NoSQL [non-relational] database that stores data by column [unlike traditional row-based databases such as Oracle]. It is aimed mainly at data analysis, where column-based storage can speed up responses and reduce I/O volume.

Pig: a lightweight programming language. Commands written in it, which resemble operating system commands, are converted into MapReduce programs; Pig controls the MapReduce work and returns the result to the user. [It acts like a converter between the user and MapReduce.]

Hive: roughly a translator from SQL to MapReduce programs, intended for database engineers. [Hive can be thought of as a relational database, but its functions are relatively simple. It supports only part of the SQL-92 standard.]

Chukwa: roughly a data integration tool.

ZooKeeper: mainly used for communication between processes on different server nodes.
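To make the MapReduce pillar named above more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps for a word count. It is plain Python, not Hadoop's API; the function names `map_phase`, `shuffle`, and `reduce_phase` are made up for illustration, and a real cluster would run the map and reduce calls on many nodes in parallel.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    # Run every line through map, group by word, then reduce each group.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["hadoop stores data", "hadoop processes data"])
print(counts)
```

The user only writes the map and reduce functions; grouping by key, distributing the work, and handling failures are what the framework contributes.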

Hadoop Architecture

Rack: a server rack

1. NameNode [name node]

A) The HDFS daemon

B) Records how files are split into data blocks and which nodes those blocks are stored on

C) Centrally manages memory and I/O

D) It is a single point of failure: if it goes down, the whole cluster goes down.
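The bookkeeping in point B can be pictured as two lookup tables: file path to block IDs, and block ID to the DataNodes holding a copy. The sketch below is a toy model under stated assumptions, not HDFS code: `NameNodeSketch`, the tiny `BLOCK_SIZE`, and the round-robin placement are all invented for illustration (real HDFS placement is rack-aware, and blocks are tens of megabytes).

```python
BLOCK_SIZE = 64  # illustrative units only; real HDFS blocks are far larger


class NameNodeSketch:
    """Toy metadata table: file -> block ids -> DataNodes holding each block."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        # Cannot keep more copies than there are nodes.
        self.replication = min(replication, len(self.datanodes))
        self.file_blocks = {}      # path -> [block_id, ...]
        self.block_locations = {}  # block_id -> [datanode, ...]
        self._next_block = 0

    def add_file(self, path, size):
        n_blocks = max(1, -(-size // BLOCK_SIZE))  # ceiling division
        ids = []
        for _ in range(n_blocks):
            block_id = self._next_block
            self._next_block += 1
            # Assumed round-robin placement, chosen only to keep the sketch short.
            start = block_id % len(self.datanodes)
            self.block_locations[block_id] = [
                self.datanodes[(start + i) % len(self.datanodes)]
                for i in range(self.replication)
            ]
            ids.append(block_id)
        self.file_blocks[path] = ids
        return ids

    def locate(self, path):
        """Answer a client's question: which nodes hold each block of this file?"""
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


nn = NameNodeSketch(["dn1", "dn2", "dn3", "dn4"])
nn.add_file("/logs/a.txt", size=150)
print(nn.locate("/logs/a.txt"))
```

Because both tables live in one process, losing that process loses the only map from files to blocks, which is why point D calls the NameNode a single point of failure.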

2. Secondary NameNode [secondary name node]

A) An auxiliary daemon that monitors the state of HDFS

B) Each cluster has one

C) Communicates with the NameNode and periodically saves snapshots of the HDFS metadata

D) When the NameNode fails, it can serve as a backup NameNode (recovery from its latest snapshot is a manual step, not an automatic failover).
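The periodic snapshot in point C can be sketched as a timer-driven copy of the NameNode's metadata. This is a simplified illustration, not the real checkpoint mechanism: `SecondaryNameNodeSketch`, `maybe_checkpoint`, and the one-hour interval are assumptions made for the example.

```python
import copy


class SecondaryNameNodeSketch:
    """Keeps the most recent copy of the NameNode metadata, refreshed on a timer."""

    def __init__(self, interval_seconds=3600):  # assumed interval for illustration
        self.interval = interval_seconds
        self.last_checkpoint_at = None
        self.snapshot = None

    def maybe_checkpoint(self, namenode_metadata, now):
        """Take a snapshot if none exists yet or the interval has elapsed."""
        due = (
            self.last_checkpoint_at is None
            or now - self.last_checkpoint_at >= self.interval
        )
        if due:
            # Deep copy so later changes on the NameNode do not alter the snapshot.
            self.snapshot = copy.deepcopy(namenode_metadata)
            self.last_checkpoint_at = now
        return due


metadata = {"/logs/a.txt": [0, 1, 2]}
snn = SecondaryNameNodeSketch()
first = snn.maybe_checkpoint(metadata, now=0)      # first call always snapshots
metadata["/logs/b.txt"] = [3]                      # change made after the snapshot
second = snn.maybe_checkpoint(metadata, now=1800)  # too early, nothing saved
print(first, second, snn.snapshot)
```

The run shows the limitation of recovering from a snapshot: anything written after the last checkpoint (here `/logs/b.txt`) is missing from it.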

3. DataNode

A) Runs on every slave server

B) Reads and writes HDFS data blocks on the local file system

HDFS = NameNode + Secondary NameNode + DataNode

4. JobTracker [job tracker, on the master server]

A) A daemon that handles jobs (user-submitted code)

B) Decides which files are involved in processing, splits the job into tasks, and assigns them to nodes

C) Monitors tasks and restarts failed ones (possibly on a different node)

D) Each cluster has only one JobTracker, located on the master node.
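Points B and C above, splitting a job into tasks and retrying failures on another node, can be sketched as a small loop. This is an illustrative model only: `run_job`, `run_task`, the node names, and the `max_attempts` value are hypothetical, and the real JobTracker schedules asynchronously based on TaskTracker reports.

```python
def run_job(splits, nodes, run_task, max_attempts=4):
    """Create one task per input split; on failure, retry on the next node."""
    results = {}
    for i, split in enumerate(splits):
        for attempt in range(max_attempts):
            # Each retry moves to a different node than the previous attempt.
            node = nodes[(i + attempt) % len(nodes)]
            try:
                results[split] = (node, run_task(node, split))
                break
            except RuntimeError:
                continue  # task failed on this node, try the next one
        else:
            raise RuntimeError(f"task for {split} failed {max_attempts} times")
    return results


def run_task(node, split):
    """Stand-in for user code; the node 'slave2' is simulated as broken."""
    if node == "slave2":
        raise RuntimeError("node down")
    return len(split)


results = run_job(
    ["part-0", "part-1", "part-2"],
    ["slave1", "slave2", "slave3"],
    run_task,
)
print(results)
```

The task for `part-1` is first assigned to the broken node and then succeeds on `slave3`, so the job as a whole completes despite the failure.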

5. TaskTracker

A) Located on the slave nodes, alongside the DataNode (the principle of keeping code together with its data)

B) Manages the tasks on its node (as assigned by the JobTracker)

C) Each node has only one TaskTracker, but one TaskTracker can start multiple JVMs to run map or reduce tasks in parallel.

D) Interacts with the JobTracker [reports its own status to the JobTracker]
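Points C and D can be pictured as a fixed number of task slots plus a status report sent to the JobTracker. The class below is a rough sketch under assumptions: `TaskTrackerSketch`, the slot counts, and the fields of the heartbeat dictionary are invented for illustration and do not mirror Hadoop's actual message format.

```python
class TaskTrackerSketch:
    """One tracker per node; each slot stands for a JVM that can run one task."""

    def __init__(self, name, map_slots=2, reduce_slots=2):  # assumed slot counts
        self.name = name
        self.capacity = {"map": map_slots, "reduce": reduce_slots}
        self.running = {"map": [], "reduce": []}

    def launch(self, kind, task_id):
        """Start a task if a slot of that kind is free; return whether it started."""
        if len(self.running[kind]) >= self.capacity[kind]:
            return False
        self.running[kind].append(task_id)
        return True

    def finish(self, kind, task_id):
        """Release the slot held by a completed task."""
        self.running[kind].remove(task_id)

    def heartbeat(self):
        """Status report for the JobTracker: free slots and running tasks."""
        return {
            "tracker": self.name,
            "free_map_slots": self.capacity["map"] - len(self.running["map"]),
            "free_reduce_slots": self.capacity["reduce"] - len(self.running["reduce"]),
            "running": sorted(self.running["map"] + self.running["reduce"]),
        }


tt = TaskTrackerSketch("slave1")
started = [tt.launch("map", t) for t in ("m1", "m2", "m3")]
report = tt.heartbeat()
print(started, report)
```

With two map slots, the third map task is refused, and the heartbeat tells the JobTracker that this node has no free map slots but can still accept reduce tasks.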

6. Master and slave

A) Master: runs the NameNode, Secondary NameNode, and JobTracker, plus a browser (for viewing the management interface) and other Hadoop tools

B) Slave: runs the TaskTracker and DataNode

C) The master is not necessarily a single machine: generally, the NameNode and JobTracker are placed on the same server, and the Secondary NameNode is placed on another server.

Hadoop ideas

Analysis methods in the Hadoop system

1) Mainstream: Java programs

2) Lightweight scripting language: Pig

3) Smooth transition for SQL skills: Hive

4) NoSQL: HBase
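For the HBase option, the earlier claim that column-based storage reduces I/O during analysis can be illustrated by counting how many stored values a simple sum has to touch under each layout. This is a deliberately simplified sketch: the sample table and both function names are made up, and HBase itself groups columns into column families rather than storing each column fully separately.

```python
rows = [
    {"user": "u1", "city": "Hangzhou", "amount": 30},
    {"user": "u2", "city": "Beijing", "amount": 50},
    {"user": "u3", "city": "Hangzhou", "amount": 20},
]


def sum_row_store(table, field):
    """Row layout: every field of every row is loaded to read one field."""
    total = 0
    values_read = 0
    for row in table:
        values_read += len(row)  # the whole row comes off disk together
        total += row[field]
    return total, values_read


# Column layout: all values of one column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sum_column_store(cols, field):
    """Column layout: only the requested column is loaded."""
    column = cols[field]
    return sum(column), len(column)


row_result = sum_row_store(rows, "amount")
col_result = sum_column_store(columns, "amount")
print(row_result, col_result)
```

Both layouts give the same total, but the row layout touches all nine stored values while the column layout touches only the three in the `amount` column, and the gap grows with the number of columns.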

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
