Hadoop origins
Involves: Lucene, Solr, Nutch, Hadoop
@author Ayy
@date 2014/12/21
1. Lucene, Nutch, SOLR
Lucene is a Java-based full-text search toolkit created by Doug Cutting, and one of the most famous projects at Apache. Its main functions are index processing, spell checking, text analysis, and tokenization.
(Lucene was originally published on a personal site and on SourceForge; in 2001 it became a subproject of Jakarta, an Apache project.)
SOLR and Nutch were also sub-projects under Lucene, but Nutch later became a separate project. Nutch's main function is crawling data and extracting relevant information; it can be thought of as a miniature version of Google.
SOLR is a Lucene-based server-side program that exposes search over HTTP APIs (returning formats such as XML and JSON, with client support for languages like Python) and provides management-related functions.
2. Hadoop was a sub-project under Nutch for distributed processing, and later became a project under Apache in its own right. Nutch could then use Hadoop for distributed data crawling and processing.
(In 2003-2004 Google published the implementation details of GFS and MapReduce, and Doug Cutting spent two years of spare time implementing the DFS and MapReduce mechanisms, which made Nutch's performance soar.
In 2005, Hadoop was formally introduced as an Apache sub-project, as part of Nutch under the Apache Lucene project.)
3. Hadoop Sub-project family
The bottom layer is Hadoop's core code, which implements the two pillars of Hadoop, MapReduce and HDFS; the other sub-projects are built on top of this core.
Pig is for non-Java programmers: a lightweight language with commands for loading, processing, and analyzing data. The system converts Pig scripts into MapReduce programs.
Hive is essentially a mapper from SQL to MapReduce. For database engineers, Hive transforms SQL into distributed MapReduce tasks, so Hive can be understood as a distributed database that supports roughly a subset of the SQL-92 standard.
HBase is a NoSQL, column-oriented storage database. Benefits: faster response times and reduced I/O. It is also a popular technology at present.
ZooKeeper is a coordination tool.
Chukwa is a data collector that can gather the data generated by servers.
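To make the Hive idea concrete, here is a toy single-process sketch (not Hive's actual planner, and the row data is made up) of how a query like `SELECT dept, COUNT(*) FROM emp GROUP BY dept` decomposes into a map phase and a reduce phase:

```python
from collections import defaultdict

# Hypothetical input rows standing in for a Hive table.
rows = [
    {"name": "a", "dept": "sales"},
    {"name": "b", "dept": "eng"},
    {"name": "c", "dept": "sales"},
]

def map_phase(row):
    # Emit (group key, 1) for each input row.
    yield (row["dept"], 1)

def shuffle(pairs):
    # Group intermediate values by key, as the framework would between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # COUNT(*) is just the sum of the emitted 1s.
    return key, sum(values)

pairs = [p for row in rows for p in map_phase(row)]
result = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(result)  # {'sales': 2, 'eng': 1}
```

Hive's real execution adds parsing, optimization, and cluster scheduling, but the GROUP BY-to-shuffle correspondence is the core of the translation.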
4. Hadoop architecture
Hdfs
The NameNode records how files are split into blocks and which nodes hold those blocks. Drawback: single point of failure.
Data read process: the client asks the NameNode for the block locations of a file, then reads the blocks directly from the DataNodes.
Data write process: the client asks the NameNode to allocate blocks, then writes each block to a pipeline of DataNodes, which replicate it.
The SecondaryNameNode communicates with the NameNode and periodically saves (checkpoints) the HDFS metadata. But at present, when the NameNode fails, the SecondaryNameNode must be brought in manually.
The DataNode writes HDFS data blocks to the file system of its local machine.
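The NameNode's bookkeeping can be pictured as two in-memory mappings: file path to ordered block list, and block to the DataNodes holding its replicas. A minimal sketch, with names, the round-robin placement, and sizes all assumptions rather than HDFS internals:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # classic HDFS default block size: 64 MB

file_to_blocks = {}   # e.g. "/logs/a.log" -> ["blk_/logs/a.log_0", ...]
block_locations = {}  # e.g. "blk_/logs/a.log_0" -> ["dn1", "dn2", "dn3"]

def add_file(path, size_bytes, datanodes, replication=3):
    """Split a file into fixed-size blocks and record replica locations."""
    n_blocks = max(1, -(-size_bytes // BLOCK_SIZE))  # ceiling division
    blocks = [f"blk_{path}_{i}" for i in range(n_blocks)]
    file_to_blocks[path] = blocks
    for i, blk in enumerate(blocks):
        # Round-robin placement stands in for HDFS's real placement policy.
        block_locations[blk] = [
            datanodes[(i + r) % len(datanodes)] for r in range(replication)
        ]

# A 150 MB file at a 64 MB block size splits into 3 blocks.
add_file("/logs/a.log", 150 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
print(file_to_blocks["/logs/a.log"])
```

Losing this metadata makes the blocks on the DataNodes unrecoverable, which is why the NameNode is the single point of failure noted above.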
Mapreduce
The JobTracker determines which files are involved in processing, splits the job into tasks, and assigns them to the appropriate nodes. Hadoop assigns each task to the node holding its data (data locality). The JobTracker also monitors task progress, reports to the user, and restarts failed tasks.
The TaskTracker interacts with the JobTracker and executes the assigned tasks.
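The flow that the JobTracker and TaskTrackers coordinate across a cluster can be sketched in a single process with the classic word count, assuming one map task per input split (illustrative only; real Hadoop runs these tasks on separate nodes):

```python
from collections import defaultdict

def run_job(splits):
    # "Map tasks": one per input split, each emitting (word, 1) pairs.
    intermediate = []
    for split in splits:
        for word in split.split():
            intermediate.append((word, 1))

    # "Shuffle": group the intermediate pairs by key.
    grouped = defaultdict(list)
    for word, count in intermediate:
        grouped[word].append(count)

    # "Reduce tasks": sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

counts = run_job(["hadoop nutch hadoop", "lucene hadoop"])
print(counts)  # {'hadoop': 3, 'nutch': 1, 'lucene': 1}
```

In a real cluster the JobTracker would hand each split to a TaskTracker near the data, and the shuffle would move intermediate pairs over the network between map and reduce nodes.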
A first look at Hadoop