Full-text Indexing (Lucene, Solr, Nutch, Hadoop): Nutch and Hadoop

Last Update:2014-10-11 Source: Internet

Author: User

Tags solr hadoop ecosystem



Last year I planned to write a detailed introduction to Lucene, Solr, Nutch, and Hadoop, but for lack of time I wrote only two articles, one on Lucene and one on Solr, and then stopped. I have wanted to continue ever since. I have not yet worked on a production Nutch or Hadoop project, but my company will soon start building big data monitoring on Hadoop. I have always believed in being prepared, so since last year I have kept studying Hadoop-related technology, thinking about it, and practicing with it.

In the first half of last year, at my previous company, my manager asked me one day to study search engines (the company had an SEO business). Searching online, I came across Nutch, an open source search engine project under Apache, and I was hooked: I gathered material from the Internet, set up the environment (on Linux only), and eventually got it running. I learned a little about search engines along the way, but what surprised me most was that Nutch turned out to be the origin of the Hadoop project, its trigger and its parent project. I originally wanted to write separate articles on Nutch and Hadoop, but that would take a lot of personal time, the Nutch environment is not easy to set up, and my focus is on Hadoop after all. So this article covers the background of Nutch and Hadoop and gives a brief introduction to both.

Hadoop was created by Doug Cutting, the founder of Apache Lucene, which is a widely used text search library. Hadoop originated in Nutch, an open source web search engine that is itself part of the Lucene project.

1. Hadoop background

The Nutch project began in 2002, and a working web crawler and search system quickly emerged. However, the developers realized that this architecture would not scale to searching billions of web pages.

In 2003, Google published a paper on its distributed file system, the Google File System (GFS). GFS, or a similar architecture, would meet the Nutch developers' need to store the very large files generated during web crawling and indexing. In particular, it would save much of the time spent on system administration.

In 2004, they began writing an open source implementation, the Nutch Distributed File System (NDFS).

Also in 2004, Google published a paper that introduced its MapReduce system to the world.

In early 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run on MapReduce and NDFS.

The NDFS and MapReduce implementations in Nutch turned out to be applicable beyond the search field. In February 2006, the developers moved NDFS and MapReduce out of Nutch to form a Lucene subproject called Hadoop. At about the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and resources.

In January 2008, Hadoop became a top-level Apache project, confirming its success, diversity, and sustainability.

2. Apache Hadoop and the Hadoop ecosystem

Although Hadoop is best known for MapReduce and its distributed file system, HDFS, the name Hadoop is also used for a family of related projects, as follows:

2.1. Common

A set of components and interfaces for distributed file systems and general I/O (serialization, Java RPC, and persistent data structures).

2.2. Avro

A serialization system for efficient, cross-language RPC and persistent data storage.

2.3. MapReduce

A distributed data processing model and execution environment that runs on large clusters of commodity machines.
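To make the processing model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word count as the example. It is not Hadoop's API: the names `map_word_count`, `reduce_word_count`, and `run_job` are made up for illustration, and a real job would distribute these steps across a cluster.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_word_count(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_word_count(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def run_job(
    lines: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run map, shuffle, and reduce in one process (hypothetical helper)."""
    # Shuffle step: group every mapped value under its key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Each key is reduced independently; keys are sorted only for stable output.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    pages = ["nutch crawls the web", "hadoop stores the web"]
    print(run_job(pages, map_word_count, reduce_word_count))
```

Because each line is mapped independently and each key is reduced independently, both steps can in principle be spread over many machines, which is the property the model relies on.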

2.4. HDFS

A distributed file system that runs on large clusters of commodity machines.

2.5. Pig

A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.

2.6. Hive

A distributed data warehouse. Hive manages data stored in HDFS and provides a SQL-based query language (which the runtime engine translates into MapReduce jobs) for querying the data.

2.7. HBase

A distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (random reads).
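The difference between a point query and a batch-style read can be sketched with a toy in-memory table. This is only an illustration under simplifying assumptions, not the HBase client API: the class `ToyColumnTable`, its methods, and the sample row keys and column names are all hypothetical.

```python
from bisect import bisect_left, insort
from typing import Dict, Iterator, List, Optional, Tuple


class ToyColumnTable:
    """In-memory stand-in for a table keyed by row, then by 'family:qualifier'."""

    def __init__(self) -> None:
        self._row_keys: List[str] = []              # kept sorted for range scans
        self._rows: Dict[str, Dict[str, str]] = {}  # row key -> {column: value}

    def put(self, row: str, column: str, value: str) -> None:
        """Store one cell, creating the row if it does not exist yet."""
        if row not in self._rows:
            insort(self._row_keys, row)
            self._rows[row] = {}
        self._rows[row][column] = value

    def get(self, row: str, column: str) -> Optional[str]:
        """Point query: look up one cell by row key and column name."""
        return self._rows.get(row, {}).get(column)

    def scan(self, start: str, stop: str) -> Iterator[Tuple[str, Dict[str, str]]]:
        """Batch-style read: every row with start <= key < stop, in key order."""
        i = bisect_left(self._row_keys, start)
        while i < len(self._row_keys) and self._row_keys[i] < stop:
            key = self._row_keys[i]
            yield key, self._rows[key]
            i += 1


if __name__ == "__main__":
    table = ToyColumnTable()
    table.put("org.example/b", "content:html", "<html>B</html>")
    table.put("org.example/a", "content:html", "<html>A</html>")
    table.put("org.example/a", "meta:lang", "en")
    print(table.get("org.example/a", "meta:lang"))
    print([key for key, _ in table.scan("org.example/", "org.example0")])
```

Keeping row keys sorted is what makes both access patterns cheap in this sketch: a single lookup serves the point query, and a contiguous key range serves the batch read.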

2.8. ZooKeeper

A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used to build distributed applications.

2.9. Sqoop

A tool for efficiently transferring data between relational databases and HDFS.

3. Follow-up

Although I first encountered Hadoop last year and studied it a little through online videos and other materials, I never learned it well or used it in practice. This year the company is preparing to design and develop big data monitoring, and it bought us developers a few books on Hadoop. I only skimmed them, and in the end I bought "Hadoop: The Definitive Guide" myself. I think it is the better book, and I have finished reading it. To consolidate what I have learned about Hadoop and make it easy to look up later, I plan to summarize and share the more important points in follow-up articles. Stay tuned!


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
