Full-Text Indexing with Lucene, Solr, Nutch, and Hadoop: Lucene
Full-Text Indexing with Lucene, Solr, Nutch, and Hadoop: Solr
Last year I planned to write a detailed series of introductions to Lucene, Solr, Nutch, and Hadoop, but for lack of time I only finished two articles, covering Lucene and Solr. I have kept looking forward to continuing the series, though. Although I have not yet worked on a real-world Nutch or Hadoop project, my company will soon be building a Hadoop-based big-data monitoring system, and I have always believed in being prepared, so since last year I have never stopped studying, thinking about, and practicing Hadoop-related technology.
In the first half of last year, at my previous company, my manager asked me to study search-engine technology (the company had an SEO business). While searching online I stumbled upon Nutch, an open-source search engine project under Apache, and I was soon hooked: I gathered material from the web, set up the environment (Linux only), and finally got it running. I picked up some search-engine knowledge along the way, but what surprised me most was that Nutch turned out to be the origin of the Hadoop project: its spark and its parent project. I originally wanted to write two separate articles on Nutch and Hadoop, but that would take a great deal of personal time, the Nutch environment is not easy to set up, and my focus is on Hadoop after all, so in this article I will cover the background of both Nutch and Hadoop together.
Hadoop was created by Doug Cutting, the founder of Apache Lucene, a widely used text search library. Hadoop originated in Nutch, an open-source web search engine that was itself part of the Lucene project.
1. Hadoop background
The Nutch project began in 2002, and a working web crawler and search system quickly emerged. The developers soon realized, however, that the architecture would not scale to the billions of pages on the web.
In 2003, Google published a paper describing the architecture of the Google File System, known as GFS. GFS, or something like it, would meet their storage needs for the very large files generated during web crawling and indexing; in particular, it would save a great deal of time otherwise spent on system administration.
In 2004, the Nutch developers set about writing an open-source implementation, the Nutch Distributed File System (NDFS).
In 2004, Google published the paper that introduced their MapReduce system to the world.
At the beginning of 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS.
NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search. In February 2006, the developers moved them out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to develop Hadoop.
In January 2008, Hadoop was made a top-level project at Apache, confirming its success, diversity, and vitality.
2. Apache Hadoop and Hadoop ecosystem
Although Hadoop is best known for MapReduce and its distributed file system, HDFS, the name Hadoop is also used collectively for a family of related projects, including the following:
2.1. Common
A set of components and interfaces for distributed file systems and general I/O (serialization, Java RPC, persistent data structures).
2.2. Avro
A serialization system for efficient, cross-language RPC and persistent data storage.
2.3. MapReduce
A distributed data processing model and execution environment that runs on large clusters of commodity machines.
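To make the processing model concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce phases, using the classic word-count problem. This is plain Python for illustration only, not Hadoop's Java API; the function names (map_phase, shuffle, reduce_phase) and the sample documents are my own invention:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: sort and group intermediate pairs by key,
    as the framework does between the map and reduce phases."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (key, [count for _, count in group])

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped}

docs = ["Hadoop grew out of Nutch", "Nutch used NDFS and MapReduce"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["nutch"])  # -> 2
```

In real Hadoop, the map and reduce functions run in parallel across the cluster and the shuffle moves data between machines; the dataflow, however, is the same as in this sketch.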
2.4. HDFS
A distributed file system that runs on large clusters of commodity machines.
2.5. Pig
A data flow language and execution environment for exploring very large datasets. Pig runs on MapReduce and HDFS clusters.
2.6. Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a SQL-based query language (translated by the runtime engine into MapReduce jobs) for querying that data.
2.7. HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computation with MapReduce and point queries (random reads).
2.8. ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used to build distributed applications.
2.9. Sqoop
A tool for efficiently moving data between relational databases and HDFS.
3. Follow-up
Although I was first exposed to Hadoop last year and did some study (online videos and materials), I never learned it thoroughly or used it in practice. This year, with the company preparing to design and build a big-data monitoring system, it bought several Hadoop books for us developers. After browsing them, I bought "Hadoop: The Definitive Guide" myself; I think it is the better book, and I have now finished it. To consolidate my Hadoop knowledge and make it easy to look up later, I plan to share summaries of the more important topics here. Please look forward to it!
Full-Text Indexing with Lucene, Solr, Nutch, and Hadoop: Nutch and Hadoop