Hadoop origins
Involves: Lucene, Solr, Nutch, Hadoop
@author Ayy
@date 2014/12/21
1. Lucene, Nutch, SOLR
Lucene is a Java-based full-text search toolkit created by Doug Cutting, and one of the most famous projects at Apache. Its main functions are index processing, spell checking, text analysis, and tokenization.
(Lucene was originally published on a personal site and on SourceForge; in 2001 it became a subproject of Jakarta, an Apache project.)
SOLR and Nutch were also sub-projects under Lucene, but Nutch later became a separate project. Nutch's main function is crawling data and extracting relevant information; it can be thought of as a miniature version of Google.
SOLR is a Lucene-based server-side program that exposes search over HTTP APIs (returning formats such as XML and JSON, with client support for languages like Python) and provides management-related functions.
2. Hadoop was a sub-project under Nutch for distributed processing, and later became a project under Apache in its own right. Nutch could then use Hadoop for distributed data crawling and processing.
(In 2003-2004 Google published the implementation details of GFS and MapReduce, and Doug Cutting spent two years of spare time implementing the DFS and MapReduce mechanisms, which made Nutch's performance soar.
In 2005, Hadoop was formally introduced as an Apache sub-project, as part of Nutch under the Apache Lucene project.)
3. Hadoop Sub-project family
The bottom layer is Hadoop's core code, which implements the two pillars of Hadoop, MapReduce and HDFS; the other sub-projects are built on top of this core.
Pig is for non-Java programmers: a lightweight language with commands for loading, processing, and analyzing data. The system converts Pig scripts into MapReduce programs.
Hive is essentially a mapper from SQL to MapReduce. For database engineers, Hive transforms SQL into distributed MapReduce tasks, so Hive can be understood as a distributed database that supports roughly a subset of the SQL-92 standard.
HBase is a NoSQL, column-oriented storage database. Benefits: faster response times and reduced I/O. It is also a popular technology at present.
ZooKeeper is a coordination tool.
Chukwa is a data collector that can gather the data generated by servers.
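To make the Hive idea concrete, here is a toy single-process sketch (not Hive's actual planner, and the row data is made up) of how a query like `SELECT dept, COUNT(*) FROM emp GROUP BY dept` decomposes into a map phase and a reduce phase:

```python
from collections import defaultdict

# Hypothetical input rows standing in for a Hive table.
rows = [
    {"name": "a", "dept": "sales"},
    {"name": "b", "dept": "eng"},
    {"name": "c", "dept": "sales"},
]

def map_phase(row):
    # Emit (group key, 1) for each input row.
    yield (row["dept"], 1)

def shuffle(pairs):
    # Group intermediate values by key, as the framework would between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # COUNT(*) is just the sum of the emitted 1s.
    return key, sum(values)

pairs = [p for row in rows for p in map_phase(row)]
result = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(result)  # {'sales': 2, 'eng': 1}
```

Hive's real execution adds parsing, optimization, and cluster scheduling, but the GROUP BY-to-shuffle correspondence is the core of the translation.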
4. Hadoop architecture
Hdfs
The NameNode records how files are split into blocks and which nodes hold those blocks. Drawback: single point of failure.
Data read process: the client asks the NameNode for the block locations of a file, then reads the blocks directly from the DataNodes.
Data write process: the client asks the NameNode to allocate blocks, then writes each block to a pipeline of DataNodes, which replicate it.
The SecondaryNameNode communicates with the NameNode and periodically saves (checkpoints) the HDFS metadata. But at present, when the NameNode fails, the SecondaryNameNode must be brought in manually.
The DataNode writes HDFS data blocks to the file system of its local machine.
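The NameNode's bookkeeping can be pictured as two in-memory mappings: file path to ordered block list, and block to the DataNodes holding its replicas. A minimal sketch, with names, the round-robin placement, and sizes all assumptions rather than HDFS internals:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # classic HDFS default block size: 64 MB

file_to_blocks = {}   # e.g. "/logs/a.log" -> ["blk_/logs/a.log_0", ...]
block_locations = {}  # e.g. "blk_/logs/a.log_0" -> ["dn1", "dn2", "dn3"]

def add_file(path, size_bytes, datanodes, replication=3):
    """Split a file into fixed-size blocks and record replica locations."""
    n_blocks = max(1, -(-size_bytes // BLOCK_SIZE))  # ceiling division
    blocks = [f"blk_{path}_{i}" for i in range(n_blocks)]
    file_to_blocks[path] = blocks
    for i, blk in enumerate(blocks):
        # Round-robin placement stands in for HDFS's real placement policy.
        block_locations[blk] = [
            datanodes[(i + r) % len(datanodes)] for r in range(replication)
        ]

# A 150 MB file at a 64 MB block size splits into 3 blocks.
add_file("/logs/a.log", 150 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
print(file_to_blocks["/logs/a.log"])
```

Losing this metadata makes the blocks on the DataNodes unrecoverable, which is why the NameNode is the single point of failure noted above.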
Mapreduce
The JobTracker determines which files are involved in processing, splits the job into tasks, and assigns them to the appropriate nodes. Hadoop assigns each task to the node holding its data (data locality). The JobTracker also monitors task progress, reports to the user, and restarts failed tasks.
The TaskTracker interacts with the JobTracker and executes the assigned tasks.
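The flow that the JobTracker and TaskTrackers coordinate across a cluster can be sketched in a single process with the classic word count, assuming one map task per input split (illustrative only; real Hadoop runs these tasks on separate nodes):

```python
from collections import defaultdict

def run_job(splits):
    # "Map tasks": one per input split, each emitting (word, 1) pairs.
    intermediate = []
    for split in splits:
        for word in split.split():
            intermediate.append((word, 1))

    # "Shuffle": group the intermediate pairs by key.
    grouped = defaultdict(list)
    for word, count in intermediate:
        grouped[word].append(count)

    # "Reduce tasks": sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

counts = run_job(["hadoop nutch hadoop", "lucene hadoop"])
print(counts)  # {'hadoop': 3, 'nutch': 1, 'lucene': 1}
```

In a real cluster the JobTracker would hand each split to a TaskTracker near the data, and the shuffle would move intermediate pairs over the network between map and reduce nodes.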
A first look at Hadoop