What is Hadoop?
Hadoop began as a subproject of Apache Lucene; it was split out of the Nutch project to focus on distributed storage and distributed computing. Put simply, Hadoop is a software platform that makes it easier to develop and run applications that process large-scale data: you can write distributed programs without understanding the low-level details of distribution, while still making full use of a cluster's capacity for high-speed computation and storage.
On the Internet, Hadoop is most popular as a tool for classifying search keywords, but it can also solve many other problems that demand great scalability. For example, what happens if you need to grep a giant 10 TB file? On a traditional system this takes a very long time, but Hadoop is designed with such problems in mind and uses parallel execution to improve efficiency dramatically.
Hadoop is a software framework for processing large amounts of data in a distributed manner, and it does so reliably, efficiently, and at scale. It is reliable because it assumes that compute elements and storage will fail, so it keeps multiple copies of the working data and can redistribute work around failed nodes. It is efficient because it works in parallel, speeding up processing through parallelism. It is also scalable and can handle petabytes of data. In addition, Hadoop runs on commodity servers, so its cost is relatively low and anyone can use it.
The main features of Hadoop are listed below:
- Scalable: reliably stores and processes petabytes (PB) of data.
- Low cost: data is distributed and processed across server clusters built from ordinary machines; these clusters can grow to thousands of nodes.
- Efficient: by distributing the data, Hadoop can process it in parallel on the nodes where it resides, which makes processing very fast.
- Reliable: Hadoop automatically maintains multiple copies of the data and automatically redeploys computing tasks when a task fails.
Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and suits applications with large data sets. HDFS relaxes some POSIX requirements so that file system data can be accessed as a stream.
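To get a concrete feel for this streaming access, here is a minimal sketch that reads a file out of HDFS through the standard org.apache.hadoop.fs API. The class name and the path passed on the command line are placeholders, and the cluster settings are assumed to come from the usual configuration files.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // the abstract file system from org.apache.hadoop.fs
        Path path = new Path(args[0]);                   // e.g. an HDFS path such as /user/demo/input.txt (placeholder)

        // Streaming access: read the file as a plain byte stream and copy it to stdout.
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.write(buf, 0, n);
            }
            System.out.flush();
        }
    }
}
```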
Hadoop also implements the MapReduce distributed computing model. MapReduce splits the application's work into many small fragments of work. For reliability, HDFS creates multiple replicas of each data block and places them on the compute nodes of the cluster, so MapReduce can process the data on the nodes where it is stored.
The Hadoop APIs are divided into the following major packages:
- org.apache.hadoop.conf: defines the configuration-file handling API for system parameters (see the sketch after this list).
- org.apache.hadoop.fs: defines the abstract file system API.
- org.apache.hadoop.dfs: the implementation of the Hadoop Distributed File System (HDFS) module.
- org.apache.hadoop.io: defines general I/O APIs for reading and writing data objects such as network, database, and file data.
- org.apache.hadoop.ipc: tools for network servers and clients, encapsulating the basic building blocks of asynchronous network I/O.
- org.apache.hadoop.mapred: the implementation of the Hadoop distributed computing (MapReduce) module, including task distribution and scheduling.
- org.apache.hadoop.metrics: defines an API for performance statistics, used mainly by the mapred and dfs modules.
- org.apache.hadoop.record: defines I/O API classes for records and a record description language translator, making it easier to serialize records in a language-neutral manner.
- org.apache.hadoop.tools: defines some common tools.
- org.apache.hadoop.util: defines some public utility APIs.
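As a tiny illustration of the conf package, the sketch below builds a Configuration, overrides one parameter, and reads it back. The parameter name dfs.replication is a standard HDFS setting; the value shown is purely illustrative.

```java
import org.apache.hadoop.conf.Configuration;

public class ConfDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();      // loads the default resource files on the classpath
        conf.set("dfs.replication", "3");              // override a parameter programmatically (illustrative value)
        System.out.println("replication = " + conf.get("dfs.replication", "not set"));
    }
}
```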
Hadoop Framework Structure
MapReduce is a distributed computing model for large-scale data processing. It was originally designed, implemented, and published by Google engineers. MapReduce is defined as a programming model for processing and generating large data sets: the user defines a map function that processes a key/value pair and produces a batch of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same key. Many real-world tasks can be expressed in this model.
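The shape of the model can be written down directly. The interface below is only a sketch of the two user-supplied functions; it is not a Hadoop API, and the type parameters are illustrative.

```java
import java.util.List;
import java.util.Map.Entry;

// A minimal sketch of the MapReduce programming model (not a Hadoop API):
// the user supplies two functions over generic key/value types.
interface MapReduceModel<K1, V1, K2, V2> {

    // map: one input key/value pair -> a batch of intermediate key/value pairs
    List<Entry<K2, V2>> map(K1 key, V1 value);

    // reduce: one intermediate key plus every value that shares it -> the merged output values
    List<V2> reduce(K2 key, Iterable<V2> values);
}
```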
The MapReduce computation model
Hadoop's MapReduce framework is implemented on the same principle.
The MapReduce workflow is not complicated. Each map worker reads a number of data blocks supplied in key/value form (for example, split0 through split4, at 64 MB per block). For word counting over text files, the key/value input can be defined as <String filename, String file_content>.
Each map then processes its data independently and outputs intermediate results (intermediate files) organized as key/value pairs. In the word count example, map emits a record <word, "1"> for each word in its data, so the output of a single map may contain <word, "1"> repeated n times (a combiner can optimize this by emitting <word, "n"> instead, reducing network traffic).
The shuffle step then gathers all intermediate results that share the same key onto the same node. For example, if the workers of all three map tasks produced <word, "1"> records, shuffle collects every record with that key onto the same reduce worker, and a single reduce processes all records containing that key. In other words, every <word, "1"> produced in the map phase is merged by one reduce worker, which then writes the final output.
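Putting the whole flow together, the sketch below is based on the classic WordCount program for the org.apache.hadoop.mapred API listed earlier: map emits <word, 1>, the reducer sums the counts per word, and the input and output paths come from the command line. Treat it as an illustrative sketch; exact class locations can differ between Hadoop releases.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    // map: for every word in the input split, emit <word, 1>
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    // reduce: sum all the 1s that shuffle delivered for the same word
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);   // the local pre-aggregation mentioned above
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```

Registering the reducer as the combiner is exactly the optimization described above: each map's local <word, "1"> records are pre-summed into <word, "n"> before they cross the network to the reduce workers.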
In short, Hadoop offers a reliable, efficient, and scalable solution for large-scale distributed data processing.