Hadoop Knowledge System Full Notes (not finished)

Source: Internet
Author: User

Data flow

A MapReduce job is the unit of work that a client submits: it consists of the input data, the MapReduce program, and configuration information. Hadoop divides the input data into fixed-size pieces called input splits, creates one map task for each split, and each map task runs the user-defined map function over every record in its split.

Split size matters. If the splits are too small, the overhead of managing the splits and of creating the map tasks begins to dominate the total job execution time. For most jobs, a good split size is the size of an HDFS block, which is 64 MB by default (and can be changed in the configuration files).
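As a concrete illustration (a sketch of mine, not taken from the note), the split size can be influenced from the job driver through FileInputFormat; the 128 MB value below is an assumed block size for newer Hadoop releases:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  // Minimal driver that only sets input/output paths and asks for splits of at
  // least one block, so the default (identity) mapper and reducer are used.
  public class SplitSizeExample {
      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "split-size-demo");
          job.setJarByClass(SplitSizeExample.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          // Do not create splits smaller than 128 MB (assumed block size).
          FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }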

Hadoop performs best when the map task runs on the node where its input data is stored. That is why the optimal split size is the block size: a block is the largest amount of input that is guaranteed to be stored on a single node. If a split spanned two blocks, it would be unlikely that any HDFS node stored both of them, so part of the split would have to be transferred across the network to the node running the map task, which is clearly less efficient than running the whole map task against local data.

Reduce tasks do not have the advantage of data locality: the input to a single reduce task is normally the output of all the mappers. The sorted map output therefore has to be transferred across the network to the node where the reduce task is running, where it is merged and then passed to the user-defined reduce function. When there are multiple reduce tasks, the data flow between map and reduce tasks is known as the "shuffle", because each reduce task is fed by many map tasks.
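The assignment of map output records to reduce tasks during the shuffle is decided by a partitioner. As an illustration (my own sketch, not from the note), the default behaviour is essentially a hash of the key modulo the number of reduce tasks:

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  // Roughly what the default HashPartitioner does: every record with the same
  // key is sent to the same reduce task, so each reducer sees complete groups.
  public class WordPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numReduceTasks) {
          return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
  }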

Hadoop Streaming

Streaming is suited to text processing; used in text mode, it has a line-oriented view of the data. Map input is passed to the map program over standard input, one line at a time, and the program writes its results, line by line, to standard output. By contrast, the Java framework calls the Mapper's map() method once for every record it reads; a Streaming map program decides for itself how to consume its input stream, so it can easily read and process several lines at a time. A Java Mapper is record-oriented, but it can still work with multiple lines by accumulating the previous lines in instance variables of the Mapper (the same technique is available in other languages).
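A minimal sketch of that instance-variable technique in Java (the class name, field, and record-boundary convention are my own assumptions, not from the note):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Buffers lines in an instance variable so that a logical record spanning
  // several input lines can be emitted as a single unit.
  public class MultiLineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

      private final StringBuilder buffer = new StringBuilder();

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          String line = value.toString();
          if (line.isEmpty()) {                  // assume a blank line ends a multi-line record
              flush(context);
          } else {
              buffer.append(line).append('\n');  // keep accumulating the current record
          }
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
          flush(context);                        // emit whatever is left at the end of the split
      }

      private void flush(Context context) throws IOException, InterruptedException {
          if (buffer.length() > 0) {
              context.write(new Text(buffer.toString()), NullWritable.get());
              buffer.setLength(0);
          }
      }
  }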

Design of HDFS

  HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

  Streaming data access: a write-once, read-many-times pattern is the most efficient. A dataset is typically generated by, or copied from, a data source, and various analyses are then performed on it over time.

  Low-latency data access: applications that require low-latency access to data, in the tens-of-milliseconds range, do not work well with HDFS. HDFS is optimized for high data throughput, which may come at the expense of latency. (For low-latency access, consider HBase.)

  A large number of small files: the namenode stores the file system metadata in memory, so the number of files is limited by the amount of namenode memory. Each file, directory, and block takes roughly 150 bytes, so one million files, each occupying one block, amounts to about two million objects and needs at least 300 MB of memory.

  Multiple writers and arbitrary file modifications: a file in HDFS has only a single writer at a time.

HDFS blocks are larger than disk blocks in order to minimize the cost of seeks. If a block is large enough, the time spent transferring the data from the disk can be made significantly longer than the time spent seeking to the start of the block. The time to transfer a large file made up of multiple blocks therefore depends on the disk transfer rate.
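A quick back-of-the-envelope check (the figures are assumptions for a typical disk, not from the note): with a seek time of about 10 ms and a transfer rate of about 100 MB/s, keeping seek time down to roughly 1% of transfer time requires about one second of transfer per block:

  10 ms / 0.01 = 1 s of transfer time
  1 s x 100 MB/s = 100 MB per block

which is why default block sizes in the 64 MB to 128 MB range are common.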

File Read and Write

Reading data from HDFS

The client opens the file it wishes to read by calling open() on a FileSystem object, which for HDFS is an instance of DistributedFileSystem.

(1) DistributedFileSystem calls the namenode over RPC to determine the locations of the blocks at the start of the file. For each block, the namenode returns the addresses of the datanodes that hold a copy of that block, and these datanodes are sorted by their distance from the client; if the client is itself a datanode, it reads from the local datanode. (DistributedFileSystem returns an FSDataInputStream to the client, which in turn wraps a DFSInputStream object.)

(2) DFSInputStream, which has stored the datanode addresses for the first few blocks of the file, connects to the closest datanode holding the first block. Data is streamed from that datanode back to the client, which calls read() on the stream repeatedly. When the end of the block is reached, DFSInputStream closes the connection to that datanode and then finds the best datanode for the next block.

(3) As the client continues reading from the stream, blocks are read in order, with DFSInputStream opening new connections to datanodes as needed. It also calls the namenode to retrieve the datanode locations for the next batch of blocks when required. When the client has finished reading, it calls close() on the FSDataInputStream.

The key point of this design is that the client contacts datanodes directly to retrieve the data and is guided by the namenode to the best datanode for each block. Because the data traffic is spread across all the datanodes in the cluster, this design lets HDFS scale to a large number of concurrent clients. The namenode, meanwhile, only has to service block-location requests, which it answers from data held in memory, making them very efficient.
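To make the read path concrete, here is a minimal client-side sketch (an illustration of mine, not from the note) that opens a file through the FileSystem API and copies it to standard output; the DFSInputStream work described above happens behind fs.open() and read():

  import java.io.InputStream;
  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  // Reads a file from HDFS and writes it to standard output.
  public class FileSystemCat {
      public static void main(String[] args) throws Exception {
          String uri = args[0];  // an hdfs:// path supplied on the command line
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(URI.create(uri), conf);  // DistributedFileSystem for hdfs:// URIs
          InputStream in = null;
          try {
              in = fs.open(new Path(uri));                        // FSDataInputStream wrapping a DFSInputStream
              IOUtils.copyBytes(in, System.out, 4096, false);     // repeated read() calls stream the blocks
          } finally {
              IOUtils.closeStream(in);                            // close the stream when finished
          }
      }
  }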

File Write

  
