Some preliminary concepts of Hadoop and HDFS


After successfully installing Hadoop, my grasp of many of its concepts was still superficial, so these notes record an initial understanding based on the online documentation and Hadoop: The Definitive Guide.


1. What issues does Hadoop solve?

Store and analyze large amounts of data.

In this scenario, HDFS provides the storage and MapReduce provides the analysis, supplemented by other components.

Questions to consider in your design:

Reading and writing big data is constrained by disk I/O throughput, which leads to a parallel processing model.

Hardware can fail, so data must be replicated across multiple machines to ensure reliability.

Data transmission must also be reliable.


The system should also support ad hoc querying and processing of the data.


2. Comparison with relational databases

Relational databases: good at processing structured data. They generally use B-tree structures, which make point queries, low-latency indexed access, and small or continuous updates convenient. For updating a large proportion of a dataset, however, they are less efficient than MapReduce, because the B-tree must be rebuilt with sort/merge operations.

MapReduce: good at analyzing large volumes of semi-structured and unstructured data. It emphasizes high-volume data ingestion and batch processing of the entire dataset, rather than low-latency access; it follows a write-once, read-many model (a simple coherency model); and it scales out linearly on commodity hardware.


The reason a relational database cannot read petabytes of data in one pass comes down to disk limitations:

Seek time depends on physically moving the disk head to the right place on the platter;

Transfer rate depends on the disk's bandwidth;

and seek time is improving much more slowly than transfer rate, so streaming through an entire dataset (as MapReduce does) beats seeking to many individual records.


Structured data: entities that conform to a predefined format or schema, such as database tables.

Semi-structured data: table-like, but each record may define its own structure.

Unstructured data: no internal structure, such as plain text or images.


Relational databases suit structured data that is normalized: complete and free of redundancy.

MapReduce is better suited to semi-structured and unstructured data.


3. MapReduce related terms

Key/value pairs: with the default text input, the key is the byte offset of the line within the file and the value is the line's text.

CLASSPATH: an environment variable giving the path to the application classes that are added to the job.

Map function: implements the map task (a minimal sketch appears after this list of terms).

Reduce function: implements the reduce task.

Main function: the program entry point, which configures and submits the job.

JobTracker: schedules the tasks that run on TaskTrackers and coordinates all jobs running on the system.

TaskTracker: runs tasks and reports their progress and status back to the JobTracker.


Input split: the input data is divided into fixed-size pieces, and one map task is built for each split.

Data locality optimization: run each map task on the node where its HDFS data resides, for the best performance.

Partition: with multiple reduce tasks, the map output is divided into partitions, one per reduce task.

Shuffle: the data flow between the map and reduce tasks; each reducer may receive output from every mapper, a many-to-many relationship.

Combiner function: pre-aggregates map output and feeds its result to the reducers, cutting the amount of data transferred and saving network bandwidth.
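
To make these terms concrete, here is a minimal word-count sketch against the Hadoop Java MapReduce API; the class names and argument conventions are illustrative, not taken from these notes. The map function receives (line offset, line text) pairs and emits (word, 1); the reduce function sums the counts; the same reducer class is reused as a combiner; the main function configures and submits the job, whose input is divided into splits.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map function: key is the byte offset of the line, value is the line text.
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit (word, 1)
          }
        }
      }

      // Reduce function: sums the counts for each word; also reused as the combiner.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      // Main function: program entry point; configures and submits the job.
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // combiner: pre-aggregates map output to save bandwidth
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path; divided into splits
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The job would be packaged into a jar and launched with the hadoop jar command, passing the input and output paths as arguments.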


4. The meaning of "write once, read many".

Each dataset is copied from its source onto the cluster's disks once, that is, written once; afterwards, over a long period, analyses only read that local data, that is, it is read many times.


5. In what scenarios is HDFS not a good fit?

1) Low-latency scenarios

HDFS delivers high data throughput at the cost of potentially high latency.

When latencies of only tens of milliseconds are required, HBase is the better choice.

2) Large numbers of small files

HDFS is designed for large files and large blocks. With very large numbers of small files, say hundreds of millions, the NameNode's capacity to manage the namespace becomes the bottleneck.

3) Multiple writers and arbitrary file modifications

When more than one user writes to a file, or a file is modified at arbitrary positions, conflicts arise and efficiency suffers; HDFS does not support these access patterns well.


6. How do ordinary NFS and HDFS compare?

1. NFS advantages: transparency and programming convenience; remote files are accessed through ordinary library calls such as open, close, and fread.

2. NFS drawbacks: no data redundancy, since all data sits on a single machine, and copying data across the network may be limited by bandwidth.

HDFS is designed to overcome these drawbacks of NFS. Storage is reliable, reads are convenient, and it integrates with MapReduce. It scales well and is highly configurable (through a set of configuration files). It offers a web interface for browsing the file system at http://namenode-name:50070/, and it supports operation through a shell interface.


Data block: 64 MB by default. The advantage of large blocks: they minimize seek (addressing) overhead relative to transfer time and simplify the system design.

NameNode: the manager (master).

DataNode: the workers.

Client: interacts with the NameNode and DataNodes on the user's behalf, hiding those interactions from the user (a short read example follows below).

What fault-tolerance mechanisms does the NameNode have? Persist the namespace metadata to disk, or mirror the NameNode's metadata with a secondary NameNode.
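
As a rough sketch of the client abstraction (the HDFS URI passed on the command line is a placeholder for a real path such as hdfs://namenode/user/hadoop/file.txt), reading a file only requires the FileSystem API; lookups against the NameNode and streaming from DataNodes happen behind the scenes:

    import java.io.InputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class FileSystemCat {
      public static void main(String[] args) throws Exception {
        String uri = args[0];                                    // e.g. hdfs://namenode/user/hadoop/file.txt
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        InputStream in = null;
        try {
          in = fs.open(new Path(uri));                       // block locations come from the NameNode
          IOUtils.copyBytes(in, System.out, 4096, false);    // bytes are streamed from the DataNodes
        } finally {
          IOUtils.closeStream(in);
        }
      }
    }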


HDFS Architecture


How does the command-line interface work?

HDFS clients and daemons determine the host and port of the NameNode from the fs.default.name property (an HDFS URI), so a client knows where the NameNode is running and can connect to it.
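
A minimal sketch of that resolution in Java (the URI below is a placeholder; in Hadoop 2 and later the property is called fs.defaultFS, and it would normally be set in core-site.xml rather than in code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ShowFsUri {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();                   // reads core-site.xml from the classpath
        conf.set("fs.default.name", "hdfs://namenode-host:8020");   // placeholder NameNode URI
        FileSystem fs = FileSystem.get(conf);                       // the client now knows where the NameNode is
        System.out.println(fs.getUri());                            // prints hdfs://namenode-host:8020
      }
    }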



Shell interface operations:

Detailed Description: http://hadoop.apache.org/docs/r1.0.4/file_system_shell.html

Example

hadoop fs -command <parameter list>

Commands include:

-ls

-copyFromLocal

-mkdir


Files and directories carry three permissions: r (read), w (write), and x (execute).
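
These shell operations have direct equivalents in the Java FileSystem API. A minimal sketch (all paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class FsShellEquivalents {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path dir = new Path("/user/hadoop/demo");
        fs.mkdirs(dir);                                            // hadoop fs -mkdir
        fs.copyFromLocalFile(new Path("localfile.txt"),            // hadoop fs -copyFromLocal
                             new Path(dir, "localfile.txt"));
        fs.setPermission(dir, new FsPermission((short) 0755));     // set the r/w/x permission bits

        for (FileStatus status : fs.listStatus(dir)) {             // hadoop fs -ls
          System.out.println(status.getPath() + "\t" + status.getLen());
        }
      }
    }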


Classification of Hadoop file systems:

Local (file), HDFS (hdfs), HFTP (hftp), HSFTP (hsftp), HAR (har), KFS/CloudStore (kfs), FTP (ftp), S3 native (s3n), and S3 block-based (s3)

--------------

7. Checkpoint Node

Periodically creates a checkpoint of the namespace: it downloads the fsimage and the edits log (the edits recorded since the last checkpoint) from the NameNode and merges them.

Its addresses are configured with dfs.backup.address and dfs.backup.http.address.

In addition, the checkpoint interval and the edits-size threshold can be configured:

fs.checkpoint.period, set to 1 hour by default.

fs.checkpoint.size, set to 64 MB by default: the maximum size of the edits log. Once the log reaches this size, a checkpoint is forced even if the period has not yet elapsed.


The Checkpoint node stores its checkpoint in a directory laid out the same way as the NameNode's, so the checkpointed image is always ready to be read back by the NameNode.


8. Backup Node

Provides the same functionality as the Checkpoint node, except that it does not need to download fsimage and edits from the NameNode, because it maintains an in-memory copy of the namespace that is kept in sync with the NameNode.

Importing a checkpoint into a NameNode has its own command and procedure.


9. Rebalancer (load balancer)

http://hadoop.apache.org/docs/r1.0.4/commands_manual.html#balancer

--------------

HDFS provides a range of interfaces to support the following features:

Reading, writing, querying, deleting, and archiving data, while ensuring consistency.


How is a Hadoop cluster kept balanced?

By running the balancer tool (the hadoop balancer command), which redistributes blocks so that they are spread evenly across the cluster.


10. Hadoop I/O

Data integrity: data is verified with checksums, using CRC-32 (4 bytes per checksum), at a storage cost of well under 1%. DataNodes verify data when it is received and replicated, and they can repair corruption by copying a fresh replica of the data.
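
As a minimal illustration of why the overhead is small (the input string is a placeholder; Hadoop actually checksums fixed-size chunks of each block, 512 bytes per checksum by default in this era), a CRC-32 value is only 4 bytes no matter how much data it covers:

    import java.util.zip.CRC32;

    public class Crc32Demo {
      public static void main(String[] args) {
        byte[] data = "some block of data".getBytes();   // placeholder payload
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        System.out.printf("checksum = 0x%08x (4 bytes, covering %d bytes of data)%n",
            crc.getValue(), data.length);
      }
    }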

Compression: reduces storage space and speeds up data transfer across the network and to and from disk. Common codecs include bzip2, gzip, and LZO: bzip2 has the advantage in space (best compression ratio), LZO has the advantage in time (fastest), and gzip sits in between.

Space: bzip2 > gzip > LZO; speed: LZO > gzip > bzip2.
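
A minimal sketch of using Hadoop's codec API to compress a local file (the file names are placeholders; the gzip codec is used here only because it needs no extra native libraries):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class GzipFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        FileInputStream in = new FileInputStream("input.txt");               // placeholder input
        CompressionOutputStream out =
            codec.createOutputStream(new FileOutputStream("input.txt.gz"));  // compressed output
        IOUtils.copyBytes(in, out, 4096, false);   // stream the bytes through the codec
        out.finish();                              // flush any remaining compressed data
        out.close();
        in.close();
      }
    }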


Serialization: Converts a structured object to a byte stream for easy network transmission or permanent disk storage.

Deserialization: the reverse process, converting a byte stream back into a structured object.

RPC (remote procedure call): the mechanism nodes in a Hadoop cluster use to communicate with one another; the messages are encoded and decoded through serialization and deserialization.
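
Hadoop's own serialization format is Writable. A minimal round-trip sketch with IntWritable (the value 163 is arbitrary):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;

    import org.apache.hadoop.io.IntWritable;

    public class WritableRoundTrip {
      public static void main(String[] args) throws Exception {
        IntWritable original = new IntWritable(163);

        // Serialization: structured object -> byte stream
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(bytes);
        original.write(dataOut);
        dataOut.close();

        // Deserialization: byte stream -> structured object
        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(
            new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(restored.get());   // prints 163
      }
    }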


Avro

Avro is a data serialization system that is independent of any programming language; detailed documentation is available at avro.apache.org. It was designed to address the lack of language portability in Hadoop's own Writable serialization.

Avro describes data with rich schemas and supports schema resolution when data is read.
