After successfully installing Hadoop, my grasp of its concepts was still scattered, so these notes record an initial pass through the online documentation and Hadoop: The Definitive Guide.
1. What issues does Hadoop solve?
Store and analyze large amounts of data.
Division of labor: HDFS handles storage and MapReduce handles analysis, supplemented by other components.
Questions to consider in your design:
Reading and writing big data is limited by disk I/O capacity, which leads to a parallel processing model.
Hardware may fail, so data must be backed up across multiple machines to ensure reliability.
Data transmission must also be made reliable.
The system should also support immediate ad hoc queries and processing.
2. Comparison with relational databases
Relational databases: good at processing structured data. They generally use B-tree structures, which make it easy to update small amounts of data or to update data continuously. For bulk updates of large data volumes they are less efficient than MapReduce, because the B-tree must be rebuilt with sort/merge operations. B-trees favor point queries, low-latency indexed access, and small-scale updates.
MapReduce: good at handling semi-structured and unstructured data and at analyzing large data sets. It emphasizes high-volume data ingestion and batch processing that analyzes the entire data set, rather than low-latency data access. It follows a write-once, read-many pattern (a simple coherency model) and scales out horizontally and linearly.
The reason a relational database cannot read petabytes of data in one pass stems from disk speed limits:
The disk's seek capability depends on positioning the head over the right part of the platter;
the transfer rate depends on the disk's bandwidth;
and seek times are improving much more slowly than transfer rates.
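The gap above can be made concrete with back-of-the-envelope arithmetic. The drive figures below are illustrative assumptions, not measurements:

```python
# Rough, illustrative figures for a single spinning disk (assumptions).
seek_time_s = 0.010          # ~10 ms average seek
transfer_rate_bps = 100e6    # ~100 MB/s sequential transfer

one_tb = 1e12  # bytes

# Reading 1 TB sequentially from one disk:
sequential_hours = one_tb / transfer_rate_bps / 3600
print(f"1 TB sequentially from one disk: ~{sequential_hours:.1f} hours")

# The same 1 TB spread over 100 disks read in parallel (MapReduce's approach):
parallel_minutes = one_tb / (100 * transfer_rate_bps) / 60
print(f"1 TB across 100 disks in parallel: ~{parallel_minutes:.1f} minutes")

# A seek-heavy access pattern (one seek per 64 KB read, as a B-tree update
# workload might do) is dominated by seek time, not transfer time:
reads = one_tb / 65536
seek_days = reads * seek_time_s / 86400
print(f"1 TB via 64 KB random reads: ~{seek_days:.1f} days of seeking alone")
```

This is why batch systems stream data sequentially in parallel instead of seeking.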
Structured data: entity data with a predefined format.
Semi-structured data: table-like, but each table (or record) can have its own format.
Unstructured data: text or images.
Relational databases target structured data that is complete and free of redundancy.
MapReduce is better at handling semi-structured and unstructured data.
3. MapReduce-related terms
Key/value pairs: for text input, the key is typically the byte offset of a line and the value is the line itself.
CLASSPATH: environment variable that adds the application's classes to the classpath.
map function: implements the map task.
reduce function: implements the reduce task.
main function: the program entry point.
JobTracker: schedules the tasks that run on TaskTrackers, coordinating all jobs in the system.
TaskTracker: runs tasks and reports progress and status back to the JobTracker.
Input split: the input is divided into fixed-length chunks, and one map task is built per split.
Data locality optimization: run each map task on the node holding its HDFS data to get the best performance.
Partition: with multiple reduce tasks, the map output is divided into partitions, one per reduce task.
Shuffle: the data flow between map and reduce tasks, a many-to-many relationship.
Combiner: an optional merge function whose output becomes the reduce input; it cuts the amount of data transferred and saves bandwidth.
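A minimal in-memory sketch can tie these terms together, using plain Python in place of the Hadoop Java API (the function names here are illustrative, not Hadoop's):

```python
from collections import defaultdict

# map function: (offset, line) -> [(word, 1), ...]
def map_fn(offset, line):
    return [(word, 1) for word in line.split()]

# combiner: same shape as reduce, run locally on each map task's output
def combine_fn(word, counts):
    return (word, sum(counts))

# shuffle: group intermediate pairs by key (many-to-many between tasks)
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# reduce function: (word, [counts]) -> (word, total)
def reduce_fn(word, counts):
    return (word, sum(counts))

lines = ["hadoop stores data", "hadoop analyzes data"]
mapped = []
for offset, line in enumerate(lines):
    # apply the combiner per "map task" (here, per line) to shrink the shuffle
    local = shuffle(map_fn(offset, line))
    mapped.extend(combine_fn(w, c) for w, c in local.items())

result = dict(reduce_fn(w, c) for w, c in shuffle(mapped).items())
print(result)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'analyzes': 1}
```

The combiner halves the pairs shuffled for the repeated words, which is exactly the bandwidth saving described above.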
4. The meaning of "write once, read many"
Data is copied from the source onto the cluster's disks once, i.e., written once; afterwards, analyses read that local data over a long period, i.e., read many times.
5. In what scenarios is HDFS not a good fit?
1) Low-latency scenarios
HDFS achieves high data throughput at the cost of potentially high latency.
HBase behaves better when requests must be answered within tens of milliseconds.
2) Large numbers of small files
HDFS is suited to relatively large data blocks. With very many small files, such as hundreds of millions of them, the NameNode's management capacity (it holds all namespace metadata in memory) becomes insufficient.
3) Multiple writers or arbitrary file modifications
When more than one user writes to a file, or a file is modified at arbitrary offsets, conflicts arise easily and efficiency drops.
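The small-files limit can be estimated with rough arithmetic. The figure of ~150 bytes per namespace object is a commonly cited rule of thumb, not an exact number:

```python
bytes_per_object = 150        # rough rule of thumb per file/directory/block entry
small_files = 300_000_000     # "hundreds of millions" of small files

# Each small file costs at least one file object plus one block object in
# the NameNode's memory, since every file occupies its own block.
namenode_ram_gb = small_files * 2 * bytes_per_object / 1e9
print(f"NameNode heap needed: ~{namenode_ram_gb:.0f} GB")  # ~90 GB
```

A heap that size is impractical on one machine, which is why many small files overwhelm the NameNode while the same bytes in a few large files do not.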
6. How do a common NFS and HDFS compare?
1. NFS advantages: transparency and programming convenience; applications only need library calls such as open, close, and fread.
2. NFS drawbacks: no data redundancy, since all data sits on a single machine; replicating the data may run into bandwidth limits.
HDFS is designed to overcome these drawbacks: storage is reliable, reads are easy, and it integrates with MapReduce. It is highly scalable and highly configurable (a set of configuration files), supports a web interface (http://namenode-name:50070/ to browse the file system), and supports shell-based operation.
Data block: 64 MB by default. What are the advantages? It minimizes addressing (seek) overhead and simplifies the system design.
NameNode: the manager.
DataNode: the workers.
Client: interacts with the NameNode and DataNodes, hiding those interactions from the user.
What fault-tolerance mechanisms does the NameNode have? Persist the namespace to disk, or mirror the NameNode (a secondary NameNode).
HDFS Architecture
How the command-line interface works:
The HDFS client determines the NameNode's host and port from fs.default.name (an HDFS URI), so that it knows where the NameNode is running and can connect to it.
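For example, fs.default.name is typically set in core-site.xml; the host and port below are placeholders, not a recommended value:

```xml
<!-- core-site.xml: tells clients and daemons where the NameNode runs -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```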
Shell interface operations:
Details: http://hadoop.apache.org/docs/r1.0.4/file_system_shell.html
General form:
hadoop fs -command <parameter list>
Common commands:
-ls
-copyFromLocal
-mkdir
Files carry r, w, and x permissions.
Hadoop file system implementations include:
Local (file), HDFS (hdfs), HFTP (hftp), HSFTP (hsftp), Archive (har), KFS/CloudStore (kfs), FTP (ftp), S3 native (s3n), and S3 block-based (s3).
--------------
7. Checkpoint Node
It periodically checkpoints the namespace: it downloads the fsimage file and the edits log (edits made since the last checkpoint) from the NameNode and merges them.
Its interfaces are configured with dfs.backup.address and dfs.backup.http.address.
In addition, the checkpoint period and edits-log size threshold can be configured:
fs.checkpoint.period, 1 hour by default.
fs.checkpoint.size, 64 MB by default; the maximum edits-log size. When it is reached, a checkpoint is taken even if the period has not yet elapsed.
The checkpoint directory layout on the Checkpoint node is always identical to the NameNode's.
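These two properties would be set as follows (the values shown are the defaults):

```xml
<!-- illustrative checkpoint settings, defaults shown -->
<configuration>
  <property>
    <name>fs.checkpoint.period</name>
    <value>3600</value>      <!-- seconds: checkpoint every hour -->
  </property>
  <property>
    <name>fs.checkpoint.size</name>
    <value>67108864</value>  <!-- a 64 MB edits log forces a checkpoint -->
  </property>
</configuration>
```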
8. Backup Node
Provides the same functionality as the Checkpoint node, except that it does not need to download fsimage and edits, because its in-memory namespace state is kept synchronized with the NameNode's.
Importing a checkpoint has its own dedicated command and procedure.
9. Load balancer (Rebalancer)
http://hadoop.apache.org/docs/r1.0.4/commands_manual.html#balancer
--------------
HDFS provides a range of interfaces to support the following features:
Reading, writing, querying, deleting, and archiving data, while ensuring consistency.
How is balance maintained in a Hadoop cluster?
Use the balancer to make the distribution of blocks across the cluster more uniform.
10. Hadoop I/O
Data integrity: data is verified with checksums, using CRC-32 (4 bytes per chunk), at a cost of less than 1% overhead. DataNodes verify data integrity during replication, and corruption can be repaired by copying a fresh replica.
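The checksum idea can be illustrated with Python's built-in CRC-32. HDFS checksums each chunk of io.bytes.per.checksum bytes (512 by default); this sketch only shows how a flipped byte is detected:

```python
import zlib

chunk = bytes(range(256)) * 2          # a 512-byte data chunk
checksum = zlib.crc32(chunk)           # 4-byte CRC-32, stored alongside the data
# 4 bytes of checksum per 512 bytes of data is the <1% overhead cited above

# On read, recompute and compare: a match means the chunk is intact.
assert zlib.crc32(chunk) == checksum

# A single corrupted byte changes the CRC, so the corruption is detected.
corrupted = bytearray(chunk)
corrupted[100] ^= 0xFF
assert zlib.crc32(bytes(corrupted)) != checksum
print("corruption detected")
```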
Compression: reduces file storage space and speeds up data transfer over the network and to and from disk. Compression tools include bzip2, gzip, and LZO: bzip2 has the advantage in space, LZO has the advantage in time, and gzip sits in between.
Space: bzip2 > gzip > LZO. Speed: LZO > gzip > bzip2.
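The space side of this trade-off can be sampled with Python's standard library; bz2 and zlib stand in for the Hadoop codecs here, and LZO is omitted because it has no stdlib module:

```python
import bz2
import zlib

data = b"hadoop hdfs mapreduce " * 20_000   # ~440 KB of repetitive text

gz = zlib.compress(data, 6)                 # DEFLATE, the algorithm behind gzip
bz = bz2.compress(data, 9)                  # bzip2 at maximum compression

print(f"original: {len(data)} bytes")
print(f"deflate:  {len(gz)} bytes")
print(f"bzip2:    {len(bz)} bytes")

# Both codecs must round-trip losslessly.
assert zlib.decompress(gz) == data
assert bz2.decompress(bz) == data
```

On real mixed data the ranking above (bzip2 smallest, LZO fastest) generally holds; on tiny or highly repetitive inputs the size difference can vanish.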
Serialization: converting a structured object into a byte stream for network transmission or persistent storage on disk.
Deserialization: the reverse process, converting a byte stream back into a structured object.
RPC (Remote Procedure Call): the communication mechanism between Hadoop nodes, implemented on top of serialization and deserialization.
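Serialization and deserialization in miniature, using Python's struct module rather than Hadoop's Writable interface (a sketch of the idea, not Hadoop's actual wire format):

```python
import struct

# Serialize: a structured record (id, score) becomes a byte stream.
record = (42, 3.14)
wire = struct.pack(">if", *record)   # big-endian 4-byte int + 4-byte float
print(len(wire))                     # 8 bytes

# These bytes can now cross the network (e.g., inside an RPC) or sit on disk.

# Deserialize: the byte stream becomes a structured record again.
rec_id, score = struct.unpack(">if", wire)
assert rec_id == 42
assert abs(score - 3.14) < 1e-6      # float32 round-trip, tiny precision loss
```

A fixed, compact, well-defined byte layout like this is what makes RPC between heterogeneous nodes possible.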
Avro:
A data serialization system that is independent of any programming language; detailed descriptions are available at avro.apache.org. It was designed to address the lack of language portability in Hadoop's own serialization.
Avro uses rich data schemas to drive parsing.