Alibabacloud.com offers a wide variety of articles about copying directories from HDFS to HDFS; you can easily find the relevant information here online.
1. Distributing compressed files on HDFS (-cacheArchive)
Requirement: run WordCount, but count only a specified list of words ("the", "and", "had", ...). The word list is stored in a compressed archive on HDFS, and the archive may contain multiple files; it is distributed to the task nodes with -cacheArchive:
-cacheArchive hdfs://host:port/path/to/file.tar
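A minimal sketch of such a streaming job, assuming the usual Hadoop 2.x location of the streaming jar and hypothetical input/output paths and mapper/reducer scripts (mapper.py, reducer.py); the #wordwhite suffix is the link name under which the archive is unpacked in each task's working directory:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/test/wordcount/input \
    -output /user/test/wordcount/output \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -file mapper.py \
    -file reducer.py \
    -cacheArchive "hdfs://host:port/path/to/file.tar#wordwhite"
The mapper can then read the word list from the local directory ./wordwhite/ as if it were an ordinary local directory.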
1. Import the Hadoop jar packages: add the jar packages under the hadoop/share/common/ directory, the hadoop/share/common/lib/ directory, and the hadoop/hdfs/ directory to Eclipse.
2. Start coding:
static FileSystem fs = null;
public static void main(String[] args) throws Exception {
    // ...
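The snippet above is only the beginning of the program. As a rough sketch of how such a program might be completed for this page's topic, copying a directory from one HDFS location to another (the hdfs://master:9000 address and the /src and /dst paths are placeholders, not values from the original article):
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class HdfsCopyDemo {
    static FileSystem fs = null;
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the cluster; the NameNode address is a placeholder
        fs = FileSystem.get(new URI("hdfs://master:9000"), conf);
        // Copy /src and everything under it to /dst within the same file system;
        // the fifth argument controls whether the source is deleted afterwards
        FileUtil.copy(fs, new Path("/src"), fs, new Path("/dst"), false, conf);
        fs.close();
    }
}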
First, build the Hadoop development environment
The code we write at work runs on servers, and code that operates on HDFS is no exception. During development we use Eclipse on Windows as the development environment and access an HDFS instance running in a virtual machine; that is, we access a remote HDFS from the local development machine.
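A minimal sketch of such a remote connection (the IP address, port, and user name are assumptions for illustration, not values from the original article):
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class RemoteHdfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the NameNode in the virtual machine as a specific user,
        // so the local Windows user name does not cause permission errors
        FileSystem fs = FileSystem.get(
                new URI("hdfs://192.168.56.101:9000"), conf, "hadoop");
        System.out.println("Connected: " + fs.getUri());
        fs.close();
    }
}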
HDFS data blocks
A disk data block is the smallest unit of disk reads and writes, typically 512 bytes.
HDFS also has data blocks, with a default size of 64 MB, so large files on HDFS are split into many chunks. A file smaller than a block (less than 64 MB) does not occupy the entire block's space.
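To see this splitting in practice, one option (sketched here with a placeholder path) is to ask fsck to list a file's blocks and their locations:
hdfs fsck /user/test/big.log -files -blocks -locations
Each block line in the output corresponds to one chunk of the file and shows which DataNodes hold its replicas.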
This also has a negative impact: when the edits log grows large, NameNode startup becomes very slow. To address this, the SecondaryNameNode provides the ability to merge fsimage and edits. It first copies the data from the NameNode, then performs the merge, and returns the merged result to the NameNode; in addition, it retains a local backup. This not only speeds up NameNode startup but also adds redundancy for the NameNode's data.
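How often this checkpointing happens is configurable; a sketch of the relevant hdfs-site.xml properties, assuming Hadoop 2.x property names (the values shown are the usual defaults):
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value> <!-- checkpoint at least once per hour -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value> <!-- or after this many uncheckpointed transactions -->
</property>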
HDFS is one of the components we use most often in big data, and it is an indispensable framework in the Hadoop ecosystem, so when we get started with Hadoop we need a certain understanding of it. First of all, we all know that HDFS is the distributed file system in the Hadoop ecosystem.
With a huge number of small files, the pressure on the NameNode is enormous!
However, a block size that is too large is not good either, because single reads and writes become slower and retransmission after an error is more costly.
The smaller the blocks, the more pressure on the NameNode's memory.
Therefore, we need to choose the block size according to the actual situation; generally 64 MB, 128 MB, and 256 MB are common choices.
Specific modification: copy the relevant property from the default configuration into hdfs-site.xml and change its value there.
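A sketch of what that override might look like in hdfs-site.xml, assuming the Hadoop 2.x property name dfs.blocksize (the value below sets 128 MB and is only an example):
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB, in bytes -->
</property>
Newer releases also accept a value with a unit suffix, such as 128m.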
The client keeps an ack queue of the data packets it has sent and waits for the DataNodes in the pipeline to confirm that the data has been written successfully.
If writing to a DataNode fails:
The pipeline is closed, and the packets in the ack queue are put back at the front of the data queue.
The current block on the DataNodes that have already written it is given a new identity by the NameNode (metadata node), so that when the failed node recovers it will notice that its copy of the block is outdated and delete it.
The failed DataNode is removed from the pipeline, and the remaining data is written to the other DataNodes; the NameNode will later notice that the block is under-replicated and arrange for another replica.
Design objectives:
- Hardware failure is the norm rather than the exception, so hardware errors must be detected and handled automatically and quickly.
- Streaming data access (batch processing of data).
- Moving computation is cheaper than moving the data itself (reduces data transfer).
- A simple data-consistency model (a write-once, read-many file access model).
- Portability across heterogeneous platforms.
HDFS Architecture
It adopts a master/slave architecture:
NameNode: the central server (master)
Adding and removing HDFS nodes and performing HDFS balancing
Method 1: statically add a DataNode (stop the NameNode)
1. Stop the NameNode.
2. Modify the slaves file and push it to every node.
3. Start the NameNode.
4. Execute the Hadoop balance command, as shown below. (This balances the cluster and is not required if you are only adding a node.)
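A sketch of running the balancer after adding a node (the threshold value is only an example; it means each DataNode's utilization may differ from the cluster average by at most that many percentage points):
hdfs balancer -threshold 10
On older releases the equivalent is start-balancer.sh -threshold 10, and the balancer can be stopped at any time with stop-balancer.sh.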
-----------------------------------------
Method 2:
The FileStatus class in Hadoop can be used to view the metadata of files or directories in HDFS; any file or directory has a corresponding FileStatus. Here is a simple demo of the relevant API of this class:
 */
package com.charles.hadoop.fs;

import java.net.URI;
import java.sql.Timestamp;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
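The imports above are only the beginning of the demo; the rest is missing here. A rough sketch of how such a demo typically continues (the hdfs://master:9000 address and the /user/charles path are placeholders, not from the original article):
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileStatusDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://master:9000"), conf);
        FileStatus status = fs.getFileStatus(new Path("/user/charles"));
        // Print the metadata carried by FileStatus
        System.out.println("path: " + status.getPath());
        System.out.println("isDirectory: " + status.isDirectory());
        System.out.println("length: " + status.getLen());
        System.out.println("blockSize: " + status.getBlockSize());
        System.out.println("replication: " + status.getReplication());
        System.out.println("owner/group: " + status.getOwner() + "/" + status.getGroup());
        System.out.println("permission: " + status.getPermission());
        System.out.println("modified: " + new Timestamp(status.getModificationTime()));
        fs.close();
    }
}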
hadoop 2.7.1 performance: [screenshot: a8.png] After writing multiple batches of files to HDFS on the test cluster upgraded to hadoop 2.7.1, the client no longer reported the timeout and "all Datanode bad..." exceptions, and the server side did not report timeout exceptions either. In addition, this bug was found to
D1 and R1 are both switches, and the bottom layer consists of DataNodes. Then H1's rackid = /D1/R1/H1; H1's parent is R1, and R1's parent is D1. The rack IDs can be provided through the topology.script.file.name configuration. With the rackid information, the distance between any two DataNodes can be calculated.
Distance(/D1/R1/H1, /D1/R1/H1) = 0  (the same DataNode)
Distance(/D1/R1/H1, /D1/R1/H2) = 2  (different DataNodes in the same rack)
Distance(/D1/R1/H1, /D1/R2/H4) = 4  (different DataNodes in the same IDC)
Distance(/D1/R1/H1, /D2/R3/H7) = 6  (DataNodes in different IDCs)
The distance is simply the number of hops from each node up to their lowest common ancestor in the topology tree.
HDFS Design Principles
1. Very large files:
"Very large" here means files of hundreds of MB, GB, or even TB; Yahoo's Hadoop clusters can already store PB-scale data.
2. Streaming data access:
Based on a write-once, read-many-times access pattern.
3. Commodity hardware:
HDFS achieves high availability in software, so there is no need for expensive hardware to guarantee availability; commodity PCs or virtual machines are sufficient.
can store. It also simplifies the handling of metadata, because a block stores only file data; a file's metadata, such as permission information, does not need to be stored with the block and can be managed separately by another part of the system. Blocks are also well suited to replication for fault tolerance and availability: each block is copied to a small number of separate machines (three by default), which ensures that data is not lost when a block, disk, or machine fails. If a block becomes unavailable, a replica can be read from another machine.
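As a quick illustration of adjusting the replication factor for an existing path (the path and the factor below are placeholders), the shell command sets the new factor and waits for the copies to be created:
hdfs dfs -setrep -w 3 /user/test/data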
Hadoop HDFS clusters are prone to unbalanced disk utilization across machines, for example after new DataNodes are added to the cluster. When HDFS is unbalanced, many problems arise: MapReduce programs cannot take good advantage of local computation, network bandwidth between machines cannot be used effectively, and machine disks cannot be utilized evenly. It can be seen that balancing the data is necessary.
All the source code is on GitHub: https://github.com/lastsweetop/styhadoop
Reading data using a Hadoop URL
A simpler way to read HDFS data is to open a stream through java.net.URL, but beforehand you must call URL's setURLStreamHandlerFactory method with an FsUrlStreamHandlerFactory (this factory handles parsing of the hdfs scheme). This method can only be invoked once per JVM, so it is usually called in a static block.
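A compact sketch of this approach (the hdfs://master:9000/path/to/file URL is a placeholder):
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class UrlCat {
    static {
        // May only be called once per JVM
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }
    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL("hdfs://master:9000/path/to/file").openStream();
            IOUtils.copyBytes(in, System.out, 4096, false); // copy the file to stdout
        } finally {
            IOUtils.closeStream(in);
        }
    }
}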
When reprinting, please indicate the source: http://blog.csdn.net/lastsweetop/article/details/9001467
The content of this page is sourced from the Internet and does not represent Alibaba Cloud's opinion;
products and services mentioned on this page have no relationship with Alibaba Cloud. If the
content of the page is confusing, please write us an email, and we will handle the problem
within 5 days of receiving your email.
If you find any instances of plagiarism from the community, please send an email to:
info-contact@alibabacloud.com
and provide relevant evidence. A staff member will contact you within 5 working days.