This post completes the unfinished part of the previous section, and then analyzes the internal workings of HDFS file reads and writes.
Enumerating Files
The listStatus() method of FileSystem (org.apache.hadoop.fs.FileSystem) lists the contents of a directory.
public FileStatus[] listStatus(Path f) throws FileNotFoundException, IOException;
public FileStatus[] listStatus(Path[] files) throws FileNotFoundException, IOException;
public FileStatus[] listStatus(Path f, PathFilter filter) throws FileNotFoundException, IOException;
public FileStatus[] listStatus(Path[] files, PathFilter filter) throws FileNotFoundException, IOException;
All of these methods take Path parameters. If the path is a file, the return value is an array containing a single element: the FileStatus object for the file that path refers to. If the path is a directory, the returned array holds the FileStatus of every file and directory under that directory, and may have length 0. If the parameter is a Path[], the result is equivalent to calling the single-Path version for each path and consolidating the return values into one array. If a PathFilter is supplied, it filters the returned files and directories, keeping only those that satisfy a condition defined by the developer; its usage is similar to java.io.FileFilter. The following program receives a set of paths, calls listStatus() on them, and prints the resulting paths.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ListStatus {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path[] paths = new Path[args.length];
    for (int i = 0; i < paths.length; i++) {
      paths[i] = new Path(args[i]);
    }
    FileStatus[] status = fs.listStatus(paths);
    Path[] listedPaths = FileUtil.stat2Paths(status);
    for (Path p : listedPaths) {
      System.out.println(p);
    }
  }
}
Upload the program and run:
$ hadoop ListStatus / /user /user/norris
This lists all files and directories under /, /user/, and /user/norris/. For how to run a program under Hadoop, see the previous blog post (http://blog.csdn.net/norriszhang/article/details/39648857).
File Patterns: listing files and directories with wildcards
The globStatus() method of FileSystem lists files and directories using wildcards ("glob" means a wildcard pattern). The wildcard characters supported by FileSystem are:
*: matches 0 or more characters
?: matches exactly 1 character
[ab]: matches one of the characters listed in the square brackets
[^ab]: matches a character not listed in the square brackets
[a-b]: matches a character in the range listed in the square brackets
[^a-b]: matches a character not in the range listed in the square brackets
{a,b}: matches either a or b
\c: escape; if c is a metacharacter, this matches the character itself, e.g. \[ matches the character [
public FileStatus[] globStatus(Path pathPattern) throws IOException;
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException;
Although path patterns are very powerful, there are situations they cannot express, such as excluding one particular file; in those cases you need a PathFilter.
package org.apache.hadoop.fs;

public interface PathFilter {
  boolean accept(Path path);
}
This is the definition of the PathFilter interface: implement the accept() method and return whether the path should be selected. The accept() method receives only a Path, so the only useful information available is the path and file name; things like modification time, permissions, owner, and size cannot be obtained. It is not as powerful as FileStatus, so if you want to select files by modification time, you should encode a timestamp in the file name. (Of course, you could look that information up again through FileSystem, but that is probably not worth the cost; I am not sure exactly how expensive it would be.) A sketch of a filter implementation is shown below.
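As a hedged illustration, here is a minimal PathFilter sketch that excludes any path matching a regular expression; the class name RegexExcludePathFilter and the example paths are my own, not part of the Hadoop API.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Minimal sketch: select a path only if it does NOT match the exclusion regex.
public class RegexExcludePathFilter implements PathFilter {

  private final String regex;

  public RegexExcludePathFilter(String regex) {
    this.regex = regex;
  }

  @Override
  public boolean accept(Path path) {
    return !path.toString().matches(regex);
  }
}

It could then be combined with globStatus(), for example fs.globStatus(new Path("/2014/*/*"), new RegexExcludePathFilter("^.*/2014/12/31$")), to list everything the glob matches except the excluded directory (the paths here are made up for illustration).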
Deleting Files
The delete() method of FileSystem deletes a file or directory (permanently).
public boolean delete(Path f, boolean recursive) throws IOException;
It deletes f. If f is a file or an empty directory, it is deleted regardless of what is passed for recursive. If f is a non-empty directory, its contents are deleted only when recursive is true; when recursive is false nothing is deleted and an IOException is thrown. A short usage sketch follows.
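A minimal, hedged sketch of calling delete(); the filesystem URI and the path /user/norris/tmp are made up for illustration.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
    // recursive = true so a non-empty directory is removed as well.
    boolean deleted = fs.delete(new Path("/user/norris/tmp"), true);
    System.out.println("deleted: " + deleted);
  }
}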
The following is a deeper analysis of the data flow when HDFS reads and writes files.
Anatomy of a File Read
Step 1: The client opens a file by calling FileSystem.open(); for HDFS this actually invokes the open() method of a DistributedFileSystem instance.
Step 2: DistributedFileSystem contacts the NameNode through a remote procedure call (RPC) to get the locations of the first few blocks of the file. For each block, the NameNode returns all the DataNodes that hold a copy of it; for example, with dfs.replication configured as 3, 3 DataNodes are returned per block. These nodes are sorted by distance from the client; if the client requesting the read is itself running on a DataNode that holds the block, that DataNode is ranked first (common for map tasks) and the client reads the data locally. How the distance from the client is judged is discussed under network topology below. The open() method of DistributedFileSystem returns an FSDataInputStream, which wraps a DFSInputStream; the DFSInputStream is what actually manages the I/O with the DataNodes and the NameNode.
Step 3: The client calls FSDataInputStream.read(). The stream has cached the addresses of the DataNodes holding the first blocks of the file, so it connects to the first (i.e. nearest) DataNode for the first block and starts reading.
Step 4: read() is called repeatedly, and data flows continuously from the DataNode to the client.
Step 5: When a block has been read completely, DFSInputStream closes the connection to the current DataNode and opens a connection to the best DataNode for the next block. This is transparent to the client, which simply sees one continuous stream.
Step 6: Blocks are read one after another in this way. When more block locations are needed, DFSInputStream asks the NameNode for the next batch of block locations, until the client stops reading and calls FSDataInputStream.close() to end the whole read.
During the read, if DFSInputStream encounters an error communicating with a DataNode, it tries to fetch the current block from the next closest DataNode. DFSInputStream also remembers the DataNodes that failed, so that it does not try them again for later blocks. DFSInputStream additionally verifies the checksum of the data read from each DataNode; if the checksum check fails, it reports the corrupted block to the NameNode before trying another DataNode that holds the current block. The most important point of this design is that the client, guided by the NameNode, reads data directly from the most suitable DataNode. This lets HDFS support large-scale concurrency, because read traffic is spread across all the nodes of the cluster while the NameNode only serves location information from memory and never serves the data itself; if clients had to fetch data through the NameNode, the number of clients would be severely limited. A hedged sketch of the client-side read API is shown below.
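As an illustration of the client's view of this read path (the pipeline of steps above happens inside FSDataInputStream), here is a minimal sketch, assuming a file URI supplied on the command line; it simply copies the stream to standard output.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFile {
  public static void main(String[] args) throws Exception {
    String uri = args[0]; // e.g. hdfs://localhost/user/norris/some-file (example path)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      // open() returns an FSDataInputStream wrapping a DFSInputStream, which
      // fetches block locations from the NameNode and reads from DataNodes.
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}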
How Hadoop decides which DataNode is closest to the client
What does "near" mean on a network? Bandwidth is the scarcest resource when big data flows, so it is reasonable to define the distance between two nodes in terms of the bandwidth between them. In practice, with so many nodes, measuring the bandwidth between every pair is unrealistic, so Hadoop takes a compromise: it treats the network as a tree, and the distance between two nodes is the number of steps each must walk up through its parent, grandparent, ancestor... until the two reach a common ancestor, summed over both nodes. No rule says how many levels the tree must have, but the usual practice is three levels: data center, rack, node. The closer to the root, the smaller the communication bandwidth: communication between two data centers is slower than between two racks, and communication between two racks is slower than between two nodes in the same rack. So, from fast to slow:
- the same machine
- two nodes in the same rack
- two nodes in different racks in the same data center
- two nodes in different data centers
If D denotes a data center, R a rack, and N a node, then /d1/r1/n1 denotes node 1 on rack 1 of data center 1.
- distance(/d1/r1/n1, /d1/r1/n1) = 0 // the same machine
- distance(/d1/r1/n1, /d1/r1/n2) = 2 // two machines on the same rack; each is 1 step from the common parent r1, so the distance is 2
- distance(/d1/r1/n1, /d1/r2/n3) = 4 // two racks in the same data center
- distance(/d1/r1/n1, /d2/r3/n4) = 6 // different data centers
A small sketch of this distance computation follows.
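A minimal sketch (my own illustration, not Hadoop's internal implementation) of computing this tree distance from /datacenter/rack/node path strings:

public class TopologyDistance {

  // Distance = steps from each node up to their deepest common ancestor, summed.
  static int distance(String a, String b) {
    String[] pa = a.split("/");
    String[] pb = b.split("/");
    int common = 0;
    while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
      common++;
    }
    return (pa.length - common) + (pb.length - common);
  }

  public static void main(String[] args) {
    System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0, same machine
    System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2, same rack
    System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4, same data center
    System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6, different data centers
  }
}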
Finally, Hadoop has no way of knowing your network topology by itself, so you have to tell it through configuration. By default it considers all nodes to be on the same rack, i.e. the distance between any two nodes is the same. For a small cluster this default is sufficient, but larger clusters need more configuration; this is covered later, in the discussion of cluster configuration.
Anatomy of a File Write
How does a file get written into HDFS? The following description may seem overly detailed, but it helps in understanding the data consistency model of HDFS. Consider the process of creating a new file, writing data to it, and then closing it:
Step 1: The client calls DistributedFileSystem.create() to create a file.
Step 2: DistributedFileSystem issues a remote procedure call (RPC) to the NameNode to create the file, but the NameNode does not yet associate it with any block. The NameNode performs many checks at this step, ensuring that the file does not already exist, that the client has permission to create it, and so on. If the checks pass, the NameNode creates a record for the new file; otherwise creation fails and the client gets an IOException. DistributedFileSystem returns an FSDataOutputStream; as with reading, the FSDataOutputStream wraps a DFSOutputStream, which actually handles the communication with the DataNodes and the NameNode.
Step 3: The client writes data to the DFSOutputStream, which splits the data into packets and puts them on a queue called the data queue. The DataStreamer is responsible for asking the NameNode for new blocks. Each new block is allocated on n nodes (3 by default), and those 3 nodes form a pipeline.
Step 4: The DataStreamer streams the packets in the data queue to the first node in the pipeline, the first node forwards them to the second, and the second forwards them to the third.
Step 5: DFSOutputStream also maintains an internal acknowledgement queue, the ack queue, holding the packets that have been sent; a packet is removed only after every DataNode in the pipeline has acknowledged receiving it. If any DataNode fails while receiving, then: first, the pipeline is closed and the packets in the ack queue are put back at the head of the data queue, so that nodes downstream of the failed node do not lose data. The nodes that did receive the data successfully are given a new identity for the block after negotiating with the NameNode, so that the partial data can be deleted when the failed node recovers. Then the pipeline is reopened with the bad node removed, and the data continues to flow to the remaining good nodes. When those nodes finish, the pipeline is short one replica; the NameNode is told that the block has not reached the configured replication factor, success is returned anyway, and the NameNode later schedules an asynchronous task to bring the replica count back up to the configured value. After that, subsequent packets and blocks are written normally. All of this is transparent to the client, which does not know any of it happened and only knows that the write succeeded. What if several DataNodes fail? hdfs-site.xml has a setting, dfs.replication.min, whose default value is 1, meaning that as long as at least 1 DataNode received the data successfully, the write is considered successful and the client gets a successful return; Hadoop later starts an asynchronous task to restore the replica count to the value of dfs.replication (default 3).
Step 6: When the client has finished writing data, it calls close() on the stream, which flushes all remaining packets in the data queue into the pipeline.
Step 7: Once every packet has been acknowledged as written, the client notifies the NameNode that the file write is complete.
Because the DataStreamer asks the NameNode for block locations before writing, by the time the file write completes the NameNode already knows where every block lives; it only has to wait until the blocks reach the minimum replication before it can return success. A hedged sketch of the client-side write API follows.
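From the client's point of view, all of the pipelining above is hidden behind FSDataOutputStream. A minimal sketch, assuming a local file to copy and a made-up destination URI:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteFile {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0]; // local file to copy
    String dst = args[1];      // e.g. hdfs://localhost/user/norris/copy (example path)
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    // create() returns an FSDataOutputStream wrapping a DFSOutputStream,
    // which packetizes the data and pushes it down the DataNode pipeline.
    OutputStream out = fs.create(new Path(dst));
    IOUtils.copyBytes(in, out, 4096, true); // true: close both streams when done
  }
}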
How does the NameNode choose which nodes a block is written to? Hadoop makes a trade-off here. Writing all replicas on the same node, or on nodes in the same rack, would certainly be the most efficient, because the transfer bandwidth is the largest, but then there is no distributed redundancy: if that node fails, or the rack loses power, the data can no longer be read. At the other extreme, writing to three random machines, preferably in different data centers, would be safest but far too inefficient. Even within one data center there are many possible choices. Starting with the 1.x releases, this placement strategy became pluggable. The current Hadoop default policy is:
First replica: if the client is running on a node of the cluster, the first replica is stored on that node; if the client is not running in the cluster, the first replica's node is chosen at random, though the random choice avoids nodes that already hold a lot of data or are handling heavy traffic.
Second replica: a random node that is not on the same rack as the first.
Third replica: another random node on the same rack as the second.
More: if more replicas are needed, the remaining nodes are chosen at random, spread across racks as much as possible so that no single rack holds too many replicas.
* At the time the book was written, Hadoop did not support deployments spanning data centers; whether current versions have removed this restriction, and whether the policy would then account for multiple data centers, is unclear. Overall, this strategy balances reliability (replicas are spread over different racks), write bandwidth (only one cross-rack hop is needed), read performance (a choice of two racks when reading), and even distribution of data across the cluster.
Data Consistency Model
For a file system, the data consistency model specifies when data written by one write operation becomes visible to other read operations. HDFS trades away some consistency, so you may not be able to read data right after it is written the way you can with a local file system. When a file is created it is immediately visible, but data written into it, even after calling flush, may not be visible to readers, and the file length may still appear to be 0. (In the previous post's experiment with the progress callback during a write, I slept for 1 second in the callback while the write was in progress and kept checking how large the file had become on the other side; it stayed at 0 until the program finished writing, and only then did the true size appear. At the time I thought it was because flush had not been called; it now seems to be a consequence of this consistency model.) The HDFS consistency model works at block granularity: once a block is finished, readers can see that block's data; an unfinished block is not visible. When block 1 is finished, other readers can see its contents while block 2 is being written but not yet visible; once block 2 completes and block 3 starts being written, block 2's data becomes visible to other readers. HDFS provides a method, FSDataOutputStream.sync(), that forces the data written so far to become visible to other readers. In releases after 1.x, sync() is deprecated in favor of hflush(), and there is also an hsync() method that promises stronger consistency, but at the time the book was written hsync() was not yet implemented and simply called hflush(). Closing the file performs an implicit sync(): once a file is closed, all of its data is visible to other readers.
This consistency model affects how you design applications. Developers should be aware that, if data is read while a write is in progress, or if the client or system fails, readers may miss up to a block's worth of data. If that is not acceptable for your application, you need to call hflush() at appropriate points; but frequent calls to hflush() hurt throughput, so you have to weigh robustness against throughput and choose a suitable frequency for calling hflush(). A small sketch of using hflush() follows.
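A minimal, hedged sketch of forcing visibility with hflush(); the filesystem URI and destination path are made up, and hflush() is the post-1.x replacement for sync() described above.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
    Path p = new Path("/user/norris/hflush-demo.txt"); // example path
    FSDataOutputStream out = fs.create(p);
    out.write("first record\n".getBytes("UTF-8"));
    // Without this call, a reader opening the file now might still see no data.
    out.hflush(); // after hflush(), the data written so far is visible to new readers
    out.write("second record\n".getBytes("UTF-8"));
    out.close(); // closing makes all remaining data visible, like an implicit flush
  }
}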
Importing data with Flume and Sqoop
Rather than writing your own program to push data into HDFS, it is usually better to use existing tools, because very mature tools already exist for this job and they cover most needs. Flume is an Apache tool for moving massive amounts of data; a typical use is to deploy Flume on the web server machines, collect the web server logs, and import them into HDFS, and it supports writing various kinds of logs. Sqoop is another Apache tool, used to bulk import large amounts of structured data into HDFS, for example importing data from a relational database into Hive. Hive is the data warehouse that runs on Hadoop, described later in this chapter.