Java Operations on HDFS: Development Environment Setup and the HDFS Read/Write Process


Java Operations on HDFS: Development Environment Setup

We have previously described how to build an HDFS pseudo-distributed environment on Linux and introduced some common HDFS commands. But how do we work with HDFS at the code level? That is what this section covers:

1. First use IDEA to create a Maven project:



Maven's default repository does not host the CDH artifacts, so the CDH repository needs to be configured in pom.xml as follows:

  <repositories>
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>

You also need to edit the settings.xml file and set the value of the <mirrorOf> tag to *,!cloudera, which means the Aliyun mirror is used for everything except the cloudera repository, as follows:

<mirror>
    <id>alimaven</id>
    <name>aliyun maven</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    <mirrorOf>*,!cloudera</mirrorOf>
</mirror>

Then configure the dependencies:

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
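A minimal sketch of the dependencies section is shown below; it assumes the hadoop-client artifact at the CDH version used elsewhere in this article (2.6.0-cdh5.7.0), plus JUnit because the examples below use the @Test/@Before/@After annotations (the JUnit version is an assumption):

  <dependencies>
    <!-- Hadoop client libraries (HDFS, common, etc.) from the Cloudera repository -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.0-cdh5.7.0</version>
    </dependency>
    <!-- JUnit for the unit tests shown below -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
  </dependencies>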

With that, our environment is set up. It is very simple, and this is the benefit of using Maven: we only need to declare dependencies, and Maven automatically downloads all of the required jar packages for us, with no need to add jars manually.

Java API Operations on the HDFS File System

Once the project environment has been built, we can invoke the Hadoop API to manipulate the HDFS file system. Let's write a test case that creates a directory on HDFS:

package org.zero01.hadoop.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.net.URI;

/**
 * @program: hadoop-train
 * @description: Hadoop HDFS Java API operations
 * @author:
 * @create: 2018-03-25 13:59
 **/
public class HdfsApp {

    // HDFS file system server address and port
    public static final String HDFS_PATH = "hdfs://192.168.77.130:8020";

    // The operation object for the HDFS file system
    FileSystem fileSystem = null;

    // The configuration object
    Configuration configuration = null;

    /**
     * Create an HDFS directory
     */
    @Test
    public void mkdir() throws Exception {
        // A Path object needs to be passed in
        fileSystem.mkdirs(new Path("/hdfsapi/test"));
    }

    // Prepare resources
    @Before
    public void setUp() throws Exception {
        configuration = new Configuration();
        // The first parameter is the server URI, the second is the configuration object, the third is the file system user name
        fileSystem = FileSystem.get(new URI(HDFS_PATH), configuration, "root");
        System.out.println("HdfsApp.setUp");
    }

    // Release resources
    @After
    public void tearDown() throws Exception {
        configuration = null;
        fileSystem = null;
        System.out.println("HdfsApp.tearDown");
    }
}

Operation Result:

You can see that it ran successfully. Now go to the server and check whether the directory we created exists:

[[email protected] ~]# hdfs dfs -ls /
Found 3 items
-rw-r--r--   1 root supergroup  311585484 2018-03-24 23:15 /hadoop-2.6.0-cdh5.7.0.tar.gz
drwxr-xr-x   - root supergroup          0 2018-03-25 22:17 /hdfsapi
-rw-r--r--   1 root supergroup         49 2018-03-24 23:10 /hello.txt
[[email protected] ~]# hdfs dfs -ls /hdfsapi
Found 1 items
drwxr-xr-x   - root supergroup          

As shown above, the directory was created successfully.

Next, let's add a method that tests creating a file and writing some content into it:

/**
 * Create a file
 */
@Test
public void create() throws Exception {
    // Create the file
    FSDataOutputStream outputStream = fileSystem.create(new Path("/hdfsapi/test/a.txt"));
    // Write some content into the file
    outputStream.write("hello hadoop".getBytes());
    outputStream.flush();
    outputStream.close();
}

After it executes successfully, go to the server again to check whether the file we created exists and whether its content is what we wrote:

[[email protected] ~]# hdfs dfs -ls /hdfsapi/test
Found 1 items
-rw-r--r--   3 root supergroup         

Having to check on the server after every operation is quite troublesome. In fact, we can also read the content of a file in the file system directly from the code, as in the following example:

/**
 * View the content of a file in HDFS
 */
@Test
public void cat() throws Exception {
    // Open the file for reading
    FSDataInputStream in = fileSystem.open(new Path("/hdfsapi/test/a.txt"));
    // Copy the file content to the console; the third parameter is the buffer size in bytes
    IOUtils.copyBytes(in, System.out, 1024);
    in.close();
}

Now that we know how to create a directory, create a file, and read a file's content, perhaps we also need to know how to rename a file, as in the following example:

/**
 * Rename a file
 */
@Test
public void rename() throws Exception {
    Path oldPath = new Path("/hdfsapi/test/a.txt");
    Path newPath = new Path("/hdfsapi/test/b.txt");
    // The first parameter is the original file name, the second is the new name
    fileSystem.rename(oldPath, newPath);
}

Create, read, and update (rename) are covered; the last operation is delete, as in the following example:

/**
 * Delete a file
 * @throws Exception
 */
@Test
public void delete() throws Exception {
    // The second parameter specifies whether to delete recursively: false = no, true = yes
    fileSystem.delete(new Path("/hdfsapi/test/mysql_cluster.iso"), false);
}

With file creation, deletion, reading, and renaming all covered, let's see how to upload a local file to the HDFS file system. I have a local.txt file here whose content is as follows:

This is a local file

Write the test code as follows:

/**
 * Upload a local file to HDFS
 */
@Test
public void copyFromLocalFile() throws Exception {
    Path localPath = new Path("E:/local.txt");
    Path hdfsPath = new Path("/hdfsapi/test/");
    // The first parameter is the local file path, the second is the HDFS path
    fileSystem.copyFromLocalFile(localPath, hdfsPath);
}

After the above method executes successfully, check HDFS to see whether the copy succeeded:

[[email protected] ~]# hdfs dfs -ls /hdfsapi/test/
Found 2 items
-rw-r--r--   3 root supergroup         12 2018-03-25 22:33 /hdfsapi/test/b.txt
-rw-r--r--   3 root supergroup         

The above demonstrates uploading a small file. If you need to upload a larger file and want a progress indicator, you have to use the following method:

/**
 * Upload a large local file to HDFS and display a progress indicator
 */
@Test
public void copyFromLocalFileWithProgress() throws Exception {
    InputStream in = new BufferedInputStream(new FileInputStream(new File("E:/Linux Install/mysql_cluster.iso")));
    FSDataOutputStream outputStream = fileSystem.create(new Path("/hdfsapi/test/mysql_cluster.iso"), new Progressable() {
        public void progress() {
            // Print the progress indicator
            System.out.print(".");
        }
    });
    IOUtils.copyBytes(in, outputStream, 4096);
    in.close();
    outputStream.close();
}

Similarly, after the above method executes successfully, check HDFS to see whether the upload succeeded:

[[email protected] ~]# hdfs dfs -ls -h /hdfsapi/test/
Found 3 items
-rw-r--r--   3 root supergroup         12 2018-03-25 22:33 /hdfsapi/test/b.txt
-rw-r--r--   3 root supergroup         20 2018-03-25 22:45 /hdfsapi/test/local.txt
-rw-r--r--   3 root supergroup    812.8 M 2018-03-25 23:01 /hdfsapi/test/mysql_cluster.iso
[[email protected] ~]#

Where there is upload there is naturally download, and since there are two ways to upload files, there are also two ways to download them, as in the following examples:

/**
 * Download an HDFS file, method 1
 */
@Test
public void copyToLocalFile1() throws Exception {
    Path localPath = new Path("E:/b.txt");
    Path hdfsPath = new Path("/hdfsapi/test/b.txt");
    fileSystem.copyToLocalFile(hdfsPath, localPath);
}

/**
 * Download an HDFS file, method 2
 */
@Test
public void copyToLocalFile2() throws Exception {
    FSDataInputStream in = fileSystem.open(new Path("/hdfsapi/test/b.txt"));
    OutputStream outputStream = new FileOutputStream(new File("E:/b.txt"));
    IOUtils.copyBytes(in, outputStream, 1024);
    in.close();
    outputStream.close();
}
    • Note: The first download method shown above may throw a NullPointerException on Windows, so the second approach is recommended on Windows.
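If you still want to use copyToLocalFile on Windows, one commonly suggested workaround is the four-argument overload with useRawLocalFileSystem set to true, which writes through the raw local file system and avoids creating the local .crc checksum file; a minimal sketch:

/**
 * Download an HDFS file on Windows using the raw local file system
 */
@Test
public void copyToLocalFileRaw() throws Exception {
    Path hdfsPath = new Path("/hdfsapi/test/b.txt");
    Path localPath = new Path("E:/b.txt");
    // delSrc = false: keep the source file on HDFS
    // useRawLocalFileSystem = true: write the local file directly, without a .crc sidecar file
    fileSystem.copyToLocalFile(false, hdfsPath, localPath, true);
}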

Next, let's see how to list all the files under a directory, for example:

/**
 * List all the files under a directory
 *
 * @throws Exception
 */
@Test
public void listFiles() throws Exception {
    FileStatus[] fileStatuses = fileSystem.listStatus(new Path("/hdfsapi/test/"));
    for (FileStatus fileStatus : fileStatuses) {
        System.out.println("This is a: " + (fileStatus.isDirectory() ? "directory" : "file"));
        System.out.println("Replication factor: " + fileStatus.getReplication());
        System.out.println("Size: " + fileStatus.getLen());
        System.out.println("Path: " + fileStatus.getPath() + "\n");
    }
}

The console prints the following results:

This is a: file
Replication factor: 3
Size: 12
Path: hdfs://192.168.77.130:8020/hdfsapi/test/b.txt

This is a: file
Replication factor: 3
Size: 20
Path: hdfs://192.168.77.130:8020/hdfsapi/test/local.txt

This is a: file
Replication factor: 3
Size: 852279296
Path: hdfs://192.168.77.130:8020/hdfsapi/test/mysql_cluster.iso

Note that the console output reveals a problem: we set a replication factor of 1 in hdfs-site.xml, so why do these files show a replication factor of 3?

This is because these files were uploaded from the local machine through the Java API, and we did not set a replication factor on the client side, so Hadoop's default replication factor of 3 is used.

If we had put the files onto the server with the hdfs command, they would use the replication factor set in the configuration file. If you do not believe it, change the path in the code to the root directory, and the console output looks like this:

This is a: file
Replication factor: 1
Size: 311585484
Path: hdfs://192.168.77.130:8020/hadoop-2.6.0-cdh5.7.0.tar.gz

This is a: directory
Replication factor: 0
Size: 0
Path: hdfs://192.168.77.130:8020/hdfsapi

This is a: file
Replication factor: 1
Size: 49
Path: hdfs://192.168.77.130:8020/hello.txt

The files in the root directory were put there earlier with the hdfs command, so their replication factor is the one we set in the configuration file.
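If files uploaded through the Java API should also get a replication factor of 1, one option is to set the standard dfs.replication property on the client-side Configuration before obtaining the FileSystem; a minimal sketch of how the setUp method above could do this:

// Prepare resources
@Before
public void setUp() throws Exception {
    configuration = new Configuration();
    // Client-side setting: files created through this FileSystem object get 1 replica
    configuration.set("dfs.replication", "1");
    fileSystem = FileSystem.get(new URI(HDFS_PATH), configuration, "root");
}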

HDFS Write Data Flow

Regarding the HDFS write data flow, I found online a very simple and easy-to-understand comic that explains the HDFS principles; the author is unknown. It is much easier to follow than the usual slides and is a rare piece of study material, so it is excerpted here.

1. Three roles: the client, the NameNode (which can be understood as the master controller and file index, similar to a Linux inode), and the DataNode (the storage server that holds the actual data)

2. HDFS write data process:


HDFS Read Data Flow

3. HDFS read data process

4. Fault tolerance, part one: fault types and how they are detected (node/server faults, network faults, and dirty data)

5. Fault tolerance, part two: read/write fault tolerance

6. Fault tolerance, part three: DataNode failure

7. Backup rules

8. Concluding remarks

There is also a Chinese version of this comic; the address is as follows:

https://www.cnblogs.com/raphael5200/p/5497218.html

Advantages and Disadvantages of the HDFS File System

HDFS Benefits:

    • Data redundancy (multiple replicas) and hardware fault tolerance
    • Streaming data access: write once, read many times
    • Well suited to storing large files
    • Can be built on inexpensive machines, saving costs

HDFS Disadvantages:

    • Not suitable for low-latency data access
    • Cannot efficiently store large numbers of small files
      • Even a file of only 1 MB has its own metadata, so a large number of small files means more metadata, which takes up more storage space and puts extra pressure on the NameNode
