Operating HDFS from Java: Development Environment Setup and the HDFS Read/Write Process


Setting Up a Java HDFS Development Environment

We have previously described how to build an HDFS pseudo-distributed environment on Linux and introduced some common HDFS commands. But how do you work with HDFS at the code level? That is what this section covers.

1. First, use IDEA to create a Maven project:
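(If you prefer the command line to the IDE, an equivalent project can be generated with the Maven quickstart archetype. This is only a sketch; the groupId and artifactId below are assumed from the package name and @program tag that appear in the code later in this article.)

mvn archetype:generate \
  -DgroupId=org.zero01.hadoop \
  -DartifactId=hadoop-train \
  -DarchetypeArtifactId=maven-archetype-quickstart \
  -DinteractiveMode=false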



Maven's default repositories do not host the CDH artifacts, so the CDH repository needs to be configured in pom.xml, as follows:

  <repositories>
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>

You also need to edit the settings.xml file and set the value of the <mirrorOf> tag to *,!cloudera, which means the Aliyun mirror serves every repository except cloudera, so requests for the Cloudera repository go to it directly, as follows:

<mirror>
    <id>alimaven</id>
    <name>aliyun maven</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    <mirrorOf>*,!cloudera</mirrorOf>
</mirror>

Then configure the dependencies:

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
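(The dependency list in the original post is cut off at this point. As a rough sketch only, this is what the dependencies section typically contains for this setup: the hadoop-client version is assumed from the hadoop-2.6.0-cdh5.7.0 tarball shown later in this article, and JUnit is assumed because the examples below are written as JUnit test cases.)

  <dependencies>
    <!-- Hadoop client API; version assumed from the CDH tarball used later in this article -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.0-cdh5.7.0</version>
    </dependency>
    <!-- JUnit, used by the @Test cases below; the version is an assumption -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
  </dependencies>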

With that, our environment is set up. It is very simple, and this is the benefit of using Maven: we only need to declare the dependencies, and Maven automatically downloads all the jar packages for us, with no need to add jars manually.

Operating the HDFS File System with the Java API

Once the project environment has been built, we can invoke the Hadoop API to operate on the HDFS file system. Let's write a test case that creates a directory on HDFS:

package org.zero01.hadoop.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.net.URI;

/**
 * @program: hadoop-train
 * @description: Hadoop HDFS Java API operations
 * @author:
 * @create: 2018-03-25 13:59
 **/
public class HDFSApp {

    // HDFS file system server address and port
    public static final String HDFS_PATH = "hdfs://192.168.77.130:8020";

    // FileSystem object used to operate on the HDFS file system
    FileSystem fileSystem = null;

    // Configuration object
    Configuration configuration = null;

    /**
     * Create an HDFS directory
     */
    @Test
    public void mkdir() throws Exception {
        // mkdirs takes a Path object
        fileSystem.mkdirs(new Path("/hdfsapi/test"));
    }

    // Prepare resources
    @Before
    public void setUp() throws Exception {
        configuration = new Configuration();
        // The first parameter is the server URI, the second is the configuration object,
        // and the third is the file system user name
        fileSystem = FileSystem.get(new URI(HDFS_PATH), configuration, "root");
        System.out.println("HDFSApp.setUp");
    }

    // Release resources
    @After
    public void tearDown() throws Exception {
        configuration = null;
        fileSystem = null;
        System.out.println("HDFSApp.tearDown");
    }
}

Operation Result:

You can see that it ran successfully. Now go to the server and check whether the directory we created exists:

[[email protected] ~]# hdfs dfs -ls /
Found 3 items
-rw-r--r--   1 root supergroup  311585484 2018-03-24 23:15 /hadoop-2.6.0-cdh5.7.0.tar.gz
drwxr-xr-x   - root supergroup          0 2018-03-25 22:17 /hdfsapi
-rw-r--r--   1 root supergroup         49 2018-03-24 23:10 /hello.txt
[[email protected] ~]# hdfs dfs -ls /hdfsapi
Found 1 items
drwxr-xr-x   - root supergroup          

As shown above, the directory was created successfully.

Let's add a method that tests creating a file and writing some content into it:

/**
 * Create a file
 */
@Test
public void create() throws Exception {
    // Create the file
    FSDataOutputStream outputStream = fileSystem.create(new Path("/hdfsapi/test/a.txt"));
    // Write some content into the file
    outputStream.write("hello hadoop".getBytes());
    outputStream.flush();
    outputStream.close();
}

After it executes successfully, go to the server again to check whether the file we created exists and whether its content is what we wrote:

[[email protected] ~]# hdfs dfs -ls /hdfsapi/test
Found 1 items
-rw-r--r--   3 root supergroup         

Going to the server after every operation is tedious. We can also read the content of a file in the file system directly from code, as in the following example:

/**
 * View the content of a file in HDFS
 */
@Test
public void cat() throws Exception {
    // Open the file
    FSDataInputStream in = fileSystem.open(new Path("/hdfsapi/test/a.txt"));
    // Copy the file content to the console; the third parameter is the buffer size in bytes
    IOUtils.copyBytes(in, System.out, 1024);
    in.close();
}

Now that we know how to create a directory, create a file, and read a file's content, we may also want to know how to rename a file, as in the following example:

/**
 * Rename a file
 */
@Test
public void rename() throws Exception {
    Path oldPath = new Path("/hdfsapi/test/a.txt");
    Path newPath = new Path("/hdfsapi/test/b.txt");
    // The first parameter is the original path, the second is the new path
    fileSystem.rename(oldPath, newPath);
}

We already know how to create, read, and update; the last operation is delete, as in the following example:

/**
 * Delete a file
 * @throws Exception
 */
@Test
public void delete() throws Exception {
    // The second parameter specifies whether to delete recursively: false = no, true = yes
    fileSystem.delete(new Path("/hdfsapi/test/mysql_cluster.iso"), false);
}

With create, delete, read, and rename covered, let's see how to upload a local file to the HDFS file system. I have a local.txt file here with the following content:

This is a local file

Write the test code as follows:

/**
 * Upload a local file to HDFS
 */
@Test
public void copyFromLocalFile() throws Exception {
    Path localPath = new Path("E:/local.txt");
    Path hdfsPath = new Path("/hdfsapi/test/");
    // The first parameter is the local file path, the second is the HDFS path
    fileSystem.copyFromLocalFile(localPath, hdfsPath);
}

After the above method executes successfully, check in HDFS whether the copy succeeded:

[[email protected] ~]# hdfs dfs -ls /hdfsapi/test/
Found 2 items
-rw-r--r--   3 root supergroup         12 2018-03-25 22:33 /hdfsapi/test/b.txt
-rw-r--r--   3 root supergroup         

The above demonstrates uploading a small file. If you need to upload a larger file and want a progress indicator, use the following approach:

/**
 * Upload a large local file to HDFS and display a progress indicator
 */
@Test
public void copyFromLocalFileWithProgress() throws Exception {
    InputStream in = new BufferedInputStream(new FileInputStream(new File("E:/Linux Install/mysql_cluster.iso")));
    FSDataOutputStream outputStream = fileSystem.create(new Path("/hdfsapi/test/mysql_cluster.iso"), new Progressable() {
        public void progress() {
            // Print progress output
            System.out.print(".");
        }
    });
    IOUtils.copyBytes(in, outputStream, 4096);
    in.close();
    outputStream.close();
}

Similarly, after the above method executes successfully, check in HDFS whether the upload succeeded:

[[email protected] ~]# hdfs dfs -ls -h /hdfsapi/test/
Found 3 items
-rw-r--r--   3 root supergroup         12 2018-03-25 22:33 /hdfsapi/test/b.txt
-rw-r--r--   3 root supergroup         20 2018-03-25 22:45 /hdfsapi/test/local.txt
-rw-r--r--   3 root supergroup    812.8 M 2018-03-25 23:01 /hdfsapi/test/mysql_cluster.iso
[[email protected] ~]#

Where there is upload there is naturally download, and just as there are two ways to upload a file, there are two ways to download one, as in the following example:

/**
 * Download an HDFS file, method 1
 */
@Test
public void copyToLocalFile1() throws Exception {
    Path localPath = new Path("E:/b.txt");
    Path hdfsPath = new Path("/hdfsapi/test/b.txt");
    fileSystem.copyToLocalFile(hdfsPath, localPath);
}

/**
 * Download an HDFS file, method 2
 */
@Test
public void copyToLocalFile2() throws Exception {
    FSDataInputStream in = fileSystem.open(new Path("/hdfsapi/test/b.txt"));
    OutputStream outputStream = new FileOutputStream(new File("E:/b.txt"));
    IOUtils.copyBytes(in, outputStream, 1024);
    in.close();
    outputStream.close();
}
    • Note: the first download method above may throw a NullPointerException on Windows (typically because the native Hadoop libraries/winutils are not set up); the second approach is recommended on Windows. A sketched alternative follows below.
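(The following workaround is not from the original article. It is a commonly used alternative on Windows, sketched under the assumption that the NullPointerException comes from the checksum/native-library code path of the local file system: the copyToLocalFile overload with useRawLocalFileSystem set to true writes directly to the raw local file system.)

/**
 * Download an HDFS file on Windows (sketch; method name is hypothetical)
 */
@Test
public void copyToLocalFileOnWindows() throws Exception {
    Path localPath = new Path("E:/b.txt");
    Path hdfsPath = new Path("/hdfsapi/test/b.txt");
    // delSrc = false: keep the source file in HDFS
    // useRawLocalFileSystem = true: skip the local checksum file and the
    // native-code path that commonly fails on Windows without winutils
    fileSystem.copyToLocalFile(false, hdfsPath, localPath, true);
}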

Finally, let's see how to list all the files in a directory, for example:

/**
 * List all files in a directory
 *
 * @throws Exception
 */
@Test
public void listFiles() throws Exception {
    FileStatus[] fileStatuses = fileSystem.listStatus(new Path("/hdfsapi/test/"));
    for (FileStatus fileStatus : fileStatuses) {
        System.out.println("This is a: " + (fileStatus.isDirectory() ? "directory" : "file"));
        System.out.println("Replication factor: " + fileStatus.getReplication());
        System.out.println("Size: " + fileStatus.getLen());
        System.out.println("Path: " + fileStatus.getPath() + "\n");
    }
}

The console prints the following results:

This is a: file
Replication factor: 3
Size: 12
Path: hdfs://192.168.77.130:8020/hdfsapi/test/b.txt

This is a: file
Replication factor: 3
Size: 20
Path: hdfs://192.168.77.130:8020/hdfsapi/test/local.txt

This is a: file
Replication factor: 3
Size: 852279296
Path: hdfs://192.168.77.130:8020/hdfsapi/test/mysql_cluster.iso

Notice a problem in the console output: we set a replication factor of 1 in hdfs-site.xml, so why do the listed files show a replication factor of 3?

This is because these files were uploaded from the local machine through the Java API, and we did not set a replication factor on the client side, so Hadoop's default client-side replication factor of 3 was used.

If we had put the files on the server with the hdfs command, they would use the replication factor set in the configuration file. If you do not believe it, change the path in the code to the root directory; the console output then looks like this:

This is a: file
Replication factor: 1
Size: 311585484
Path: hdfs://192.168.77.130:8020/hadoop-2.6.0-cdh5.7.0.tar.gz

This is a: directory
Replication factor: 0
Size: 0
Path: hdfs://192.168.77.130:8020/hdfsapi

This is a: file
Replication factor: 1
Size: 49
Path: hdfs://192.168.77.130:8020/hello.txt

The files in the root directory were put there earlier with the hdfs command, so their replication factor is the one we set in the configuration file.
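(If you want uploads made through the Java API to use a replication factor of 1 as well, a minimal sketch, assuming the standard dfs.replication client property, is to set it on the Configuration object before obtaining the FileSystem, for example in the setUp() method shown earlier.)

// Prepare resources
@Before
public void setUp() throws Exception {
    configuration = new Configuration();
    // Client-side setting: uploads made through this FileSystem use 1 replica
    // instead of the client default of 3 (assumes the standard dfs.replication property)
    configuration.set("dfs.replication", "1");
    fileSystem = FileSystem.get(new URI(HDFS_PATH), configuration, "root");
    System.out.println("HDFSApp.setUp");
}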

HDFS Write Data Flow

For the HDFS write data flow, I found online a comic, author unknown, that explains the principle of HDFS in a very simple and understandable way. It is much easier to follow than typical slides and is a rare piece of study material, so it is excerpted here.

1. Three components: the client, the NameNode (think of it as the master controller and file index, similar to a Linux inode), and the DataNode (the storage server that holds the actual data).

2. HDFS write data process:


HDFS Read Data Flow

3. HDFS read data process:

4. Fault tolerance, part one: fault types and how they are detected (node/server faults, network faults, and dirty data).

5. Fault tolerance, part two: read/write fault tolerance.

6. Fault tolerance, part three: DataNode failure.

7. Replication rules

8. Concluding remarks

There is also a Chinese version of this comic at the following address:

https://www.cnblogs.com/raphael5200/p/5497218.html

Advantages and Disadvantages of the HDFS File System

HDFS Benefits:

    • Data redundancy (multi-copy storage) and hardware fault tolerance
    • Streaming data access: write once, read many times
    • Well suited to storing large files
    • Can be built on inexpensive machines, saving costs

HDFS Disadvantages:

    • Not suitable for low-latency data access
    • Cannot efficiently store large numbers of small files
      • Even a file of only 1 MB has its own metadata, so a large number of small files means a correspondingly large amount of metadata taking up storage space. As a rough illustration, at roughly 150 bytes of NameNode memory per namespace object, tens of millions of small files consume gigabytes of NameNode heap, which puts heavy pressure on the NameNode.
