How Hadoop handles massive small images

Source: Internet
Author: User
Tags: file size, memory usage

1. Method principle:

Based on the basic principles of the HBase storage system, this paper presents an effective solution built on a "status flag bit" for HDFS MapFile files, which do not fully support append operations. The approach solves both the small-file storage problem in HDFS and the problem of modifying a MapFile in place.

2. Method Description:

Against a background of massive numbers of pictures, the way images are stored is an important factor in overall system performance. HDFS has a well-known problem with small files: reading many small files causes a large number of seeks and repeated hops from DataNode to DataNode, which is a very inefficient access pattern. Files that are much smaller than the HDFS block size therefore need to be preprocessed before being stored in HDFS. Almost all images are far smaller than 64 MB (the default HDFS block size), so these large numbers of small images must be packaged into some form of container. Hadoop offers several options, the main ones being HAR files (Hadoop Archives), SequenceFile, and MapFile; this system uses MapFile as the container for small files.

At the same time, packaging every picture smaller than 64 MB would add the overhead of the packaging process itself, so a threshold is needed: files below the threshold are packaged, while files above it are uploaded directly through the NameNode. This system sets the threshold to 2 MB. In addition, although recent versions of Hadoop support file append operations, MapFile does not fully support them, so with the naive approach every upload would rewrite the original MapFile, which is inefficient. The system therefore uses a "flag bit" method to handle deletion and modification of the small files packaged into a MapFile, preserving image storage and access efficiency.
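As a rough illustration of the packaging rule just described (a minimal sketch, not the system's actual code: the 2 MB threshold comes from the text, while the local and HDFS paths, key format, and class name are assumptions), small images below the threshold can be appended to a MapFile keyed by file name, while larger ones are copied to HDFS directly:

import java.io.File;
import java.io.FileInputStream;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class ImagePacker {

    private static final long THRESHOLD = 2L * 1024 * 1024; // 2 MB threshold from the text

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        File[] images = new File("/tmp/images").listFiles();   // assumed local staging directory
        Arrays.sort(images);                                    // MapFile keys must be appended in sorted order

        // Small images go into one MapFile container; large ones are uploaded directly.
        MapFile.Writer writer =
                new MapFile.Writer(conf, fs, "/images/packed.map", Text.class, BytesWritable.class);
        for (File img : images) {
            if (img.length() < THRESHOLD) {
                byte[] bytes = new byte[(int) img.length()];
                FileInputStream in = new FileInputStream(img);
                IOUtils.readFully(in, bytes, 0, bytes.length);  // read the whole image into memory
                in.close();
                writer.append(new Text(img.getName()), new BytesWritable(bytes));
            } else {
                // Above the threshold: upload the file to HDFS as-is.
                fs.copyFromLocalFile(new Path(img.getPath()), new Path("/images/large/" + img.getName()));
            }
        }
        writer.close();
        fs.close();
    }
}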

3. Specific implementation:

The basic operations on a picture are adding, deleting, modifying, and querying. Because pictures are stored in the special environment of HDFS, adding and deleting them requires special treatment. Since MapFile does not support append writes, overwriting and rewriting the original MapFile on every operation would be inefficient. To implement the required functions, the system adds a status flag bit to the picture metadata stored in HBase; its possible values are "HdfsLargeFile", "HdfsMapFile", "LocalSmallFile", and "Deleted". Each upload operation checks the file size, performs the corresponding processing, and updates the flag. Additions to a MapFile are supported through a write cache queue: after each user upload, the picture is written to a local queue and its flag is set to "LocalSmallFile"; when the queue reaches a specified threshold, a thread is started to package the queued pictures, and the flag is updated to "HdfsMapFile".
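The following is a hedged sketch of the write cache queue and status flag handling described above, using the classic HBase client API. The table name "image_meta", the column family and qualifier, and the flush threshold are assumptions; only the four flag values and the overall flow come from the text, and for brevity the packaging step runs inline here rather than in a separate thread:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ImageUploadQueue {

    private static final int FLUSH_THRESHOLD = 100;                 // assumed queue size that triggers packaging
    private final List<String> pending = new ArrayList<String>();   // names of images cached locally
    private final HTable metaTable;

    public ImageUploadQueue() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        metaTable = new HTable(conf, "image_meta");                  // assumed metadata table
    }

    // Called after each user upload: cache the image locally and mark it as LocalSmallFile.
    public synchronized void add(String imageName) throws Exception {
        pending.add(imageName);
        setStatus(imageName, "LocalSmallFile");
        if (pending.size() >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    // Package the cached images into a MapFile (omitted here; see the packing sketch above)
    // and mark each of them as HdfsMapFile.
    private void flush() throws Exception {
        for (String imageName : pending) {
            setStatus(imageName, "HdfsMapFile");
        }
        pending.clear();
    }

    // Update the status flag bit stored with the picture metadata in HBase.
    private void setStatus(String imageName, String status) throws Exception {
        Put put = new Put(Bytes.toBytes(imageName));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("status"), Bytes.toBytes(status));
        metaTable.put(put);
    }
}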


4. Code implementation

When storing files on HDFS, a large number of small files is very expensive in NameNode memory, because every file is assigned its own metadata entry (commonly estimated at roughly 150 bytes per file system object), and the NameNode must load all of this metadata at startup. The more files there are, the higher the NameNode's cost.
If we compress the small files before uploading them to HDFS, only one file's worth of metadata is needed, which greatly reduces the NameNode's memory usage. For MapReduce computation, Hadoop provides the following built-in compression formats: DEFLATE, gzip, bzip2, and LZO.
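As one possible way to carry out this step, the sketch below concatenates a directory of small local files into a single gzip-compressed file on HDFS using Hadoop's GzipCodec. This is not the article's own code: the local and HDFS paths are assumptions, and plain concatenation is only suitable for line-oriented text input.

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SmallFilesGzipUploader {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // One compressed target file on HDFS instead of thousands of small entries in the NameNode.
        Path target = new Path("/data/smallfiles/merged.gz");      // assumed HDFS path
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        OutputStream out = codec.createOutputStream(fs.create(target));

        File localDir = new File("/tmp/smallfiles");               // assumed local directory of small files
        for (File f : localDir.listFiles()) {
            InputStream in = new FileInputStream(f);
            IOUtils.copyBytes(in, out, 4096, false);               // append each small file's bytes to the gzip stream
            in.close();
        }
        out.close();
        fs.close();
    }
}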

When a compressed file is used as MapReduce input, the extra cost is the time spent decompressing it, and that trade-off has to be weighed in each application scenario. For a scenario with a large number of small files, however, compressing them also changes the data-locality characteristics.
If hundreds of thousands of small files compress down to a single block, that block lives on one DataNode; the computation receives a single InputSplit, there is no network transfer between nodes, and the operation is local. If the small files were instead uploaded to HDFS directly, hundreds of small blocks would be spread across different DataNodes, and the data might have to be "moved" before it could be processed. With only a few files, you may not notice the network transfer overhead (beyond the NameNode memory cost), but it becomes very obvious once the number of small files grows large.
Below, we compress the small files in gzip format, upload them to HDFS, and implement a MapReduce program to process them. A single class implements both the map task and the reduce task; the code looks like this:

package org.shirdrn.kodz.inaction.hadoop.smallfiles.compression;

import java.io.IOException;
import java.util.Iterator;

import org.a
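The listing above breaks off after the first imports. The following is a hedged reconstruction sketch, assuming the job is a simple word count over the gzip-compressed text input (the actual task in the source listing is unknown) and using the classic org.apache.hadoop.mapred API. Hadoop recognizes the .gz extension and decompresses the input transparently; because gzip is not splittable, each compressed file is handled by a single map task, which matches the locality argument made earlier.

package org.shirdrn.kodz.inaction.hadoop.smallfiles.compression;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class GzipFilesWordCount {

    // Map task: split each decompressed line into words and emit (word, 1).
    public static class WordMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            for (String word : value.toString().split("\\s+")) {
                if (word.length() > 0) {
                    output.collect(new Text(word), ONE);
                }
            }
        }
    }

    // Reduce task: sum the counts for each word.
    public static class SumReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(GzipFilesWordCount.class);
        conf.setJobName("gzip small files word count");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WordMapper.class);
        conf.setReducerClass(SumReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g. the .gz file uploaded earlier
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}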
