How Hadoop handles massive small images

Source: Internet
Author: User
Tags: file size, memory usage

1. Method principle:

Based on the basic principles of the HBase storage system, this paper presents an effective solution built on a "status flag bit" for HDFS MapFile files, which do not fully support append operations. The approach solves both the small-file storage problem in HDFS and the problem of modifying a MapFile in place.

2. Method Description:

Against a background of massive numbers of pictures, the way images are stored is an important factor in overall system performance. HDFS has a well-known problem with small files: reading many small files causes a large number of seeks and repeated hops from DataNode to DataNode, which is a very inefficient access pattern. Files that are much smaller than the HDFS block size therefore need to be preprocessed before being stored in HDFS. Almost all images are far smaller than 64 MB (the default HDFS block size), so these large numbers of small images must be packaged into some form of container. Hadoop offers several options, the main ones being HAR files (Hadoop Archives), SequenceFile, and MapFile; this system uses MapFile as the container for small files.

At the same time, packaging every picture smaller than 64 MB would add the overhead of the packaging process itself, so a threshold is needed: files below the threshold are packaged, while files above it are uploaded directly through the NameNode. This system sets the threshold to 2 MB. In addition, although recent versions of Hadoop support file append operations, MapFile does not fully support them, so with the naive approach every upload would rewrite the original MapFile, which is inefficient. The system therefore uses a "flag bit" method to handle deletion and modification of the small files packaged into a MapFile, preserving image storage and access efficiency.
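As a rough illustration of the packaging rule just described (a minimal sketch, not the system's actual code: the 2 MB threshold comes from the text, while the local and HDFS paths, key format, and class name are assumptions), small images below the threshold can be appended to a MapFile keyed by file name, while larger ones are copied to HDFS directly:

import java.io.File;
import java.io.FileInputStream;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class ImagePacker {

    private static final long THRESHOLD = 2L * 1024 * 1024; // 2 MB threshold from the text

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        File[] images = new File("/tmp/images").listFiles();   // assumed local staging directory
        Arrays.sort(images);                                    // MapFile keys must be appended in sorted order

        // Small images go into one MapFile container; large ones are uploaded directly.
        MapFile.Writer writer =
                new MapFile.Writer(conf, fs, "/images/packed.map", Text.class, BytesWritable.class);
        for (File img : images) {
            if (img.length() < THRESHOLD) {
                byte[] bytes = new byte[(int) img.length()];
                FileInputStream in = new FileInputStream(img);
                IOUtils.readFully(in, bytes, 0, bytes.length);  // read the whole image into memory
                in.close();
                writer.append(new Text(img.getName()), new BytesWritable(bytes));
            } else {
                // Above the threshold: upload the file to HDFS as-is.
                fs.copyFromLocalFile(new Path(img.getPath()), new Path("/images/large/" + img.getName()));
            }
        }
        writer.close();
        fs.close();
    }
}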

3. Specific implementation:

The basic operations on a picture are adding, deleting, modifying, and querying. Because pictures are stored in the special environment of HDFS, adding and deleting them requires special treatment. Since MapFile does not support append writes, overwriting and rewriting the original MapFile on every operation would be inefficient. To implement the required functions, the system adds a status flag bit to the picture metadata stored in HBase; its possible values are "HdfsLargeFile", "HdfsMapFile", "LocalSmallFile", and "Deleted". Each upload operation checks the file size, performs the corresponding processing, and updates the flag. Additions to a MapFile are supported through a write cache queue: after each user upload, the picture is written to a local queue and its flag is set to "LocalSmallFile"; when the queue reaches a specified threshold, a thread is started to package the queued pictures, and the flag is updated to "HdfsMapFile".
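The following is a hedged sketch of the write cache queue and status flag handling described above, using the classic HBase client API. The table name "image_meta", the column family and qualifier, and the flush threshold are assumptions; only the four flag values and the overall flow come from the text, and for brevity the packaging step runs inline here rather than in a separate thread:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ImageUploadQueue {

    private static final int FLUSH_THRESHOLD = 100;                 // assumed queue size that triggers packaging
    private final List<String> pending = new ArrayList<String>();   // names of images cached locally
    private final HTable metaTable;

    public ImageUploadQueue() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        metaTable = new HTable(conf, "image_meta");                  // assumed metadata table
    }

    // Called after each user upload: cache the image locally and mark it as LocalSmallFile.
    public synchronized void add(String imageName) throws Exception {
        pending.add(imageName);
        setStatus(imageName, "LocalSmallFile");
        if (pending.size() >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    // Package the cached images into a MapFile (omitted here; see the packing sketch above)
    // and mark each of them as HdfsMapFile.
    private void flush() throws Exception {
        for (String imageName : pending) {
            setStatus(imageName, "HdfsMapFile");
        }
        pending.clear();
    }

    // Update the status flag bit stored with the picture metadata in HBase.
    private void setStatus(String imageName, String status) throws Exception {
        Put put = new Put(Bytes.toBytes(imageName));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("status"), Bytes.toBytes(status));
        metaTable.put(put);
    }
}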


4. Code implementation

When storing files on HDFS, a large number of small files is very expensive in NameNode memory, because every file is assigned its own metadata entry (commonly estimated at roughly 150 bytes per file system object), and the NameNode must load all of this metadata at startup. The more files there are, the higher the NameNode's cost.
If we compress the small files before uploading them to HDFS, only one file's worth of metadata is needed, which greatly reduces the NameNode's memory usage. For MapReduce computation, Hadoop provides the following built-in compression formats: DEFLATE, gzip, bzip2, and LZO.
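As one possible way to carry out this step, the sketch below concatenates a directory of small local files into a single gzip-compressed file on HDFS using Hadoop's GzipCodec. This is not the article's own code: the local and HDFS paths are assumptions, and plain concatenation is only suitable for line-oriented text input.

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SmallFilesGzipUploader {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // One compressed target file on HDFS instead of thousands of small entries in the NameNode.
        Path target = new Path("/data/smallfiles/merged.gz");      // assumed HDFS path
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        OutputStream out = codec.createOutputStream(fs.create(target));

        File localDir = new File("/tmp/smallfiles");               // assumed local directory of small files
        for (File f : localDir.listFiles()) {
            InputStream in = new FileInputStream(f);
            IOUtils.copyBytes(in, out, 4096, false);               // append each small file's bytes to the gzip stream
            in.close();
        }
        out.close();
        fs.close();
    }
}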

When a compressed file is used as MapReduce input, the extra cost is the time spent decompressing it, and that trade-off has to be weighed in each application scenario. For a scenario with a large number of small files, however, compressing them also changes the data-locality characteristics.
If hundreds of thousands of small files compress down to a single block, that block lives on one DataNode; the computation receives a single InputSplit, there is no network transfer between nodes, and the operation is local. If the small files were instead uploaded to HDFS directly, hundreds of small blocks would be spread across different DataNodes, and the data might have to be "moved" before it could be processed. With only a few files, you may not notice the network transfer overhead (beyond the NameNode memory cost), but it becomes very obvious once the number of small files grows large.
Below, we compress the small files in gzip format, upload them to HDFS, and implement a MapReduce program to process them. A single class implements both the map task and the reduce task; the code looks like this:

package org.shirdrn.kodz.inaction.hadoop.smallfiles.compression;

import java.io.IOException;
import java.util.Iterator;

import org.a
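The listing above breaks off after the first imports. The following is a hedged reconstruction sketch, assuming the job is a simple word count over the gzip-compressed text input (the actual task in the source listing is unknown) and using the classic org.apache.hadoop.mapred API. Hadoop recognizes the .gz extension and decompresses the input transparently; because gzip is not splittable, each compressed file is handled by a single map task, which matches the locality argument made earlier.

package org.shirdrn.kodz.inaction.hadoop.smallfiles.compression;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class GzipFilesWordCount {

    // Map task: split each decompressed line into words and emit (word, 1).
    public static class WordMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            for (String word : value.toString().split("\\s+")) {
                if (word.length() > 0) {
                    output.collect(new Text(word), ONE);
                }
            }
        }
    }

    // Reduce task: sum the counts for each word.
    public static class SumReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(GzipFilesWordCount.class);
        conf.setJobName("gzip small files word count");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WordMapper.class);
        conf.setReducerClass(SumReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g. the .gz file uploaded earlier
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}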
