Sharing of third-party configuration files for MapReduce jobs


Sharing third-party configuration files with a MapReduce job is really a question of passing parameters to the job; in other words, it is an application of Hadoop's DistributedCache.

Configuration is the most common way to pass parameters in MapReduce. It stores the required parameters as key-value pairs, where both key and value are strings: call Configuration's set method to save a value, and call its get method wherever the value is needed.
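As a minimal sketch of this pattern (assuming the standard `org.apache.hadoop.conf.Configuration` API; the key name and path here are hypothetical), the driver sets the parameter and the task reads it back:

```java
import org.apache.hadoop.conf.Configuration;

public class ConfParamExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Driver side: save the parameter as a string key-value pair
        // before submitting the job.
        conf.set("my.dict.path", "/data/dict.txt");

        // Task side: normally obtained via context.getConfiguration()
        // in a Mapper's or Reducer's setup() method.
        String dictPath = conf.get("my.dict.path");
        System.out.println(dictPath);
    }
}
```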

That covers the basic case, but special situations inevitably come up in practice. For example, how do you pass an object-typed parameter? And when your MapReduce job depends on a third-party jar package that needs to read some configuration files locally, how do you deliver those files to every node in the cluster?

For an object-typed parameter, you can override the object's toString() method to encode all of its fields in a string, transmit that string with Configuration.set(name, value), and parse the string back after calling get. This approach easily loses precision and wastes space. For example, converting a double to a string not only risks losing precision, it also turns an 8-byte value into a string that may take dozens of bytes. It is also inflexible: if the structure of the object is later modified, the hand-written parsing code is a likely source of bugs.
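To make the space overhead concrete, here is a small plain-Java check (not from the original article) comparing the fixed 8-byte binary size of a double with the size of its string representation:

```java
public class DoubleSizeDemo {
    public static void main(String[] args) {
        double d = Math.PI;
        String s = Double.toString(d);           // "3.141592653589793"
        int binaryBytes = Double.BYTES;          // always 8
        int stringBytes = s.getBytes().length;   // 17 here -- more than double
        System.out.println(binaryBytes + " vs " + stringBytes);
    }
}
```

For less compact values (say, a double with a long exponent, or a list of doubles joined with separators), the string form grows even further.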

A nicer method is to use DefaultStringifier from the Hadoop API. This class provides two methods, store and load, for setting and retrieving a value. Usage:

DefaultStringifier.store(conf, obj, "keyname");

This serializes the object obj and stores it in conf under the specified key.

obj = DefaultStringifier.load(conf, "keyname", variableClass);

Here conf is the MapReduce job's current Configuration, obj is the object being stored, keyname is the identifier of obj in conf, and variableClass is the class to which the loaded value is converted.

Note that obj must implement the Writable interface so that it can be serialized. You can either implement Writable on the object itself or convert the object to the BytesWritable type; in the latter case the object must be converted back when it is retrieved from conf. The conversion method can be written like this:

private static BytesWritable transfer(Object patterns) {
    ByteArrayOutputStream baos = null;
    ObjectOutputStream oos = null;
    try {
        baos = new ByteArrayOutputStream();
        oos = new ObjectOutputStream(baos);
        oos.writeObject(patterns);
        oos.flush();
        return new BytesWritable(baos.toByteArray());
    } catch (Exception e) {
        logger.error("", e);
    } finally {
        IoUtils.close(oos);
        IoUtils.close(baos);
    }
    return null;
}

The reverse method is:

private static Object transferMRC(byte[] bytes) {
    ObjectInputStream is = null;
    try {
        is = new ObjectInputStream(new ByteArrayInputStream(bytes));
        return is.readObject();
    } catch (Exception e) {
        logger.error("", e);
    } finally {
        IoUtils.close(is);
    }
    return null;
}

But what about even larger parameters, such as the corpus used by a word segmenter? For these, Hadoop's caching mechanism, DistributedCache, should be used.

DistributedCache is a mechanism provided by the Hadoop framework. Before a job runs, it distributes the files specified by the job to the machines where the tasks will execute, and it provides facilities for managing those cached files.
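As a sketch of typical usage (assuming the newer Job.addCacheFile API that replaced the DistributedCache class in Hadoop 2; the HDFS path and link name here are hypothetical):

```java
// Driver side: register the shared file with the job before submission.
// The "#corpus" fragment asks the framework to expose the file as a
// link named "corpus" in each task's working directory.
Job job = Job.getInstance(conf, "word-split");
job.addCacheFile(new URI("hdfs:///shared/corpus.txt#corpus"));

// Task side, e.g. in Mapper.setup(): the cached file is available
// locally under the link name, so the third-party jar can read it
// as an ordinary local file.
File corpus = new File("corpus");
```

The framework copies the file to each node once per job rather than once per task, which is what makes this suitable for large read-only resources such as corpora or dictionaries.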
