Sharing of third-party configuration files for MapReduce jobs


Sharing third-party configuration files with a MapReduce job is really a question of passing parameters to the job; in other words, it is an application of Hadoop's DistributedCache.

Configuration is the most common way to pass parameters in MapReduce. It stores the required parameters as key-value pairs, where both key and value are strings: call Configuration's set method to save a value, and call its get method wherever the value is needed.
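As a minimal sketch of this pattern (assuming the standard `org.apache.hadoop.conf.Configuration` API; the key name and path here are hypothetical), the driver sets the parameter and the task reads it back:

```java
import org.apache.hadoop.conf.Configuration;

public class ConfParamExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Driver side: save the parameter as a string key-value pair
        // before submitting the job.
        conf.set("my.dict.path", "/data/dict.txt");

        // Task side: normally obtained via context.getConfiguration()
        // in a Mapper's or Reducer's setup() method.
        String dictPath = conf.get("my.dict.path");
        System.out.println(dictPath);
    }
}
```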

That covers the basic case, but special situations inevitably come up in practice. For example, how do you pass an object-typed parameter? And when your MapReduce job depends on a third-party jar package that needs to read some configuration files locally, how do you deliver those files to every node in the cluster?

For an object-typed parameter, you can override the object's toString() method to encode all of its fields in a string, transmit that string with Configuration.set(name, value), and parse the string back after calling get. This approach easily loses precision and wastes space. For example, converting a double to a string not only risks losing precision, it also turns an 8-byte value into a string that may take dozens of bytes. It is also inflexible: if the structure of the object is later modified, the hand-written parsing code is a likely source of bugs.
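To make the space overhead concrete, here is a small plain-Java check (not from the original article) comparing the fixed 8-byte binary size of a double with the size of its string representation:

```java
public class DoubleSizeDemo {
    public static void main(String[] args) {
        double d = Math.PI;
        String s = Double.toString(d);           // "3.141592653589793"
        int binaryBytes = Double.BYTES;          // always 8
        int stringBytes = s.getBytes().length;   // 17 here -- more than double
        System.out.println(binaryBytes + " vs " + stringBytes);
    }
}
```

For less compact values (say, a double with a long exponent, or a list of doubles joined with separators), the string form grows even further.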

A nicer method is to use DefaultStringifier from the Hadoop API. This class provides two methods, store and load, for setting and retrieving a value. Usage:

DefaultStringifier.store(conf, obj, "keyname");

This serializes the object obj and stores it in conf under the specified key.

obj = DefaultStringifier.load(conf, "keyname", variableClass);

Here conf is the MapReduce job's current Configuration, obj is the object being stored, keyname is the identifier of obj in conf, and variableClass is the class to which the loaded value is converted.

Note that obj must implement the Writable interface so that it can be serialized. You can either implement Writable on the object itself or convert the object to the BytesWritable type; in the latter case the object must be converted back when it is retrieved from conf. The conversion method can be written like this:

private static BytesWritable transfer(Object patterns) {
    ByteArrayOutputStream baos = null;
    ObjectOutputStream oos = null;
    try {
        baos = new ByteArrayOutputStream();
        oos = new ObjectOutputStream(baos);
        oos.writeObject(patterns);
        oos.flush();
        return new BytesWritable(baos.toByteArray());
    } catch (Exception e) {
        logger.error("", e);
    } finally {
        IoUtils.close(oos);
        IoUtils.close(baos);
    }
    return null;
}

The reverse method is:

private static Object transferMRC(byte[] bytes) {
    ObjectInputStream is = null;
    try {
        is = new ObjectInputStream(new ByteArrayInputStream(bytes));
        return is.readObject();
    } catch (Exception e) {
        logger.error("", e);
    } finally {
        IoUtils.close(is);
    }
    return null;
}

But what about even larger parameters, such as the corpus used by a word segmenter? For these, Hadoop's caching mechanism, DistributedCache, should be used.

DistributedCache is a mechanism provided by the Hadoop framework. Before a job runs, it distributes the files specified by the job to the machines where the tasks will execute, and it provides facilities for managing those cached files.
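As a sketch of typical usage (assuming the newer Job.addCacheFile API that replaced the DistributedCache class in Hadoop 2; the HDFS path and link name here are hypothetical):

```java
// Driver side: register the shared file with the job before submission.
// The "#corpus" fragment asks the framework to expose the file as a
// link named "corpus" in each task's working directory.
Job job = Job.getInstance(conf, "word-split");
job.addCacheFile(new URI("hdfs:///shared/corpus.txt#corpus"));

// Task side, e.g. in Mapper.setup(): the cached file is available
// locally under the link name, so the third-party jar can read it
// as an ordinary local file.
File corpus = new File("corpus");
```

The framework copies the file to each node once per job rather than once per task, which is what makes this suitable for large read-only resources such as corpora or dictionaries.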
