Running R jobs quickly on many machines

As we demonstrated in 'A gentle introduction to parallel computing in R', one of the great things about R is how easy it is to take advantage of parallel processing capabilities to speed up calculation. In this note we'll show how to move from running jobs on multiple CPUs/cores to running jobs on multiple machines (for even larger scaling and greater speedup). Using the technique on Amazon EC2 even turns your credit card into a supercomputer.



Colossus supercomputer: The Forbin Project

R itself is not a language designed for parallel computing. It doesn't have a lot of great user-exposed parallel constructs. What saves us is that the data science tasks we tend to use R for are themselves very well suited to parallel programming, and many people have written very good pragmatic libraries to exploit this. There are three main ways for a user to benefit from library-supplied parallelism:

    • Link against superior parallel libraries such as the Intel BLAS library (supplied on Linux, OS X, and Windows as part of the Microsoft R Open distribution of R). This replaces libraries you are already using with parallel ones, and you get a speedup for free (on appropriate tasks, such as the linear algebra portions of lm()/glm()).
    • Ship your modeling tasks out of R to an external parallel system for processing. This is the strategy of systems such as the rx methods from RevoScaleR (now Microsoft R), the h2o methods from h2o.ai, or RHadoop.
    • Use R's parallel facility to ship jobs to cooperating R instances. This is the strategy used in 'A gentle introduction to parallel computing in R' and by many libraries that sit on top of parallel. This is essentially implementing remote procedure call through sockets or networking.

We are going to write more about the third technique.

The third technique is essentially very coarse-grained remote procedure call. It depends on shipping copies of code and data to remote processes and then returning results. It is ill-suited for very small tasks, but very well suited for a reasonable number of moderate to large tasks. This is the strategy used by R's parallel library and Python's multiprocessing library (though with Python's multiprocessing you pretty much need to bring in additional libraries to move from single-machine to cluster computing).
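On a single machine the coarse-grained pattern looks like the following sketch (a toy squaring task of our own invention, not from the original article; parallel ships with base R, so this runs locally with no ssh setup):

```r
library(parallel)

# start a local PSOCK cluster of cooperating R instances
parallelCluster <- makeCluster(2)

# ship a task to the workers and collect the results
squares <- parLapply(parallelCluster, 1:4, function(x) x^2)
print(unlist(squares))   # 1 4 9 16

# always release the worker processes when done
stopCluster(parallelCluster)
```

Each worker receives a copy of the function and its piece of the data, computes, and ships the result back, which is exactly the pattern that scales out to remote machines below.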

This method may seem less efficient and less sophisticated than shared-memory methods, but relying on object transmission means it is, in principle, very easy to extend the technique from one machine to many machines (also called "cluster computing"). This is what we'll demonstrate the R portion of here (in moving from one machine to a cluster we necessarily bring in a lot of systems/networking/security issues, which we'll have to defer on).

Here is the complete R portion of the lesson. This assumes you already understand how to configure ssh, or have a systems person who can help you with the ssh system steps.

Take the examples from 'A gentle introduction to parallel computing in R' and, instead of starting your parallel cluster with the command "parallelCluster <- parallel::makeCluster(parallel::detectCores())":

Do the following:

Collect a list of addresses of machines you can ssh to. This is the hard part, depends on your operating system, and is something you should get help with if you have not tried it before. In this case I am using IPv4 addresses, but when using Amazon EC2 I use hostnames.

In my case my list is:

    • My machine (primary): "192.168.1.235", user "johnmount"
    • Another Win-Vector LLC machine: "192.168.1.70", user "johnmount"

Notice we are not collecting passwords, as we are assuming we have set up proper "authorized_keys" and keypairs in the ".ssh" configurations of all of these machines. We will call the machine we are using to issue the overall computation "primary."

It is vital that you try all of these addresses with ssh in a terminal shell before trying them with R. Also, the machine address you choose as "primary" must be an address the worker machines can use to reach back to the primary machine (so you can't use "localhost", or the address of a machine unreachable from the workers, as primary). Try ssh by hand back and forth from the primary to all of these machines, and from all of these machines back to your primary, before trying to use ssh with R.

Now, with the system stuff behind us, the R part is as follows. Start your cluster with:

```r
primary <- '192.168.1.235'
machineAddresses <- list(
  list(host = primary, user = 'johnmount',
       ncore = 4),
  list(host = '192.168.1.70', user = 'johnmount',
       ncore = 4))

spec <- lapply(machineAddresses,
               function(machine) {
                 rep(list(list(host = machine$host,
                               user = machine$user)),
                     machine$ncore)
               })
spec <- unlist(spec, recursive = FALSE)

parallelCluster <- parallel::makeCluster(type = 'PSOCK',
                                         master = primary,
                                         spec = spec)
print(parallelCluster)
## socket cluster with 8 nodes on hosts
##   '192.168.1.235', '192.168.1.70'
```
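One practical point worth adding (our own note, not from the original article): remote workers start as fresh R sessions, so before running your real job you typically have to ship them the data and code they need, for example with clusterExport. A minimal sketch, using a local two-worker cluster as a stand-in for the remote spec (the variable names here are illustrative):

```r
library(parallel)

# stand-in: a local two-worker cluster (substitute your remote spec here)
parallelCluster <- makeCluster(2)

# workers are fresh R sessions: copy over the objects they need
bigConstant <- 10
clusterExport(parallelCluster, "bigConstant")

# the workers can now see the exported object
res <- parLapply(parallelCluster, 1:3, function(x) x * bigConstant)
print(unlist(res))   # 10 20 30

stopCluster(parallelCluster)
```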

And that's it. You can now run your job on many cores on many machines. For the right tasks this represents a substantial speedup. As always, separate your concerns when starting: first get a trivial "hello world" task to work on your cluster, then get a smaller version of your computation to work on a local machine, and only after these throw your real work at the cluster.
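A trivial "hello world" along those lines might look like the sketch below (our own illustration; shown with a local two-worker stand-in cluster, so in real use you would substitute the remote spec built above and expect one line per worker host):

```r
library(parallel)

# stand-in: a local two-worker cluster (substitute your remote spec here)
parallelCluster <- makeCluster(2)

# trivial "hello world": ask each worker which host it is running on
hello <- clusterCall(parallelCluster,
                     function() paste("hello from", Sys.info()[["nodename"]]))
print(unlist(hello))

stopCluster(parallelCluster)
```

If this round-trips cleanly on every host, the cluster wiring (addresses, users, keys) is working and you can move on to real tasks.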

As we have mentioned before, with some more system work you can spin up transient Amazon EC2 instances to join your computation. At this point your credit card becomes a supercomputer (though you do have to remember to shut the instances down to prevent extra expenses!).

Transferred from: http://www.win-vector.com/blog/2016/01/running-r-jobs-quickly-on-many-machines/
