R and Parallel Computing

Source: Internet
Author: User
Tags: Intel MKL

Article Summary

This article introduces the basic concepts of parallel computing and briefly explains the relationship between R and parallel computing. From the perspective of R users, the author then discusses the two modes of parallel computing, implicit and explicit, and gives corresponding examples. The implicit mode not only provides a simple and clear interface, but also hides the implementation details of parallel computing, so users can focus on the problem itself. The explicit mode is more flexible: users can choose the data decomposition, memory management, and task allocation that suit their actual problem. Finally, the author discusses the current challenges of parallelizing R and its future development.

R and Parallel Computing

Readers of the Capital of Statistics community are familiar with statistical software such as R, SAS, SPSS, and MATLAB, but concepts like parallel computing (also known as high-performance computing or distributed computing) may be somewhat unfamiliar. At the editors' invitation, this article introduces some basic concepts of parallel computing and their use in R.

What is parallel computing?

Parallel computing, strictly speaking, covers two aspects: high-performance computers and parallel software. For a long time, China's high-performance computers have been at the forefront of the world; in the latest TOP500 ranking of the world's supercomputers, China's Sunway TaihuLight is ranked first. But the applications of high-performance computers are relatively limited, mainly in defense fields such as the military and aerospace, and in scientific research. For most individuals and for small and medium-sized businesses, high-performance computers remain out of reach. In recent years, however, with the rapid development of personal PCs, cheap clusters, and various accelerator cards (NVIDIA GPU, Intel Xeon Phi, FPGA), an ordinary PC is now fully comparable to the high-performance computers of the past. Compared with the rapid development of hardware, the development of parallel software lags somewhat: think about it, how many of the programs you use actually support parallel execution? Parallelizing software requires substantial research and development investment, including parallelizing a large number of serial algorithms and existing codebases, a task often called code modernization. It sounds like a glamorous job, but in practice the large number of bug fixes, rewrites of underlying data structures, changes to software frameworks, and the nondeterminism and cross-platform issues introduced by parallel code greatly increase development and maintenance costs and operational risk, which makes the work less appealing in practice than it sounds.

Why does R need parallel computing?

So let us return to R itself. As one of the most popular statistical software packages, R has many advantages, such as rich statistical models, extensive data processing tools, and powerful visualization capabilities. But as data volumes grow, R's memory model and compute model limit its ability to process large-scale data. On the memory side, R uses an in-memory computation model: the data to be processed must first be loaded into main memory (RAM). The advantage is high computational efficiency and speed; the disadvantage is that the problem size it can handle is very limited (smaller than RAM). On the compute side, the R core (R Core) is a single-threaded program, so on modern multicore processors R cannot efficiently use all the compute cores. If you ran R on the Sunway TaihuLight's SW26010 CPU with its 260 compute cores, a single-threaded R program could use at most 1/260 of the computing power, wasting the other 259/260 of the compute cores.

How to break through these limits? Parallel computing!

Parallel computing technology exists precisely to solve the problem that single-machine memory capacity and single-core computing power cannot meet the demands of real applications. Parallel computing therefore greatly expands the scenarios in which R can be used. Recent versions of R ship the parallel package as part of the default installation, and the R Core development team attaches great importance to parallel computing.
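As a quick check, the following minimal sketch verifies that the parallel package is available out of the box and asks how many compute cores the machine reports:

library(parallel)            # bundled with R since version 2.14.0, no installation needed
detectCores()                # number of logical cores (hyper-threads) on this machine
detectCores(logical = FALSE) # physical cores only, discussed further below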

R users: How do I use parallel computing?

From the user's point of view, parallel computing in R can be divided into two modes: implicit and explicit. Below I use concrete examples to give a brief introduction to each.

Implicit Parallel Computing

Implicit parallel computing hides most of the details from the user: there is no need to know how the data is distributed, how the algorithm is implemented, or how the underlying hardware resources are allocated. The compute cores are started automatically based on the available hardware. Obviously, this mode is the most popular with users, since higher performance can be achieved without changing the original computation pattern or code at all. Common implicit parallel approaches include the following:

1. Using parallel computing libraries, such as OpenBLAS, Intel MKL, and NVIDIA cuBLAS

These parallel libraries are typically provided by hardware vendors and are deeply optimized for the corresponding hardware, with performance far beyond R's built-in BLAS library. It is therefore recommended to select a high-performance BLAS when compiling R, or to specify the library to load at runtime via LD_PRELOAD; the specific compilation and loading steps can be found in the appendix of this blog [1]. In a matrix-computation comparison, the parallel libraries on a 16-core CPU easily outperform R's original BLAS, and GPU math libraries likewise deliver significant acceleration for several common analytics algorithms.
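A minimal sketch of such a comparison is shown below; the LD_PRELOAD path is an assumption and varies by installation. Timing a large matrix multiply, which dispatches to the BLAS dgemm routine, makes the difference between backends visible:

# Start R with an optimized BLAS preloaded, e.g. on Linux (path is an assumption):
#   LD_PRELOAD=/opt/OpenBLAS/lib/libopenblas.so R
set.seed(1)
n <- 2000
A <- matrix(rnorm(n * n), n, n)
B <- matrix(rnorm(n * n), n, n)
sessionInfo()$BLAS          # path of the BLAS this session is linked against (R >= 3.4)
system.time(C <- A %*% B)   # %*% calls BLAS dgemm; an optimized multi-threaded BLAS
                            # should be several times faster than R's reference BLAS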

2. Using multi-threaded functions in R

OpenMP is a shared-memory multithreading library used mainly to accelerate applications on a single node. Recent versions of R enable the OpenMP option at compile time, which means some computations can run in multithreaded mode. For example, the dist() function in R has a multithreaded implementation: by setting the number of threads, it can use multiple compute cores on the current machine. The simple example below gives a feel for the efficiency of parallel computing; the complete code is on GitHub [2] and needs to run on Linux.

# Comparison of single-threaded and multi-threaded runs
for (i in 6:11) {
  ORDER <- 2^i
  m <- matrix(rnorm(ORDER * ORDER), ORDER, ORDER)
  .Internal(setMaxNumMathThreads(1)); .Internal(setNumMathThreads(1))
  res <- system.time(d <- dist(m))
  print(res)
  .Internal(setMaxNumMathThreads(20)); .Internal(setNumMathThreads(20))
  res <- system.time(d <- dist(m))
  print(res)
}

3. Using parallelized packages

The R high-performance computing task view [3] lists many existing parallelized packages and tools. Users can adopt these packages as quickly and easily as any other R package, staying focused on the problem at hand without worrying much about parallel implementation details or performance tuning. Take H2O [4] as an example: its backend is implemented in Java with multi-threaded and multi-node computing, while the R frontend interface is simple and clear. The user only needs to specify the number of threads when initializing H2O after loading the package; subsequent computations, such as the GBM, GLM, and deep learning algorithms, are automatically distributed across multiple threads and CPUs. Detailed functions can be found in the H2O documentation [5].

library(h2o)
h2o.init(nthreads = 4)

Connection successful!

R is connected to the H2O cluster:
    H2O cluster uptime:         1 hours minutes
    H2O cluster version:        3.8.3.3
    H2O cluster name:           h2o_started_from_r_patricz_ywj416
    H2O cluster total nodes:    1
    H2O cluster total memory:   1.55 GB
    H2O cluster total cores:    4
    H2O cluster allowed cores:  4
    H2O cluster healthy:        TRUE
    H2O Connection ip:          localhost
    H2O Connection port:        54321
    H2O Connection proxy:       NA
    R Version:                  R version 3.3.0 (2016-05-03)
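Once the cluster is up, training is parallelized transparently. Below is a minimal hedged sketch using the built-in iris data set (the choice of data and of columns 1 to 4 as predictors is ours, for illustration only):

library(h2o)
h2o.init(nthreads = 4)                 # start a local H2O backend with 4 threads
hex <- as.h2o(iris)                    # upload a small built-in data set
# Train a GBM; H2O automatically spreads the work over the threads requested above
fit <- h2o.gbm(x = 1:4, y = "Species", training_frame = hex)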
Explicit Parallel Computing

Explicit parallel computing requires the user to handle data partitioning, task distribution, computation, and the final collection of results. The explicit mode therefore demands more of the user: not only an understanding of one's own algorithm, but also some understanding of parallel computing and the hardware. Fortunately, the existing parallel computing frameworks in R, such as parallel (snow, multicore), Rmpi, and foreach, adopt a simple and clear mapping-style parallel programming model that greatly reduces programming complexity: R users only need to convert an existing program into the *apply or for-loop form and then switch to a parallel API with a simple substitution. More complex computation patterns can be built by repeating the map-reduce process. Below we use the problem of solving quadratic equations to describe how to parallelize computations with *apply and foreach; the complete code (ExplicitParallel.R) [6] can be downloaded from GitHub. First we present a non-vectorized quadratic equation solver, which handles several special cases: the quadratic coefficient being zero, both the quadratic and linear coefficients being zero, and a negative discriminant. Then we randomly generate three large vectors holding the quadratic, linear, and constant coefficients.

# Non-vectorized function
solve.quad.eq <- function(a, b, c) {
  # Not a valid equation: a and b are almost zero
  if (abs(a) < 1e-8 && abs(b) < 1e-8) return(c(NA, NA))
  # Not a quadratic equation
  if (abs(a) < 1e-8 && abs(b) > 1e-8) return(c(-c/b, NA))
  # No real solutions
  if (b*b - 4*a*c < 0) return(c(NA, NA))
  # Return solutions
  x.delta <- sqrt(b*b - 4*a*c)
  x1 <- (-b + x.delta) / (2*a)
  x2 <- (-b - x.delta) / (2*a)
  return(c(x1, x2))
}

# Generate data
len <- 1e6
a <- runif(len, -10, 10)
a[sample(len, 100, replace = TRUE)] <- 0
b <- runif(len, -10, 10)
c <- runif(len, -10, 10)

*apply implementation: First we look at the serial code. The following code uses lapply to map the equation solver solve.quad.eq onto each set of input coefficients, saving the return values in a list.

# Serial code
system.time(
  res1.s <- lapply(1:len, FUN = function(x) { solve.quad.eq(a[x], b[x], c[x]) })
)

Next we use mclapply (multicore) from the parallel package to parallelize the lapply computation. Its interface is almost the same as the original lapply, apart from specifying the number of compute cores required, so the learning cost for users is very low. mclapply exploits the fork mechanism on Linux to create multiple copies of the current R process, distributes the input indices across those processes, lets each process compute according to its own indices, and finally collects and merges the results. In this example we specify two worker processes: one handles the data at indices 1:(len/2), the other at (len/2+1):len, and when mclapply returns, the two partial results are merged into res1.p. However, since multicore relies on the Linux process-creation (fork) mechanism underneath, this version only runs on Linux.

# Parallel: multicore on Linux
library(parallel)
system.time(
  res1.p <- mclapply(1:len, FUN = function(x) { solve.quad.eq(a[x], b[x], c[x]) },
                     mc.cores = 2)
)

For non-Linux users, the parallel package provides parLapply. parLapply works on Windows, Linux, and Mac, so it is more portable, but it is slightly more complex to use. Before calling parLapply, we first need to create a cluster. A cluster here is a software-level concept: it specifies how many R worker processes to create for the computation (unlike multicore, the parallel package creates fresh R worker processes rather than copies of the parent process), and in theory its size is not constrained by the hardware. For example, we could create a cluster of size 1000, that is, 1000 R worker processes. In practice, however, we usually create a cluster the same size as the available hardware resources, so that each R worker process maps to one compute core; if the cluster is larger than the hardware, multiple R worker processes share the hardware resources. In the following example we first use detectCores to determine the number of cores in the current machine. Note that by default detectCores returns the number of hyper-threads rather than physical cores: my laptop, for example, has 2 physical cores, each of which can emulate 2 hyper-threads, so detectCores() returns 4. For many compute-intensive tasks, hyper-threading brings little performance benefit, so we pass logical = FALSE to get the number of physical cores and create a cluster of that size. Because the processes in the cluster are brand-new R processes, the data and functions of the parent process are not visible to them; we therefore use clusterExport to broadcast the data and functions needed for the computation to every process in the cluster. Finally, parLapply distributes the computation evenly over all R processes in the cluster and then collects and merges the results.

# Cluster on Windows (also works on Linux and Mac)
cores <- detectCores(logical = FALSE)
cl <- makeCluster(cores)
clusterExport(cl, c('solve.quad.eq', 'a', 'b', 'c'))
system.time(
  res1.p <- parLapply(cl, 1:len, function(x) { solve.quad.eq(a[x], b[x], c[x]) })
)
stopCluster(cl)

for-loop implementation: the for-loop version of the computation is basically similar to the *apply form. In the serial implementation below, we create the result matrix in advance and simply assign into it inside the for loop.

# Serial code
res2.s <- matrix(0, nrow = len, ncol = 2)
system.time(
  for (i in 1:len) {
    res2.s[i, ] <- solve.quad.eq(a[i], b[i], c[i])
  }
)

To parallelize the for loop, we can use the %dopar% operator from the foreach package to distribute the computation across multiple compute cores. The foreach package provides the software-level data mapping, but does not itself create the cluster, so we additionally need the doParallel or doMC package. Creating the cluster is the same as before; once it is created, we call registerDoParallel to register it as the backend for foreach. Next consider the data distribution: we want each R worker process to receive a contiguous block of work, with the indices 1:len spread evenly across the workers. Suppose we have two worker processes: process 1 handles the data from 1 to len/2, and process 2 the data from len/2+1 to len. So in the program below, we distribute the vectors evenly over the cluster, and each process computes a contiguous chunk of chunk.size tasks. The result matrix is created inside each process to hold its local results, and foreach merges the partial results with the rbind function given by the .combine argument.

# foreach
library(foreach)
library(doParallel)
# Real physical cores in the computer
cores <- detectCores(logical = FALSE)
cl <- makeCluster(cores)
registerDoParallel(cl, cores = cores)
# Split the data by ourselves
chunk.size <- len / cores
system.time(
  res2.p <- foreach(i = 1:cores, .combine = 'rbind') %dopar% {
    # Local matrix for this worker's results
    res <- matrix(0, nrow = chunk.size, ncol = 2)
    for (x in ((i - 1) * chunk.size + 1):(i * chunk.size)) {
      res[x - (i - 1) * chunk.size, ] <- solve.quad.eq(a[x], b[x], c[x])
    }
    # Return local results
    res
  }
)
stopImplicitCluster()
stopCluster(cl)

Finally, we tested with 4 threads on the Linux platform: the parallel implementations above run up to 3 times faster than their serial counterparts.

Challenges and Prospects of R Parallelization

Challenges:

In practice, parallel computing is never quite so simple, and the challenges in parallelizing R and its entire ecosystem remain enormous.

1. R is decentralized, non-commercial software

R is not developed by a tightly knit organization or company, and most of its packages are contributed by users themselves. This makes it difficult to adjust and deploy the software architecture and design in a unified way. By contrast, commercial software such as MATLAB is managed, maintained, and developed in a unified fashion, so architectural adjustments and partial refactoring are relatively easy; after several version iterations, the overall degree of parallelism of such software becomes much higher.

2. R's underlying design is still single-threaded, and upper-level packages are highly interdependent

R was originally designed around a single-threaded mode, meaning that many underlying data structures are not thread-safe. When implementing parallel algorithms at the upper level, many data structures must therefore be rewritten or adjusted, which can break some of R's original design patterns. On the other hand, dependencies between R packages are strong. Suppose we use package B, and package B calls package A. If package B implements multithreading first, and after some time package A is parallelized as well, mixed (nested) parallelism becomes very likely; the program may then exhibit all kinds of strange bugs, and performance can drop sharply.
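The sketch below is a hedged illustration of how such nested parallelism arises; the functions inner and res.nested are hypothetical stand-ins for packages A and B, not real packages. Each layer spawns its own workers, oversubscribing the cores (Linux only, as with the earlier mclapply examples):

library(parallel)
# Hypothetical "package A": already parallelized internally
inner <- function(x) mclapply(1:4, function(i) i * x, mc.cores = 4)
# Hypothetical "package B": parallelizes its own loop, unaware of A's workers
res.nested <- mclapply(1:4, inner, mc.cores = 4)  # up to 4 x 4 = 16 processes
                                                  # compete for the CPU at once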

Outlook:

What will the main patterns of future R parallelization be? The following is pure speculation; any resemblance to reality is entirely coincidental.

1. R will rely more on commercial and high-performance components provided by companies and research institutions

For example, H2O, MXNet, and Intel DAAL have all exploited parallel efficiency to a great extent and have dedicated staff for long-term updates and tuning. In essence, software development cannot be separated from investments of manpower and capital.

2. Cloud computing platforms

With the rise of cloud computing, the wave of Data Analysis as a Service (DaaS) and Machine Learning as a Service (MLaaS) is coming. Major service providers have deployed parallelization measures all the way from the underlying hardware and database optimization up to algorithm optimization, such as the series of R-on-the-cloud products Microsoft recently launched. In the future, more and more parallelization work will be transparent to users: R users will see the same familiar R, while the real computation is distributed in the cloud.

Resources

[1] http://www.parallelr.com/r-hpac-benchmark-analysis/
[2] https://github.com/PatricZhao/ParallelR/blob/master/PP_for_COS/ImplicitParallel_MT.R
[3] https://cran.r-project.org/web/views/HighPerformanceComputing.html
[4] http://www.h2o.ai/
[5] http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/RBooklet.pdf
[6] https://github.com/PatricZhao/ParallelR/blob/master/PP_for_COS/ExplicitParallel.R

Transferred from: http://cos.name/2016/09/r-and-parallel-computing/
