One: R itself is single-threaded. How do we make it run multi-threaded and speed up computation? See "Playing with parallel computing using the parallel and foreach packages".
After reading that article you will get it. Plainly: load the parallel package, then rewrite your own code accordingly.
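The title above also mentions foreach. As a minimal sketch (my own setup, not from this post), the foreach + doParallel combination looks like this:

library(foreach)
library(doParallel)
cl <- makeCluster(4)        # 4 worker processes
registerDoParallel(cl)      # register the backend so %dopar% runs in parallel
# run the loop body on the workers and rbind the per-iteration results
res <- foreach(i = 1:8, .combine = rbind) %dopar% c(i, i^2)
stopCluster(cl)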
#----- A small example to demonstrate multi-threaded computing in R
# Collatz (3n+1) iteration: return the starting value and the number of steps to reach 1
func <- function(x) {
  n <- 1
  raw <- x
  while (x > 1) {
    x <- ifelse(x %% 2 == 0, x / 2, 3 * x + 1)
    n <- n + 1
  }
  return(c(raw, n))
}
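A quick check of what func returns (my own example, not from the post): for 6 the sequence is 6 -> 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1, nine values counting the start:

func(6)
# [1] 6 9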
#----
library(parallel)
# use system.time() to measure how long the computation takes
system.time({
  x <- 1:1e5
  cl <- makeCluster(4)                  # initialize a 4-core cluster
  results <- parLapply(cl, x, func)     # parallel version of lapply
  res.df <- do.call('rbind', results)   # combine the results
  stopCluster(cl)                       # shut down the cluster
})
   user  system elapsed
  0.431   0.062  18.954
Applying func to 1:1e5 (100,000 values) completes in only 18.954 seconds.
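For comparison, the single-threaded baseline would be a plain lapply (a sketch; this timing is not in the post and will vary by machine):

system.time({
  res.serial <- do.call('rbind', lapply(1:1e5, func))
})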
#--- Plot the results (see Figure 1); the figure looks kind of strange...
library(ggplot2)
df <- as.data.frame(res.df)
qplot(data = df, x = V1, y = V2)
------------
Figure 1
------------
Figure 2: Looking at the CPU usage, you can see four R worker processes running; usage instantly soars to nearly 100%. Poor computer...
---------
Using parallel in a crawler: next, let's test parallel's performance with a web scraping program.
It is important to note that the library() calls need to be written inside the function, because each worker process has to load the packages itself.
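An alternative (my sketch, not what the post does) is to load the packages once on every worker with clusterEvalQ before calling parLapply:

library(parallel)
cl <- makeCluster(4)
clusterEvalQ(cl, {          # run this expression on every worker
  library(rvest)
  library(stringr)
  library(magrittr)
})
# objects the workers need can be shipped over with clusterExport(cl, ...)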
# scrape one page of cnblogs featured ("pick") posts and return a data.frame
getData <- function(i) {
  library(magrittr)
  library(proto)
  library(gsubfn)
  library(bitops)
  library(rvest)
  library(stringr)
  library(DBI)
  library(RSQLite)
  # library(sqldf)
  library(RCurl)
  # library(ggplot2)
  library(sp)
  library(raster)
  url <- paste0("http://www.cnblogs.com/pick/", i, "/")  # generate the URL
  combined_info <- url %>% html_session() %>% html_nodes("div.post_item div.post_item_foot") %>% html_text() %>% strsplit(split = "\r\n")
  post_date <- sapply(combined_info, function(v) return(v[3])) %>% str_sub(9, 24) %>% as.POSIXlt()  # get the date
  post_year <- post_date$year + 1900
  post_month <- post_date$mon + 1
  post_day <- post_date$mday
  post_hour <- post_date$hour
  post_weekday <- weekdays(post_date)
  title <- url %>% html_session() %>% html_nodes("div.post_item h3") %>% html_text() %>% as.character() %>% trim()
  link <- url %>% html_session() %>% html_nodes("div.post_item a.titlelnk") %>% html_attr("href") %>% as.character()
  author <- url %>% html_session() %>% html_nodes("div.post_item a.lightblue") %>% html_text() %>% as.character() %>% trim()
  author_hp <- url %>% html_session() %>% html_nodes("div.post_item a.lightblue") %>% html_attr("href") %>% as.character()
  recommendation <- url %>% html_session() %>% html_nodes("div.post_item span.diggnum") %>% html_text() %>% trim() %>% as.numeric()
  article_view <- url %>% html_session() %>% html_nodes("div.post_item span.article_view") %>% html_text() %>% str_sub(4, 20)
  article_view <- gsub("\\)", "", article_view) %>% trim() %>% as.numeric()
  article_comment <- url %>% html_session() %>% html_nodes("div.post_item span.article_comment") %>% html_text() %>% str_sub(14, 100)
  article_comment <- gsub("\\)", "", article_comment) %>% trim() %>% as.numeric()
  data.frame(title, recommendation, article_view, article_comment, post_date, post_weekday, post_year, post_month, post_day, post_hour, link, author, author_hp)
}
#-------- Method 1: plain loop
df <- data.frame()
system.time({
  for (i in 1:73) {
    df <- rbind(df, getData(i))
  }
})
   user  system elapsed
 21.605   0.938  95.918
#-------- Method 2: multi-threaded parallel computation
library(parallel)
system.time({
  x <- 1:73
  cl <- makeCluster(4)                    # initialize a 4-core cluster
  results <- parLapply(cl, x, getData)    # parallel version of lapply
  jinghua <- do.call('rbind', results)    # combine the results
  stopCluster(cl)                         # shut down the cluster
})
   user  system elapsed
  0.155   0.122  32.674
Clearly, parallel is much faster (32.7 s versus 95.9 s).
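The examples above hard-code four workers. As a sketch (not from the post), parallel::detectCores() can size the cluster to whatever machine the code runs on:

n_cores <- detectCores() - 1   # leave one core free for the OS and your R session
cl <- makeCluster(n_cores)
results <- parLapply(cl, 1:73, getData)
stopCluster(cl)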
---
The scraped data looks like this: some information about the featured ("pick") posts on cnblogs (the Blog Garden)...
--I'm a divider-------------------------
Two: Deploying R on a Linux server
I will write up the pitfalls I ran into during deployment later. For now, Xiaonan's "Web Scraping with R" describes the various benefits of running R on Linux:
Why Linux?
Network performance & memory management → faster
Better parallelization support → faster
Unified encoding & locale → faster (for coders)
More recent third-party libs → faster (fewer bugs)
Looking forward to building out our analysis environment.
Three: Summary
------
To improve R's computation speed, we can attack the problem from the following angles:
1. Abandon data.frame and embrace data.table, and optimize the code (see the sketch after this list).
2. Use R's own parallel package for multi-threaded computation and improve CPU utilization.
3. Run on a powerful server: 16 cores and 128 GB of RAM, a brute-force supercomputer.
4. Build a cluster out of several big machines: RHadoop, SparkR, ...
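For point 1, a tiny data.table sketch (illustrative, not from the post); grouped aggregation like this is where it clearly beats data.frame:

library(data.table)
dt <- data.table(id = rep(1:3, each = 2), value = 1:6)
dt[, .(total = sum(value)), by = id]   # fast grouped aggregation
# rbindlist(results) is also a faster drop-in for do.call('rbind', results)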
-------
The latest SparkR developments, for future reference:
Announcing SparkR: R on Spark
SparkR GitHub
SparkR (R on Spark)
Documentation for package 'SparkR' version 1.4.1
SparkR sounds very flashy, but in practice there is still a long way to go. Transwarp engineers ran functional tests on the SparkR environment, and the result was: to get local R code running normally under SparkR, the code has to be changed, because plain R code and R code for the SparkR environment are not the same. Spark's data structure is the RDD (Resilient Distributed Dataset), a fault-tolerant, parallel data structure that lets users explicitly persist data to disk or memory and control how the data is partitioned. "Announcing SparkR: R on Spark" says the just-released 1.4.1 version supports DataFrames; I look forward to SparkR becoming more usable...
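From the official announcement, the SparkR 1.4 DataFrame API looks roughly like this (a sketch of a local run based on the 1.4 docs; the setup is assumed):

library(SparkR)
sc <- sparkR.init(master = "local")           # start a local Spark context
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)   # distribute a local data.frame
head(filter(df, df$waiting < 50))             # the filtering runs on Spark
sparkR.stop()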
-------
Finally, here are some R conference materials for reference.
Minutes of the China R Language Conference (Beijing) [with lecture materials]
Minutes of the China R Language Conference (Guangzhou) [with lecture materials]
Summary of the Ninth China R Language Conference (Beijing)
Summary of the Ninth China R Language Conference (Shanghai)
Minutes of the China R Language Conference (Shanghai venue)
R multithreading and multi-node parallel computing