R multithreading and multi-node parallel computing

Source: Internet
Author: User
Tags: sparkr

One: R itself is single-threaded. How can we make it run multi-threaded and speed up computation? See "Playing parallel computing with the parallel and foreach packages".

After reading the article above you will get it. Put plainly: load the parallel package, then rewrite your own code accordingly.
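The article above also covers the foreach route, which this post does not demonstrate. A minimal sketch with the doParallel backend (the squaring workload is purely illustrative):

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)       # small cluster for the demo
registerDoParallel(cl)     # register it as the foreach backend

# square each element in parallel and combine the results into a vector
squares <- foreach(i = 1:10, .combine = c) %dopar% i^2

stopCluster(cl)
print(squares)
```

`%dopar%` sends iterations to the registered backend; replace it with `%do%` to run the same loop serially.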

#----- A small example to demonstrate multithreaded computing in R:
# count how many steps the Collatz iteration takes to reach 1
func <- function(x) {
  n <- 1
  raw <- x
  while (x > 1) {
    x <- ifelse(x %% 2 == 0, x / 2, 3 * x + 1)
    n <- n + 1
  }
  return(c(raw, n))
}

#----
library(parallel)
# use system.time to measure how long the computation takes
system.time({
  x <- 1:1e5
  cl <- makeCluster(4)                 # initialize a 4-worker cluster
  results <- parLapply(cl, x, func)    # parallel version of lapply
  res.df <- do.call("rbind", results)  # combine the results
  stopCluster(cl)                      # shut down the cluster
})

   user  system elapsed
  0.431   0.062  18.954

Running func over 1:1e5 (100,000 values) completes in only 18.954 seconds.
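Hardcoding 4 workers works, but the cluster can also be sized from the machine itself with `parallel::detectCores()`, which is part of base R. A small sketch (the doubling workload is just for illustration):

```r
library(parallel)

# size the cluster from the machine instead of hardcoding 4 workers
n_workers <- max(1L, detectCores() - 1L)  # leave one core for the OS
cl <- makeCluster(n_workers)

res <- parLapply(cl, 1:8, function(x) x * 2L)  # trivial demo workload
stopCluster(cl)
unlist(res)
```

Leaving one core free keeps the machine responsive while the cluster runs, which matters given the near-100% CPU usage shown below.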

#--- Plot the results (see Figure 1); the shape is a little strange...
library(ggplot2)
df <- as.data.frame(res.df)
qplot(data = df, x = V1, y = V2)

------------

Figure 1

-----------

Figure 2: CPU usage during the run. You can see four R processes running, and CPU usage instantly jumps to nearly 100%. My poor computer...

---------

Next, let's use parallel in a web crawler and see how it performs on a real scraping task.

One important note: the library() calls must be written inside the function, because each worker needs to load the packages itself.
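As an aside, base R's parallel package also offers `clusterEvalQ()` to run `library()` once on every worker, and `clusterExport()` to ship variables to them, which avoids repeating the loading code inside the function. A minimal sketch (the `base_url` variable here is illustrative):

```r
library(parallel)

cl <- makeCluster(2)
clusterEvalQ(cl, library(stats))            # load a package on each worker
base_url <- "http://www.cnblogs.com/pick/"  # variable to ship to workers
clusterExport(cl, "base_url")

# each worker can now see base_url without it being passed as an argument
urls <- parSapply(cl, 1:3, function(i) paste0(base_url, i, "/"))
stopCluster(cl)
print(urls)
```

Either style works; loading inside the function (as below) keeps the function self-contained, while clusterEvalQ() keeps the function body shorter.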

getData <- function(i) {
  library(magrittr)
  library(proto)
  library(gsubfn)
  library(bitops)
  library(rvest)
  library(stringr)
  library(DBI)
  library(RSQLite)
  # library(sqldf)
  library(RCurl)
  # library(ggplot2)
  library(sp)
  library(raster)  # trim() below comes from raster
  url <- paste0("http://www.cnblogs.com/pick/", i, "/")  # generate the URL
  combined_info <- url %>% html_session() %>% html_nodes("div.post_item div.post_item_foot") %>% html_text() %>% strsplit(split = "\r\n")
  post_date <- sapply(combined_info, function(v) return(v[3])) %>% str_sub(9, 24) %>% as.POSIXlt()  # get the date
  post_year <- post_date$year + 1900
  post_month <- post_date$mon + 1
  post_day <- post_date$mday
  post_hour <- post_date$hour
  post_weekday <- weekdays(post_date)
  title <- url %>% html_session() %>% html_nodes("div.post_item h3") %>% html_text() %>% as.character() %>% trim()
  link <- url %>% html_session() %>% html_nodes("div.post_item a.titlelnk") %>% html_attr("href") %>% as.character()
  author <- url %>% html_session() %>% html_nodes("div.post_item a.lightblue") %>% html_text() %>% as.character() %>% trim()
  author_hp <- url %>% html_session() %>% html_nodes("div.post_item a.lightblue") %>% html_attr("href") %>% as.character()
  recommendation <- url %>% html_session() %>% html_nodes("div.post_item span.diggnum") %>% html_text() %>% trim() %>% as.numeric()
  article_view <- url %>% html_session() %>% html_nodes("div.post_item span.article_view") %>% html_text() %>% str_sub(4, 20)
  article_view <- gsub(")", "", article_view) %>% trim() %>% as.numeric()
  article_comment <- url %>% html_session() %>% html_nodes("div.post_item span.article_comment") %>% html_text() %>% str_sub(14, 100)
  article_comment <- gsub(")", "", article_comment) %>% trim() %>% as.numeric()
  data.frame(title, recommendation, article_view, article_comment, post_date, post_weekday, post_year, post_month, post_day, post_hour, link, author, author_hp)
}

#-------- Method 1: a plain loop

df <- data.frame()

system.time({
  for (i in 1:73) {
    df <- rbind(df, getData(i))
  }
})


   user  system elapsed
 21.605   0.938  95.918

#-------- Method 2: multithreaded parallel computing
library(parallel)
system.time({
  x <- 1:73
  cl <- makeCluster(4)                    # initialize a 4-worker cluster
  results <- parLapply(cl, x, getData)    # parallel version of lapply
  jinghua <- do.call("rbind", results)    # combine the results
  stopCluster(cl)                         # shut down the cluster
})

   user  system elapsed
  0.155   0.122  32.674

The parallel version is clearly much faster: 32.7 seconds versus 95.9 seconds for the loop.

---

The scraped data looks like this: some basic information on the featured ("picked") posts from the cnblogs blog garden.

-------------------------

Two: Deploy R on a Linux server

I'll write up the pitfalls I hit during deployment later. For now, the talk "WEB scraping with R" describes the various benefits of running R on Linux:

Why Linux?

Network performance & memory management → faster

Better parallelization support → faster

Unified encoding & locale → faster (for coders)

More recent third-party libs → faster (fewer bugs)
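On the "better parallelization support" point: on Linux (and macOS) the parallel package additionally offers fork-based `mclapply()`, which needs no cluster setup and no per-worker package loading, since forked workers share the parent process's state. A small sketch (the squaring workload is illustrative):

```r
library(parallel)

# fork-based parallelism is unavailable on Windows, so fall back to 1 core
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# mclapply forks the current R process; workers inherit loaded packages
res <- mclapply(1:8, function(x) x^2, mc.cores = cores)
unlist(res)
```

This is one concrete reason Linux deployments parallelize more conveniently than Windows ones.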

Looking forward to building out our analysis environment.

Three: Summary

------

To improve R's execution speed, we can attack the problem from the following angles:

1. Abandon data.frame, embrace data.table, and optimize your code.

2. Use R's own parallel package for multithreaded computation to raise CPU utilization.

3. Run on a powerful server: 16 cores and 128 GB of RAM, a brute-force machine.

4. Cluster across multiple big machines: RHadoop, SparkR, ...
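Point 1 in sketch form: the same grouped aggregation written with base data.frame tools and with data.table (this assumes the data.table package is installed; the data itself is synthetic). On large tables the data.table version is typically much faster:

```r
library(data.table)

# synthetic data: two groups, ten thousand rows
df <- data.frame(g = rep(c("a", "b"), each = 5000), x = runif(10000))
dt <- as.data.table(df)

# base R way: aggregate() on a data.frame
agg_df <- aggregate(x ~ g, data = df, FUN = sum)

# data.table way: grouped sum inside [], no copies of the data
agg_dt <- dt[, .(x = sum(x)), by = g]
```

Both produce one row per group with the group sums; data.table's by-reference semantics are where the speedup comes from.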

-------

Latest SparkR developments, for future reference:

Announcing SparkR: R on Spark

SparkR GitHub

SparkR (R on Spark)

Documentation for package 'SparkR' version 1.4.1

SparkR sounds flashy, but in practice it still has a long way to go. Transwarp engineers ran functional tests on the SparkR environment and concluded that local R code needs changes before it runs there, because plain R code and SparkR code are not the same: Spark's core data structure is the RDD (Resilient Distributed Dataset), a fault-tolerant, parallel data structure that lets users explicitly persist data to disk or memory and control how it is partitioned. The just-announced 1.4.1 release (see "Announcing SparkR: R on Spark") adds DataFrame support; here's hoping SparkR keeps getting easier to use.

-------

Here are some R conference write-ups, for reference:

Minutes of the China R Language Conference (Beijing), with lecture materials

Minutes of the China R Language Conference (Guangzhou), with lecture materials

Summary of the Ninth China R Language Conference (Beijing)

Summary of the Ninth China R Language Conference (Shanghai)

Minutes of the China R Language Conference (Shanghai venue)

