One: R itself is single-threaded. How do we make it run multi-threaded and speed up computation? See "Playing with parallel computing using the parallel and foreach packages".
After reading that article you will get it. Plainly: load the parallel package, then rewrite your own code accordingly.
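The title above also mentions foreach. As a minimal sketch (my own setup, not from this post), the foreach + doParallel combination looks like this:

library(foreach)
library(doParallel)
cl <- makeCluster(4)        # 4 worker processes
registerDoParallel(cl)      # register the backend so %dopar% runs in parallel
# run the loop body on the workers and rbind the per-iteration results
res <- foreach(i = 1:8, .combine = rbind) %dopar% c(i, i^2)
stopCluster(cl)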
#----- A small example to demonstrate multi-threaded computing in R
# Collatz (3n+1) iteration: return the starting value and the number of steps to reach 1
func <- function(x) {
  n <- 1
  raw <- x
  while (x > 1) {
    x <- ifelse(x %% 2 == 0, x / 2, 3 * x + 1)
    n <- n + 1
  }
  return(c(raw, n))
}
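A quick check of what func returns (my own example, not from the post): for 6 the sequence is 6 -> 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1, nine values counting the start:

func(6)
# [1] 6 9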
#----
library(parallel)
# use system.time() to measure how long the computation takes
system.time({
  x <- 1:1e5
  cl <- makeCluster(4)                  # initialize a 4-core cluster
  results <- parLapply(cl, x, func)     # parallel version of lapply
  res.df <- do.call('rbind', results)   # combine the results
  stopCluster(cl)                       # shut down the cluster
})
   user  system elapsed
  0.431   0.062  18.954
Applying func to 1:1e5 (100,000 values) completes in only 18.954 seconds.
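For comparison, the single-threaded baseline would be a plain lapply (a sketch; this timing is not in the post and will vary by machine):

system.time({
  res.serial <- do.call('rbind', lapply(1:1e5, func))
})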
#--- Plot the results (see Figure 1); the figure looks kind of strange...
library(ggplot2)
df <- as.data.frame(res.df)
qplot(data = df, x = V1, y = V2)
------------
Figure 1
------------
Figure 2: Looking at the CPU usage, you can see four R worker processes running; usage instantly soars to nearly 100%. Poor computer...
---------
Using parallel in a crawler: next, let's test parallel's performance with a web scraping program.
It is important to note that the library() calls need to be written inside the function, because each worker process has to load the packages itself.
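An alternative (my sketch, not what the post does) is to load the packages once on every worker with clusterEvalQ before calling parLapply:

library(parallel)
cl <- makeCluster(4)
clusterEvalQ(cl, {          # run this expression on every worker
  library(rvest)
  library(stringr)
  library(magrittr)
})
# objects the workers need can be shipped over with clusterExport(cl, ...)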
# scrape one page of cnblogs featured ("pick") posts and return a data.frame
getData <- function(i) {
  library(magrittr)
  library(proto)
  library(gsubfn)
  library(bitops)
  library(rvest)
  library(stringr)
  library(DBI)
  library(RSQLite)
  # library(sqldf)
  library(RCurl)
  # library(ggplot2)
  library(sp)
  library(raster)
  url <- paste0("http://www.cnblogs.com/pick/", i, "/")  # generate the URL
  combined_info <- url %>% html_session() %>% html_nodes("div.post_item div.post_item_foot") %>% html_text() %>% strsplit(split = "\r\n")
  post_date <- sapply(combined_info, function(v) return(v[3])) %>% str_sub(9, 24) %>% as.POSIXlt()  # get the date
  post_year <- post_date$year + 1900
  post_month <- post_date$mon + 1
  post_day <- post_date$mday
  post_hour <- post_date$hour
  post_weekday <- weekdays(post_date)
  title <- url %>% html_session() %>% html_nodes("div.post_item h3") %>% html_text() %>% as.character() %>% trim()
  link <- url %>% html_session() %>% html_nodes("div.post_item a.titlelnk") %>% html_attr("href") %>% as.character()
  author <- url %>% html_session() %>% html_nodes("div.post_item a.lightblue") %>% html_text() %>% as.character() %>% trim()
  author_hp <- url %>% html_session() %>% html_nodes("div.post_item a.lightblue") %>% html_attr("href") %>% as.character()
  recommendation <- url %>% html_session() %>% html_nodes("div.post_item span.diggnum") %>% html_text() %>% trim() %>% as.numeric()
  article_view <- url %>% html_session() %>% html_nodes("div.post_item span.article_view") %>% html_text() %>% str_sub(4, 20)
  article_view <- gsub("\\)", "", article_view) %>% trim() %>% as.numeric()
  article_comment <- url %>% html_session() %>% html_nodes("div.post_item span.article_comment") %>% html_text() %>% str_sub(14, 100)
  article_comment <- gsub("\\)", "", article_comment) %>% trim() %>% as.numeric()
  data.frame(title, recommendation, article_view, article_comment, post_date, post_weekday, post_year, post_month, post_day, post_hour, link, author, author_hp)
}
#-------- Method 1: plain loop
df <- data.frame()
system.time({
  for (i in 1:73) {
    df <- rbind(df, getData(i))
  }
})
   user  system elapsed
 21.605   0.938  95.918
#-------- Method 2: multi-threaded parallel computation
library(parallel)
system.time({
  x <- 1:73
  cl <- makeCluster(4)                    # initialize a 4-core cluster
  results <- parLapply(cl, x, getData)    # parallel version of lapply
  jinghua <- do.call('rbind', results)    # combine the results
  stopCluster(cl)                         # shut down the cluster
})
   user  system elapsed
  0.155   0.122  32.674
Clearly, parallel is much faster (32.7 s versus 95.9 s).
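The examples above hard-code four workers. As a sketch (not from the post), parallel::detectCores() can size the cluster to whatever machine the code runs on:

n_cores <- detectCores() - 1   # leave one core free for the OS and your R session
cl <- makeCluster(n_cores)
results <- parLapply(cl, 1:73, getData)
stopCluster(cl)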
---
The scraped data looks like this: some information about the featured ("pick") posts on cnblogs (the Blog Garden)...
--I'm a divider-------------------------
Two: Deploying R on a Linux server
I will write up the pitfalls I ran into during deployment later. For now, Xiaonan's "Web Scraping with R" describes the various benefits of running R on Linux:
Why Linux?
Network performance & memory management → faster
Better parallelization support → faster
Unified encoding & locale → faster (for coders)
More recent third-party libs → faster (fewer bugs)
Looking forward to building out our analysis environment.
Three: Summary
------
To improve R's computation speed, we can attack the problem from the following angles:
1. Abandon data.frame and embrace data.table, and optimize the code (see the sketch after this list).
2. Use R's own parallel package for multi-threaded computation and improve CPU utilization.
3. Run on a powerful server: 16 cores and 128 GB of RAM, a brute-force supercomputer.
4. Build a cluster out of several big machines: RHadoop, SparkR, ...
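For point 1, a tiny data.table sketch (illustrative, not from the post); grouped aggregation like this is where it clearly beats data.frame:

library(data.table)
dt <- data.table(id = rep(1:3, each = 2), value = 1:6)
dt[, .(total = sum(value)), by = id]   # fast grouped aggregation
# rbindlist(results) is also a faster drop-in for do.call('rbind', results)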
-------
The latest SparkR developments, for future reference:
Announcing SparkR: R on Spark
SparkR GitHub
SparkR (R on Spark)
Documentation for package 'SparkR' version 1.4.1
SparkR sounds very flashy, but in practice there is still a long way to go. Transwarp engineers ran functional tests on the SparkR environment, and the result was: to get local R code running normally under SparkR, the code has to be changed, because plain R code and R code for the SparkR environment are not the same. Spark's data structure is the RDD (Resilient Distributed Dataset), a fault-tolerant, parallel data structure that lets users explicitly persist data to disk or memory and control how the data is partitioned. "Announcing SparkR: R on Spark" says the just-released 1.4.1 version supports DataFrames; I look forward to SparkR becoming more usable...
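From the official announcement, the SparkR 1.4 DataFrame API looks roughly like this (a sketch of a local run based on the 1.4 docs; the setup is assumed):

library(SparkR)
sc <- sparkR.init(master = "local")           # start a local Spark context
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)   # distribute a local data.frame
head(filter(df, df$waiting < 50))             # the filtering runs on Spark
sparkR.stop()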
-------
Finally, here are some R conference materials for reference.
Minutes of the China R Language Conference (Beijing) [with lecture materials]
Minutes of the China R Language Conference (Guangzhou) [with lecture materials]
Summary of the Ninth China R Language Conference (Beijing)
Summary of the Ninth China R Language Conference (Shanghai)
Minutes of the China R Language Conference (Shanghai venue)
R multithreading and multi-node parallel computing