I. Doubling performance with parallel computing
1. Data parallelism vs. task parallelism
Implementing data-parallel algorithms
Socket clusters
Note that parallel speedup is not proportional to the number of compute resources (CPU cores) executing the task. Amdahl's Law: the speedup of parallel code is limited by its serially executed portion, which includes the overhead of parallelization itself.
On non-Windows systems, the parallel package also supports fork clusters: new worker processes are forked from the parent R process and inherit its data. The advantage is that there is no need to explicitly create and destroy a cluster.
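The two cluster types above can be compared with a minimal sketch using the base parallel package (the task and input sizes are arbitrary; the fork-based mclapply call only works on non-Windows systems):

```r
library(parallel)

# A toy CPU-bound task: sum of square roots up to n
slow_task <- function(n) sum(sqrt(seq_len(n)))

inputs <- c(1e6, 2e6, 3e6, 4e6)

# Socket cluster: works on all platforms; each worker is a fresh R process
cl <- makeCluster(2)
res_socket <- parLapply(cl, inputs, slow_task)
stopCluster(cl)   # the cluster must be destroyed explicitly

# Fork cluster (non-Windows only): workers are forked from the parent
# process, so no explicit cluster creation/destruction is needed
res_fork <- mclapply(inputs, slow_task, mc.cores = 2)
```

Because the forked workers start as copies of the parent, slow_task and inputs are visible to them automatically, whereas socket workers receive only what is shipped to them.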
Implementing task-parallel algorithms
2. Executing tasks in parallel on a cluster of computers
Only socket-based clusters can do this, because a process cannot be forked onto another machine. Since a computer cluster must communicate over the network, network bandwidth and latency are critical to overall cluster performance. Therefore, when multiple physical machines are involved, it is most reasonable to deploy all nodes on the same network segment.
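A multi-machine socket cluster is created by passing hostnames instead of a worker count; the hostnames below are placeholders for machines reachable over SSH on the same network segment:

```r
library(parallel)

# Two workers on each of two machines (hostnames are hypothetical);
# R must be installed and reachable on every node.
hosts <- c("node1", "node1", "node2", "node2")
cl <- makePSOCKcluster(hosts)

clusterEvalQ(cl, library(stats))   # load required packages on every worker
res <- parSapply(cl, 1:8, function(i) mean(rnorm(1e5)))

stopCluster(cl)
```

Every object and package a worker needs must be shipped or loaded explicitly, since remote workers share nothing with the master.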
3. Shared-memory parallelism vs. distributed-memory parallelism
In distributed-memory parallelism, each process has its own memory space; in other words, each process holds its own copy of the data, even when the processes are working on the same data.
When such parallel code runs as multiple processes on a single computer, this creates a great deal of redundancy. A socket cluster creates a new R instance for each worker, so every worker holds its own copy of the data; fork clusters largely avoid this on a single machine because forked processes share memory pages copy-on-write.
In shared-memory parallelism, all worker processes share a single copy of the data. Although the parallel package itself does not support shared-memory parallelism, it can be achieved by adjusting the data structure: the big.matrix objects from the CRAN package bigmemory live in shared memory.
Be sure to avoid race conditions, where worker processes read and write the same memory location without proper coordination, causing conflicts and program errors.
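A minimal shared-memory sketch with bigmemory: the master creates a big.matrix and passes only its small descriptor to the workers, which attach to the same underlying memory instead of copying the data (bigmemory must be installed on the workers; the sizes here are arbitrary):

```r
library(parallel)
library(bigmemory)   # CRAN package providing shared-memory big.matrix

# Matrix backed by shared memory; describe() yields a small descriptor
x <- big.matrix(nrow = 1000, ncol = 4, init = 0, type = "double")
x[, ] <- rnorm(4000)
desc <- describe(x)

cl <- makeCluster(2)
col_sums <- parSapply(cl, 1:4, function(j, desc) {
  m <- bigmemory::attach.big.matrix(desc)  # attaches; does not copy data
  sum(m[, j])
}, desc)
stopCluster(cl)
```

This access pattern is race-free because each worker only reads its own disjoint column; concurrent writes to the same cells would need explicit coordination.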
4. Ways to optimize parallel performance
The main obstacle is data transfer and copying between the master and the workers.
1. Use shared-memory parallelism.
2. Compress the data before transferring it.
3. Keep the data stored on each worker node, so that only intermediate results travel over the network, similar to MapReduce.
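One concrete way to cut master-worker transfer with the parallel package is to ship a large object to the workers once with clusterExport, rather than passing it as an argument on every call (the data and task below are arbitrary):

```r
library(parallel)

big_data <- rnorm(1e7)
cl <- makeCluster(2)

# Costly pattern: passing big_data as an argument re-serializes and
# re-sends it for every parLapply invocation:
#   parLapply(cl, 1:10, function(i, d) mean(d + i), big_data)

# Cheaper pattern: export it once; later calls reuse the copy that is
# already resident on each worker
clusterExport(cl, "big_data")
res <- parLapply(cl, 1:10, function(i) mean(big_data + i))

stopCluster(cl)
```

The saving grows with the number of parallel calls made against the same exported data.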
II. Handing data processing over to the database system
When a large data set is stored in a database, extracting all of it into R is unrealistic.
1. Extracting data into R vs. processing data in the database
Use SQL to preprocess the data inside the relational database.
The dplyr and PivotalR packages can translate R expressions into SQL.
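A small sketch of dplyr's SQL translation (backed by dbplyr), using an in-memory SQLite database as a stand-in for a real relational database; the table and column names are invented for illustration:

```r
library(dplyr)
library(dbplyr)
library(RSQLite)

# In-memory SQLite stands in for a production database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "scores",
                  data.frame(subject = c("a", "a", "b"),
                             score   = c(80, 90, 70)))

# This pipeline is translated to SQL and evaluated inside the database;
# rows only reach R when collect() is called
tbl(con, "scores") %>%
  group_by(subject) %>%
  summarise(avg = mean(score, na.rm = TRUE)) %>%
  show_query()       # prints the generated SQL instead of fetching rows

DBI::dbDisconnect(con)
```

show_query() is a convenient way to check what work is actually being pushed down to the database.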
2. Running statistical and machine-learning algorithms in the database
MADlib adds advanced statistical and machine-learning functions to PostgreSQL; it does not support Windows. The rule is simple: compute on the database side, then load only the results into R.
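The "compute in the database, load only results" rule can be sketched from R with DBI; the connection details and table/column names below are placeholders, and the example assumes a PostgreSQL instance with MADlib installed:

```r
library(DBI)
library(RPostgres)

# Hypothetical connection to a PostgreSQL database with MADlib
con <- dbConnect(RPostgres::Postgres(), dbname = "testdb")

# Train a linear regression entirely inside the database;
# madlib.linregr_train writes its output to a result table
dbExecute(con, "
  SELECT madlib.linregr_train('houses', 'houses_model',
                              'price', 'ARRAY[1, size, rooms]');
")

# Pull back only the small model table, never the raw data
model <- dbGetQuery(con, "SELECT coef, r2 FROM houses_model;")

dbDisconnect(con)
```

Only the coefficients and fit statistics cross the wire; the full houses table stays in the database.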
3. Using a column-oriented database to improve performance
This approach is not very suitable for our business at this time.
4. Using an array database to maximize scientific-computing performance
Array databases are suited to multidimensional data models.
III. Thinking further: R and big data
HDFS stores data in blocks (128 MB) with three replicas by default, guaranteeing high availability.
MapReduce processes the data on HDFS in parallel, much like point 3 in section I.4 above; the advantage of MapReduce is that the data is already stored on the worker nodes and does not need to be distributed every time a task runs.
However, every pass reads the data from disk and writes results back to disk, so this approach only pays off when the computation time outweighs the cost of reading and writing the data and of running the Hadoop cluster.
I will not go into detail here, because the big data ecosystem is already so large that no one person can cover it in a day or two.
Analyzing HDFS data with RHadoop
The rmr2 function make.input.format() configures how files are read: plain text and JSON, Hadoop's native serialized formats, as well as HBase, Hive, and Pig data.
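A minimal rmr2 sketch, assuming a working Hadoop installation and the RHadoop packages: a small vector is written to HDFS, squared via a MapReduce job, and the key-value pairs read back.

```r
library(rmr2)   # part of the RHadoop family; requires a running Hadoop

# Write a small numeric vector to HDFS
input <- to.dfs(1:10)

# MapReduce job: emit each value paired with its square
result <- mapreduce(
  input = input,
  map   = function(k, v) keyval(v, v^2)
)

# Read the key-value pairs back into R
out <- from.dfs(result)
```

On real data, input would instead point at existing HDFS files, with make.input.format() describing their format.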
In addition to rhdfs and rmr2, the RHadoop family also includes:
plyrmr: plyr-style functions on top of MapReduce
rhbase: functions for working with HBase data
ravro: reading and writing data in Avro format
This concludes the core knowledge of high-performance R programming.
As for truly massive data, there are situations where no amount of optimization lets R cope. For example, in our business, a single sitting of the college entrance examination means 5 million examinees x 100 items per subject = 500 million records of offline data, plus billions of accumulated historical records.
At that scale, I personally feel that beyond optimizing R and upgrading the hardware, we should embrace big data: build our own data warehouse, store the historical data, and turn it into products, so as to grow bigger and stronger!
PS: R was originally chosen to solve problems such as DIF, reliability, standard error of measurement, and correlation coefficients. If we abandon R, how do we solve these algorithmic problems? And how do we compute correlation coefficients when the algorithm cannot be partitioned?
SparkR? Or is there some other way?
R Language High performance programming (III)