What are the advantages and disadvantages of the R language?

2015-05-27, Programmer, Big Data Small Analysis

R, not just a language

This article was originally published in issue 8, 2010 of "Programmer" magazine. It was abridged there because of space limits; this is the full text.

As the saying goes, a craftsman who wants to do good work must first sharpen his tools. As an engineer on the front lines of the IT world, you always have one blade you can wield freely in battle, be it C++, Java, Perl, Python, Ruby, PHP, JavaScript, Erlang, or something else.

The application scenario determines what knowledge you need and which tools you choose; in turn, whatever tool you choose, you will work to shape it to fit your application. In that spirit, I chose R [1] as my siege ladder as a data miner, and I strive to shape it into what I want it to be.

More precisely, R is a language for statistical computing and graphics, and not just a language but also an environment for data computation and analysis. The three major tools in statistical computing are SAS, SPSS, and S; R was developed under the influence of the S language and Scheme. Its main features are that it is free, open source, and has a very complete set of modules. The Comprehensive R Archive Network (CRAN) offers a large number of third-party packages, covering everything from statistical computing to machine learning, from financial analysis to bioinformatics, from social network analysis to natural language processing, and from database and language interfaces to high-performance computing models. This all-encompassing breadth is an important reason why R is increasingly popular with practitioners in all kinds of fields.

R is noticeably more popular abroad than in China. Just as the spread of pirated Windows has hampered the adoption of Linux in China, the availability of cracked copies of MATLAB and SPSS has limited the number of R users in China. In the statistics departments of universities abroad, however, R is almost a compulsory language and holds a dominant position. In industry, Google, a leader among Internet companies, also uses R for data analysis in a number of projects. There is a Google campus lecture video [2] that uses R as the tool for teaching the concepts and algorithms of data mining.

As the number of R users has grown in recent years, reports about R have started to appear in the press. In early 2009, for example, The New York Times ran a good report titled "Data Analysts Captivated by R's Power" [3]. It describes the history of R and its growing popularity, driven by the demand for data mining. Although R originated from S, its development has gone far beyond S, and it has become the second most popular tool language among graduate students. Staff from Google and Pfizer also described how R is used in their companies. In addition, Google chief economist Hal Varian said that the most amazing thing about R is that you can modify it to do all sorts of things, and that a large number of ready-made toolkits are already available, so you are standing on the shoulders of giants.

Below I introduce this not-so-mainstream programming language through some of its main applications and my own practical experience.

Statistical computing: R's greatest strength

R was born to do statistical computing. From the start it was defined as a tool for statistical computation and graphics. Although it has since been given more and more powerful features, R's developers are still mostly faculty and students in university statistics departments, and they naturally know what they need most.

In statistical computing we often need to fit a linear regression to sample data in order to find some regularity. This is very simple in R. Here is an example of simple (one-variable) linear regression:

x <- 1:10
y <- x + rnorm(10, 0, 1)
fit <- lm(y ~ x)
summary(fit)

Note that "<-" means assignment in R. In most cases it can be replaced with "=", but not in some special situations, so this article follows the official "<-" style. The first two lines of the example prepare two columns of data: the independent variable x and the dependent variable y. The third line calls the function lm to fit a linear regression to the sample data, and the fourth line prints the resulting model. Besides simple linear regression, lm can also fit multiple linear regression and returns the various statistics of the model.
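As a minimal sketch of the multiple regression case, the following fits a model with two predictors. The choice of the built-in mtcars data set and of the columns wt and hp is mine, purely for illustration.

```r
# Multiple linear regression on the built-in mtcars data set:
# model fuel economy (mpg) from weight (wt) and horsepower (hp).
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

coefs <- coef(fit2)                                 # intercept plus one slope per predictor
r2    <- summary(fit2)$r.squared                    # share of variance explained
pvals <- summary(fit2)$coefficients[, "Pr(>|t|)"]   # p-value of each coefficient

print(coefs)
print(r2)
print(pvals)
```

The same summary(fit2) call used in the one-variable example prints all of these statistics at once; extracting them individually is handy when they feed into further computation.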

Statistical work inevitably involves drawing all kinds of plots, and strong graphics support is another basic feature of R. The following code draws a box plot; the code comes from the manual of the boxplot function. For each group of data the plot shows the median, the quartiles, the outliers, and similar information, so the groups can be compared side by side. See [4] for more on R's plotting functions.

boxplot(mpg ~ cyl, data = mtcars, main = "Car Milage Data", xlab = "Number of Cylinders", ylab = "Miles Per Gallon")
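The numbers behind the plot can also be inspected directly. A small sketch, assuming only base R: calling boxplot with plot = FALSE returns the computed statistics without drawing anything.

```r
# Compute the box plot statistics for mpg grouped by cyl without drawing.
b <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)

print(b$names)   # group labels, one per box
print(b$n)       # number of observations in each group
print(b$stats)   # per group: lower whisker, lower hinge, median, upper hinge, upper whisker
print(b$out)     # values drawn as outliers, if any
```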

Machine learning: Make your data work as it should

Machine learning and data mining deal with problems such as association rule mining, clustering, and classification, all abstracted from a great many real-life questions. As a complete computing environment, R provides ample support for all of them.

The association rule problem comes from questions like "what else do customers who buy this product buy?" and is now widely used in customer behavior analysis and Internet user behavior analysis. The most classic algorithm for association rule mining is Apriori, and R's third-party package arules [5] is dedicated to mining association rules. The following example requires the arules package to be installed.

library(arules)
data <- paste("item1,item2", "item1", "item2,item3", sep = "\n")
write(data, file = "demo_basket")
tr <- read.transactions("demo_basket", format = "basket", sep = ",")
data("Adult")
rules <- apriori(Adult, parameter = list(supp = 0.5, conf = 0.9, target = "rules"))

The apriori function on the last line takes a transactions object as input and outputs the association rule object rules. For convenience, the transactions object Adult used in the computation is loaded from the arules package on line 5; lines 2 to 4 show how to read data from a text file and build a transactions object.
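To make the supp and conf parameters concrete, here is a minimal base R sketch that computes support and confidence by hand for the three demo baskets. The helper names support and confidence are hypothetical, introduced only for this illustration; they are not part of arules.

```r
# The same three baskets that are written to demo_basket in the example.
baskets <- list(c("item1", "item2"), c("item1"), c("item2", "item3"))

# Support: fraction of baskets that contain every item in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Confidence of lhs => rhs: support of both sides together
# divided by the support of the left-hand side alone.
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

supp_12 <- support(c("item1", "item2"), baskets)   # 1 of 3 baskets
conf_12 <- confidence("item1", "item2", baskets)   # 1 of the 2 baskets with item1

print(supp_12)
print(conf_12)
```

With supp = 0.5 and conf = 0.9 as thresholds, the rule item1 => item2 would be rejected on both counts for this toy data.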

The most widely used efficient clustering algorithm is undoubtedly k-means. R provides it as the function kmeans in the stats package, which is loaded by default. Here is an example from the kmeans documentation:

x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, 2)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)

Line 1 of the code generates two groups of two-dimensional normally distributed data: the first group has mean 0, the second has mean 1, and both have standard deviation 0.3. Line 2 clusters the data, and lines 3 and 4 plot the clustering result.
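Besides plotting, the fitted object can be inspected numerically. A small sketch under my own assumptions: the seed, the group size of 50 points, and nstart = 10 are arbitrary choices for illustration.

```r
set.seed(42)  # fixed seed so the run is repeatable (arbitrary value)

# Two groups of 50 two-dimensional points, centered at 0 and at 1.
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# nstart = 10 reruns the algorithm from 10 random starts and keeps the best fit.
cl <- kmeans(x, centers = 2, nstart = 10)

print(cl$size)      # number of points assigned to each cluster
print(cl$centers)   # estimated cluster centers, one row per cluster

# Cross-tabulate the generating group against the assigned cluster.
truth <- rep(1:2, each = 50)
print(table(truth, cluster = cl$cluster))
```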

Classification is the central research topic of pattern recognition and a core part of human cognition. Over the years, academic research has produced many kinds of classifiers, and implementations of most of the more reliable ones can be found in R. The best known of them is the SVM, which some call the king of single classifiers. The following uses an SVM to classify the well-known iris data set; running the example requires the e1071 package [6] to be installed.

library(e1071)
data(iris)
x <- subset(iris, select = -Species)
y <- iris$Species
model <- svm(x, y)
summary(model)
pred <- predict(model, x)
table(pred, y)

Line 5 calls the svm function, which trains a classifier model with x as the features and y as the class labels. Line 7 applies the model to the original data to make predictions.
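Predicting on the same data that trained the model tends to give an optimistic picture. As a hedged variation, the sketch below holds out part of the data for evaluation. The 100/50 split, the seed, and the helper accuracy are my own illustrative choices, and the SVM part only runs if e1071 is installed.

```r
# Hypothetical helper: share of predictions that match the true labels.
accuracy <- function(pred, truth) mean(pred == truth)

set.seed(1)                              # arbitrary seed for a repeatable split
test_idx <- sample(nrow(iris), 50)       # 50 rows held out for testing
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

if (requireNamespace("e1071", quietly = TRUE)) {
  # Formula interface of svm: predict Species from all other columns.
  model <- e1071::svm(Species ~ ., data = train)
  pred  <- predict(model, test)
  print(table(pred, truth = test$Species))
  print(accuracy(pred, test$Species))
}
```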

The examples above are not meant to teach readers how to use these functions on the spot. They are meant to show some real R code and how it solves problems, and also how much R has accumulated in these common areas of machine learning. Letting R handle problems that may lie outside our own specialty saves us the effort of reinventing the wheel. The resulting code is short, and the time saved lets you keep the overall logic of your own algorithm in view.

High-performance computing: vectorization and parallel/distributed computing

A modern data mining practitioner's first concern is the scalability of the tools used, specifically their computing power when faced with large volumes of data.

A computing environment with high-performance capabilities must first be able to take full advantage of long-established numerical libraries such as BLAS and LAPACK. It must also be extensible, that is, it must make it easy for developers to parallelize their own algorithms. Fortunately, R has both of these properties.

Engineering computing environments such as R, Scilab, and MATLAB typically use vectorization as their basic computational feature (as does Python's NumPy package), because vectorization is a basic feature of modern large-scale computers [7]. Both hardware and software support it. On the hardware side, instruction sets such as Intel's MMX and SSE provide vectorization support; more can be found in the Wikipedia article. On the software side, well-known numerical packages such as BLAS can naturally be used to execute vectorized commands in parallel automatically.

Vectorization is a special form of parallel computing. Whereas an ordinary program performs only one operation at a time, a vectorized program performs several at once, usually by executing the same instruction (or batch of instructions) on different data, or by applying one instruction to an array or vector. Listed below are several vectorized operations used all the time in R. They look mundane, but each of them is essentially the same operation applied to a batch of data at once, so each can be sped up through vectorization:

Vector indexing, such as: v[1:10]

Vector assignment, such as: v[1:10] <- seq(1, 10)

lapply, similar to the map function in Python: lapply(A, mean)

Matrix operations: A + B; A %*% B

Because a vectorized computation has no dependencies among the data it processes, vectorization is a natural forerunner of parallel computing, and an algorithm implemented with vectorization is necessarily a highly parallelizable one. So when writing R scripts, we should design algorithms with vectorization in mind and use loops as little as possible. Once your program is entirely or mostly vectorized, you not only gain time from the optimizations in the computer's hardware and software; if one day growing data volumes make computation a bottleneck, you can also parallelize the original algorithm very easily.
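The difference between the two styles can be sketched with a toy computation (squaring a vector; the vector length is an arbitrary choice). Both versions produce the same numbers, and system.time lets you compare their running time on your own machine.

```r
x <- as.numeric(1:100000)

# Loop version: one element per iteration.
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: one expression applied to the whole vector.
squares_vec <- x^2

# Same result either way; only the running time differs.
print(all.equal(squares_loop, squares_vec))
print(system.time(for (i in seq_along(x)) squares_loop[i] <- x[i]^2))
print(system.time(x^2))
```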

As mentioned, CRAN contains every kind of toolkit you can imagine, and that of course includes a number of parallel computing packages. They are summarized in the list of R packages related to high-performance computing [8].
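As one possible illustration of how an lapply-based computation can be parallelized, the sketch below uses mclapply from the parallel package. This is my own choice of tool, not one named in this article: parallel became part of the standard R distribution after this article was written. The data, the function work, and the core count are invented for the example.

```r
library(parallel)

# Four independent pieces of work (toy data for illustration).
chunks <- split(1:1000, rep(1:4, each = 250))
work <- function(v) sum(sqrt(v))

# Serial version.
serial_res <- lapply(chunks, work)

# mclapply forks worker processes. Forking is unavailable on Windows,
# so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
parallel_res <- mclapply(chunks, work, mc.cores = cores)

print(all.equal(serial_res, parallel_res))
```

Because each chunk is processed independently, swapping lapply for mclapply changes nothing in the algorithm itself, which is the property the vectorized style is meant to preserve.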

A more detailed description of vectorization and parallel computing in R can be found in one of my blog posts [9].

Writing interfaces and toolkits: the most useful package is the one you write yourself

One of the strengths of open source software is the contributions of a large number of practitioners. The most exciting thing about R, and a big reason to choose it as a working platform, is the vast and all-encompassing CRAN, where you can find almost every tool you can imagine for your analysis and research. It is in no way inferior to Perl's CPAN. R enjoys such strong third-party support partly because R itself is strong at statistical computing, and partly because developing an R extension is so easy that everyone who uses R as an everyday tool is tempted to write a package to meet the needs of their own work. If the package turns out well and many people need it, it can be submitted to CRAN. This is why CRAN is so large, but it is also why the quality of CRAN packages is uneven. In most cases, though, these packages will be your right-hand helpers, especially the well-known and widely used ones. And if a package does not do quite what you need, feel free to modify it, since it is open source.

Here is a simple process for making an R extension package:

1, build the package structure: Create a new directory mypkg, which is also the package name, and create a few directories and files inside it, as described in step 2. R comes with a function, package.skeleton, that can generate these directories automatically, but it needs some existing function objects or files to start from. It is not used here so that the whole process can be explained.

2, directory description: The DESCRIPTION file, the man directory, and the R directory are required; the rest are optional. The DESCRIPTION file holds the package's meta-information. The R directory holds R script files, whose functions can be exported as the package's function library for outside use. If you want to ship some experimental data with the package, put it in the data directory, commonly in CSV format; it can then be loaded from the R terminal with the data function. Here it is left empty. The man directory holds R help documents, which must follow a certain format. It is also left empty here, which produces some warnings when the package is built; they can be ignored. The src directory holds C/C++/Fortran source code, together with a Makefile or Makevars file to guide the compiler. Here it is empty. R can also run some actions when the package is loaded; this is left empty here as well.

3, add a function: For the contents of the DESCRIPTION file, refer to the corresponding file of any existing R package and replace the information with your own. As an illustration, add a simple R function: create a file named helloword.R in the R directory with the following content:

helloword <- function(x, y)
{
  return(x * y)
}

4, install: Run R CMD build mypkg on the command line to build the installation package mypkg_0.1.tar.gz, where the number is the version number I wrote in DESCRIPTION. Then run R CMD INSTALL mypkg to install the package into the system.

5, test: Run R to enter the R terminal. Load the newly made package with library(mypkg); search() then shows that mypkg has been loaded. Run helloword(2, 3) in the R terminal; it returns 6, so the test succeeded.
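The steps above can be sketched as a short R script that lays out the package skeleton in a temporary directory. All DESCRIPTION field values are placeholders of my own. The NAMESPACE file is an addition that the steps above do not mention; as far as I know, current R releases expect one, so treat it as an assumption.

```r
# Create the package directory with its required subdirectories.
pkg <- file.path(tempdir(), "mypkg")
dir.create(file.path(pkg, "R"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pkg, "man"), showWarnings = FALSE)

# DESCRIPTION: package meta-information (placeholder values).
writeLines(c(
  "Package: mypkg",
  "Version: 0.1",
  "Title: A Minimal Example Package",
  "Description: Placeholder text for illustration only.",
  "Author: Your Name",
  "Maintainer: Your Name <you@example.com>",
  "License: GPL-2"
), file.path(pkg, "DESCRIPTION"))

# NAMESPACE: export the one function (assumed to be needed by current R).
writeLines("export(helloword)", file.path(pkg, "NAMESPACE"))

# The function file from step 3.
writeLines(c(
  "helloword <- function(x, y)",
  "{",
  "  return(x * y)",
  "}"
), file.path(pkg, "R", "helloword.R"))

# Steps 4 and 5 run from a shell in the directory that contains mypkg:
#   R CMD build mypkg
#   R CMD INSTALL mypkg

# Quick check of the function itself, without installing the package.
source(file.path(pkg, "R", "helloword.R"))
print(helloword(2, 3))
```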

With that, a package with some functionality is done. Simple, isn't it? If you need anything more, just add files to the R or src directory, then rebuild and reinstall. Calling between R and C/C++ is also very convenient. Space does not allow a detailed explanation here; for more, see several of my blog posts [10-13].

The development of R in China

R is not yet widely used in China; it is mainly used in universities and research institutions. In recent years, however, as R has grown in reputation, more and more industry practitioners in various fields have chosen R as their working platform. Capital of Statistics [14] is a gathering place for R users in China. The third R language conference was held at Renmin University in June this year. Judging by the attendees of the three conferences so far, R's Chinese user base has been growing strongly, and the fields its users come from are increasingly diverse. The attendees of the third conference can be seen in the meeting minutes [15]. I believe that as data mining becomes widely accepted by companies, R will also come closer to industry in all fields.

Application of R at Douban

For a while I had been looking for a tool that sits between MATLAB and systems languages (such as C and Fortran), hoping for something with the high performance of a systems language that is also convenient for the daily work of data mining. That is how I found R. It is not only a language but an ideal computing environment. On the one hand, it makes it easy to build, debug, and evaluate prototypes of new algorithms; on the other, it does not cost me the computational advantages of a systems-level language, and it even gives me more choices for implementing parallel computing. I now use R to write our own toolkits for offline work such as algorithm prototyping, matrix operations, and parallel algorithms, providing the underlying support for higher-level applications such as similarity computation and recommender systems.

An example of collaborative filtering recommendation written in R

Finally, I end this article with an example that implements collaborative filtering recommendation in R. Collaborative filtering is a basic algorithm for recommender systems; for details, see [16]. Thanks to heavy use of vectorized computation (including various matrix operations), the implementation is quite concise. It may well be the collaborative filtering recommendation engine with the least code in history.

# Input: one (user_id, subject) record per line, with a header row
data <- read.table('data.dat', sep = ',', header = TRUE)
user <- unique(data$user_id)
subject <- unique(data$subject)
uidx <- match(data$user_id, user)
iidx <- match(data$subject, subject)
# User-by-subject 0/1 matrix
M <- matrix(0, length(user), length(subject))
i <- cbind(uidx, iidx)
M[i] <- 1
# Normalize each column to unit length, then compute subject similarities
mod <- colSums(M^2)^0.5
MM <- M %*% diag(1/mod)
S <- crossprod(MM)
# Score every subject for every user and sort the scores per user
R <- M %*% S
R <- apply(R, 1, FUN = sort, decreasing = TRUE, index.return = TRUE)
# Keep the k best-scoring subjects for each user
k <- 5
res <- lapply(R, FUN = function(r) return(subject[r$ix[1:k]]))
write.table(paste(user, res, sep = ':'), file = 'result.dat',
            quote = FALSE, row.names = FALSE, col.names = FALSE)

I will not comment on the code here; readers interested in how it works can read [16].
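For readers who want to see what the intermediate matrices contain, here is a small in-memory sketch of the same computation. The five records are invented for illustration, and sweep is my substitution for the diag multiplication so that row and column names are kept; the result is numerically the same.

```r
# Toy records, invented for illustration.
data <- data.frame(
  user_id = c("u1", "u1", "u2", "u2", "u3"),
  subject = c("a",  "b",  "a",  "c",  "a"),
  stringsAsFactors = FALSE
)

user    <- unique(data$user_id)
subject <- unique(data$subject)

# User-by-subject 0/1 matrix, filled through a two-column index matrix.
M <- matrix(0, length(user), length(subject), dimnames = list(user, subject))
M[cbind(match(data$user_id, user), match(data$subject, subject))] <- 1

# Scale every column to unit length; crossprod then yields cosine similarities.
mod <- sqrt(colSums(M^2))
MM  <- sweep(M, 2, mod, "/")   # same values as M %*% diag(1 / mod)
S   <- crossprod(MM)           # subject-by-subject similarity
R   <- M %*% S                 # user-by-subject scores

print(round(S, 3))
print(round(R, 3))
```

Subjects b and c never share a user, so their similarity is 0, while a shares one of its three users with each of them. Note that the scores in R include subjects a user has already seen, exactly as in the script above.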

References:

[1] R official website

[2] GoogleTechTalks on YouTube, teaching data mining with R: Statistical Aspects of Data Mining (Stats 202)

[3] The New York Times report: Data Analysts Captivated by R's Power

[4] R Graphics

[5] arules package

[6] e1071 package

[7] Vectorization

[8] R packages related to high-performance computing

[9] Vectorization and parallel computing

[10] The simplest way to make an R package

[11] R and C interface: calling a C shared library from R

[12] R object structure: using .Call to write R packages more flexibly

[13] Core guidelines for writing C extensions for R packages

[14] Capital of Statistics

[15] Meeting minutes of the third R Language Conference (Beijing venue)

[16] Probably the least-coded collaborative filtering recommendation engine in history
