R you ready? -- An elegant, powerful environment for statistical analysis and graphics in the big data era

Author's note: This article is based on material presented at the "Big Data Technology Conference" held by CSDN in September, and was originally published in "Programmer" magazine.

1. History

R (R Development Core Team, 2011) was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. Its lexical semantics and its syntax derive from Scheme and S, respectively, and R is generally regarded as a dialect of the S language (John Chambers, Bell Labs, 1972). R is a free, effective language and environment for statistical computing and graphics. It provides a wide range of statistical and graphical techniques, including linear and nonlinear models, statistical tests, time series analysis, classification, clustering, and more. We prefer to think of R as an environment in which many classical and modern statistical techniques have been implemented.

Figure 1: Ross Ihaka and Robert Gentleman became colleagues at the University of Auckland in 1992. Later, to make it easier to teach an introductory statistics course, the two developed a language. Because both of their first names begin with R, R became the name of the language.

S is the predecessor of R, and most code written in S runs in the R environment without modification. From this perspective, the two languages are almost equivalent. S was born in the 1970s in the statistics research department of Bell Laboratories, under the leadership of John M. Chambers. Its history reflects the evolution of modern statistical analysis methods (Xie Yihui, Zheng Bing, 2008):

  • In 1975-1976, the statistics research department at Bell Laboratories used a well-documented set of Fortran libraries for statistical research, known as SCS (Statistical Computing Subroutines);
  • At that time, commercial statistical software worked in batch mode and output everything related to a problem in one pass. A single run could take several hours, and users could not modify the commercial programs in any way. The statisticians at Bell Laboratories needed flexible, interactive data analysis, so SCS became very popular there;
  • However, the statisticians found that doing statistical analysis with SCS required a great deal of Fortran programming, and the time spent programming was out of proportion to the analysis results. Gradually a consensus emerged: statistical analysis should not require writing Fortran programs!
  • Therefore S, a complete high-level language system, was born as a way to interact with SCS;
  • The idea behind S, in the words of its inventor John Chambers, is "to turn ideas into software, quickly and faithfully."

In 1993 the license for the S language was acquired by MathSoft, and S-PLUS became the company's main data analysis product. Because S-PLUS inherited the excellent pedigree of S, it was widely used by statisticians around the world. However, R became a GNU project in 1997, and a large number of outstanding statisticians joined its development. As R grew more and more powerful, S-PLUS users gradually moved over to R, which shares the same roots. John M. Chambers, one of the inventors of S, eventually became a member of the R core team. S-PLUS, an outstanding piece of software, changed hands several times and finally ended up with TIBCO, but that is a later story.

John Chambers has remained dedicated to the development of R and is still an active R developer. In the first issue of The R Journal in 2009, he characterized R as follows:

  1. An interface to computational procedures of many kinds;
  2. Interactive, hands-on in real time;
  3. Functional in its model of programming;
  4. Object-oriented, "Everything is an object";
  5. Modular, built from standardized pieces; and,
  6. Collaborative, a world-wide, open-source effort.
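
Two of these facets, the functional model and "everything is an object", are easy to see at the console. The following is only a minimal sketch: the helper names compose and sqrt_then_round are made up for this illustration, and cars is one of the example datasets shipped with R.

```r
# Functional: functions are ordinary values that can be passed in and returned.
# 'compose' and 'sqrt_then_round' are hypothetical names used only here.
compose <- function(f, g) function(x) g(f(x))
sqrt_then_round <- compose(sqrt, round)
roots <- sapply(c(2, 9, 20), sqrt_then_round)

# Object-oriented: a built-in function, a fitted model and a plain vector
# are all objects, and each one has a class.
fit <- lm(dist ~ speed, data = cars)
classes <- c(class(sum), class(fit), class(roots))
print(roots)
print(classes)
```

Generic functions such as print() and summary() look at the class of their argument to decide what to do, which is why the same call behaves sensibly for a vector and for a fitted model.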

Of course, the characteristics of R are hard to capture in a short article. Below I briefly describe the current state of R and its future.

2. Status Quo and Application

The development of R differs greatly inside and outside China. Internationally, R is already a standard in professional data analysis, but in China it still has a long way to go. This is partly because of the status of data-related disciplines, and partly because of weak awareness of copyright in China and the relative insularity of its academic community. So why has R been accepted by so many data analysts? There are many reasons:

2.1 Advantages and features

As its history shows, R is mainly a language developed by statisticians to solve problems in data analysis. It therefore has some distinctive advantages:

  • Cutting-edge algorithms, contributed by statisticians, that cover almost the entire field of statistics (more than 3,700 extension packages)
  • Open source (free, in both senses); it can be deployed on any major operating system, such as Windows, Linux, Mac OS X, BSD, and Unix, and has strong community support
  • A high-quality, wide-ranging platform for statistical analysis and data mining
  • Reproducible analysis (Sweave = R + LaTeX). By combining the analytical power of R with the typesetting quality of LaTeX, analysis reports can be generated automatically
  • Convenient extensibility
    • It can connect to databases such as Oracle, DB2, and MySQL through the corresponding interfaces
    • It interoperates with Python, Java, C, C++, and other languages
    • It can call web APIs, such as those of Google, Twitter, and Weibo
    • Most other statistical software can call R, such as SAS, SPSS, and Statistica
    • There are even direct commercial integrations, such as Oracle R Enterprise, IBM Netezza, R add-on for Teradata, SAP HANA, and Sybase RAP (Liu Sicheng, 2012)
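
To make the Sweave point in the list above concrete, here is a minimal sketch that writes a tiny .Rnw file (LaTeX with one embedded R chunk) and weaves it into a .tex file. The file name report.Rnw and the numbers are invented for this illustration. Sweave() is in the utils package that ships with R; turning the resulting .tex file into a PDF additionally needs a LaTeX installation, which is not attempted here.

```r
# Hypothetical file names, created in a temporary directory.
rnw <- file.path(tempdir(), "report.Rnw")
tex <- file.path(tempdir(), "report.tex")

# A minimal Sweave source: LaTeX text plus one R code chunk between <<>>= and @.
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<summary, echo=TRUE>>=",
  "x <- c(2, 4, 6, 8)",
  "mean(x)",
  "@",
  "The mean of x is \\Sexpr{mean(x)}.",
  "\\end{document}"
), rnw)

# Run the chunk and write a LaTeX file with the code, its output,
# and the \Sexpr{} expression replaced by its value.
utils::Sweave(rnw, output = tex, quiet = TRUE)
tex_lines <- readLines(tex)
```

Because the numbers in the report are computed when the document is woven, rerunning Sweave() after the data change regenerates the whole report.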

 

2.2 Honors

R has so many advantages in large part because it inherits the excellent pedigree of S. In 1998 the S language received the Software System Award from the Association for Computing Machinery (ACM). To date, S is the only statistical system among all statistical software to have received this award.

At the time, the ACM described S as follows:

  • It has forever altered the way people analyze, visualize, and manipulate data;
  • It is an elegant, widely accepted, and enduring software system.

We can also look at the list of ACM Software System Award winners over the years. These excellent software systems are closely tied to our daily lives:

  • 1983 UNIX
  • 1986 TeX
  • 1989 PostScript
  • 1991 TCP/IP
  • 1995 World-Wide Web
  • 1997 Tcl/Tk
  • 1998 S
  • 1999 The Apache Group
  • 2002 Java

In 2009 The New York Times published an article entitled "Data Analysts Captivated by R's Power", which focused on the rise of R in data analysis and set off a broad and heated debate between SAS and R users. In 2010 the American Statistical Association gave its first Statistical Computing and Graphics Award to R, in recognition of its wide impact on statistical applications and research.

2.3 Community and activities

As John Chambers noted above, R is also a community, and its offline activities are lively. Internationally, the useR! conference is held once a year in Europe or the United States. R users from all over the world attend to discuss applications of R and research results. For statistical computing specifically, the DSC meeting (Directions in Statistical Computing) is held every two years to discuss the application of R in statistical computing and the related theory. There are also R user groups in major cities, which make it easy for local R users to meet and exchange ideas.

In China, the Chinese R conference is held every year at two venues, Beijing and Shanghai. So far four conferences have been held at Renmin University of China and East China Normal University, both major centers of statistics. Over the years the talks have covered many fields, including medicine, finance, geographic information, statistical graphics, data mining, pharmaceuticals, high-performance computing, sociology, bioinformatics, and the Internet. Starting next year, Taipei will become the third city to host the Chinese R conference. The Taipei session, scheduled for June 2012, is already being planned.

2.4 Recognized by the industry

Each year the KDnuggets website runs polls on data analysis and data mining. In the August 2011 poll on the languages used for data mining, R ranked first among all languages (Figure 2). SQL, Python, and Java followed, each with its own strengths in particular areas. Within data mining, R and these languages complement one another.

The TIOBE Programming Community Index, which is calculated from internet search results, may be more representative of the popularity of programming languages. In the December 2011 ranking, R was still the most popular language in the field of statistics, ranking 24th (rating 0.522%), while SAS, with which it is often compared, ranked 31st (0.417%).

Figure 2: Although the sample of KDnuggets visitors is probably biased, it does represent the preferences of a certain group of people, and the top five languages are each representative of their own fields. Data source: http://www.kdnuggets.com/2011/08/poll-languages-for-data-mining-analytics.html

3. Challenges and the Future

Although R has many advantages, it is not omnipotent. It is, after all, a statistical programming language. For the sake of a general algorithm architecture and of speed, it was originally designed entirely around single-threaded, in-memory computation. In ordinary use this rarely matters, but under today's big data conditions the drawbacks of these two design decisions are becoming increasingly glaring. Fortunately, some excellent extension packages for R address these problems, for example:

  • snow: supports MPI, PVM, NWS, and socket communication, addressing the single-thread and memory limits;
  • multicore: suited to large-scale computing environments, mainly addressing the single-thread limit;
  • parallel: a standard package added in R 2.14.0 that integrates the functionality of snow and multicore;
  • R + Hadoop: runs R code on a Hadoop cluster or works with a Hive data warehouse;
  • RHIPE: a friendlier environment for running R code, addressing the single-thread and memory limits;
  • segue: uses Amazon Web Services (EC2).

The parallel package deserves special attention here. It is a new package that the R core team added to the standard distribution to address the problem of big data computation.
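
As a rough illustration of how parallel brings the two approaches together, the sketch below runs the same task once on a snow-style socket cluster and once with multicore-style forking. The task (a bootstrap mean), the helper name boot_mean, the worker count of 2, and the seed are all assumptions made for this example, not recommendations.

```r
library(parallel)

# Hypothetical task: the mean of one bootstrap resample of 'data'.
boot_mean <- function(i, data) mean(sample(data, replace = TRUE))

set.seed(1)
data <- rnorm(1000)

# snow-style: start 2 worker processes that communicate over sockets.
# This route works on every platform, including Windows.
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 42)  # reproducible random streams on workers
res_snow <- parLapply(cl, 1:100, boot_mean, data = data)
stopCluster(cl)

# multicore-style: fork the current R process. Forking is not available
# on Windows, where mc.cores must be 1.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
res_fork <- mclapply(1:100, boot_mean, data = data, mc.cores = cores)
```

Both calls return a list with one element per input, like lapply(), so existing lapply()-based code can often be parallelized with few changes. Note that this addresses the single-thread limit only; each worker still holds its data in memory.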

3.1 Misunderstandings

Many people think that because R is GNU open-source software, its use comes with "no guarantee." However, in the United States, results computed with R are accepted by the FDA (Food and Drug Administration). In addition, there are reports that R has very few bugs compared with commercial software (UCLA, 2006)!

The R core team is extremely cautious about adding new features to R. For example, work on Cairo graphics began in 2007, but it was not included in the standard R distribution until the previous major version (2011). The byte-compiler was incubated for nearly 12 years, from 1999 to 2011 (Ripley, 2011). From this point of view, the code quality of R and the credibility of its results are well assured.

Of course, this applies to the standard R distribution and says nothing about the quality of every extension package. After all, the more than 3,700 extension packages are a mixed bag. Although there are some excellent packages (such as Rcpp, RODBC, VGAM, and rattle), there are inevitably some of poor quality.

3.2 Thoughts on application

R is not a language that everyone comes across. It is relatively niche, and many people may not know what R is for even after they have encountered it. Those who do set out on this path often run into difficulties in applying it. For individual learners:

  • Although R was designed so that statistical algorithms could be used without a great deal of programming, basic programming ability is still needed. This is undoubtedly a hurdle for people without a computing background;
  • Many people say that the learning curve of R is steep. In my experience over the years, however, what is steep is not R itself but the statistical knowledge behind it, which is hard to master in a short time.

From the perspective of commercial use in companies, there are also some unavoidable problems:

  • The first is how to calculate the cost of human resources;
  • Next is the cost of the software. R is free software and can be downloaded anywhere at any time, so how to account for its cost is a problem for enterprises;
  • There is no official or institutional certification of R skills, so "proficient in R" on a resume may not mean anything;
  • In fact, even without the previous two problems, it is not easy for enterprises to find people with R skills;
  • For companies that have already done a lot of work with other software (such as SAS), the cost of switching is very high;
  • Technical support is also a problem.

4. Conclusion

Although R was born in the statistics community and serves data, data now reaches into every industry, and R has gone far beyond the scope of statistics. I believe that many more people will join the R community in the near future.

References
  • Xie Yihui, Zheng Bing (2008). Historical background, development history, and current situation of the R language. 1st China R Conference.
  • Liu Sicheng (2012). Commercial database support for the R language. http://www.bjt.name/2012/04/r-language-enterprise.
  • R Development Core Team (2011). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
  • Ripley, B. (2011). The R development process. Technical Report, Department of Statistics, University of Oxford.
  • TIOBE (2011). http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html.
  • UCLA (2006). R relative to statistical packages. Technical Report, UCLA.


About Liu Sicheng

Focuses on the application of R in statistical analysis, data mining, and data visualization. Home page: http://bjt.name
