Do I need to learn R? Four good reasons to try the open source data analysis platform

Source: Internet
Author: User

You may have heard of R. Perhaps you read about it in an article like Sam Siewert's "Big Data in the cloud." You probably know that R is a programming language and that it has something to do with statistics, but is it right for you?

Why choose R?

R does statistics. You can see it as a competitor to analytical systems such as SAS Analytics, as well as to simpler packages such as StatSoft Statistica or Minitab. Many professional statisticians and methodologists in government, business, and the pharmaceutical industry have spent their entire careers in IBM SPSS or SAS without writing a single line of R code. So in part, the decision to learn and use R is a matter of corporate culture and how you want to work. I use a variety of tools in my statistical consulting practice, but most of my work is done in R. The following examples show why I use R:

R is a powerful scripting language. I was recently asked to analyze the results of a scoping study. The researchers had examined 1,600 research papers and coded their contents on a number of criteria, in fact a large number of criteria, with multiple options and branching. Their data, once flattened into a Microsoft Excel spreadsheet, contained more than 8,000 columns, most of them empty. The researchers wanted totals under various categories and headings. Messy data like this calls for the resources of a programming language, and R, with Perl-like access to regular expressions for handling text, is up to the task. Although SAS and SPSS supply scripting languages for tasks that their drop-down menus do not anticipate, R was written as a programming language from the start and is the better tool for this purpose.
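The study's actual coding scheme is not reproduced here, but a minimal sketch of this kind of job, with hypothetical column names and a tiny made-up data frame, looks like this: use a regular expression to pull out all the columns under one heading, then count the non-empty cells.

```r
# Sketch: tallying coded entries in a very wide, mostly empty data frame.
# Column names and data are illustrative, not from the study described above.
papers <- data.frame(
  theme_climate = c("x", "",  "x", ""),
  theme_energy  = c("",  "",  "x", "x"),
  method_survey = c("x", "x", "",  ""),
  stringsAsFactors = FALSE
)

# grep() with a regular expression selects every column under one heading
theme_cols <- grep("^theme_", names(papers), value = TRUE)

# Count non-empty cells per column, then the total for the heading
counts <- colSums(papers[theme_cols] != "")
total_theme <- sum(counts)
counts
total_theme
```

With 8,000 real columns the pattern is the same; only the regular expressions grow.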

R stays at the forefront. Many new developments in statistics appear first as R packages and only later make their way into commercial platforms. I recently received data from a medical study of patients' recall. For each patient, we have the number of treatments the doctor recommended and the number of items the patient actually remembered. The natural model is the beta-binomial distribution. It has been known since the 1950s, but procedures for relating the model to covariates of interest are recent. Data like these are usually handled with generalized estimating equations (GEE), but GEE methods are asymptotic and assume large samples. I wanted a generalized linear model with a beta-binomial distribution. An up-to-date R package estimates that model: Ben Bolker wrote betabinom. SPSS does not.
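The study's data and the package mentioned above are not reproduced here, but as a sketch of the underlying idea, the beta-binomial likelihood can be maximized directly in base R. Everything below is illustrative: the data are simulated, and the shape parameters are estimated on the log scale to keep them positive.

```r
# Sketch: beta-binomial maximum likelihood in base R, on simulated data.
# n = treatments recommended, k = items recalled (both made up here).
set.seed(1)
n <- rep(10, 50)
k <- rbinom(50, size = n, prob = rbeta(50, 2, 3))

# Beta-binomial pmf: choose(n,k) * B(k+a, n-k+b) / B(a,b), for shapes a, b > 0.
negloglik <- function(par) {
  a <- exp(par[1]); b <- exp(par[2])   # log scale guarantees a, b > 0
  -sum(lchoose(n, k) + lbeta(k + a, n - k + b) - lbeta(a, b))
}

fit <- optim(c(0, 0), negloglik)      # Nelder-Mead by default
exp(fit$par)                          # estimated shape parameters a, b
```

A regression version would replace the constant shapes with functions of covariates; that is exactly the step a dedicated package automates.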

Integrated document publishing. R integrates seamlessly with the LaTeX document preparation system, which means that statistical output and graphics from R can be embedded in publication-quality documents. It's not for everyone, but if you fancy writing a book about your data analysis, or simply don't want to copy results into a word-processing document, the shortest and most elegant route runs through R and LaTeX.
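A minimal sketch of that workflow uses Sweave, the R-to-LaTeX tool that ships with R; the file name and contents here are illustrative:

```latex
% report.Rnw -- a minimal Sweave document (illustrative)
\documentclass{article}
\begin{document}

<<fig=TRUE>>=
x <- rnorm(100)   # code chunks run when the document is built
hist(x)           # the figure is embedded automatically
@

The sample mean was \Sexpr{round(mean(x), 2)}.

\end{document}
```

Running `R CMD Sweave report.Rnw` executes the chunks and produces `report.tex`, which `pdflatex` turns into the final PDF; re-running the pipeline updates every number and figure.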

It costs nothing. As the owner of a small business, I like the fact that R is free. Even for a larger enterprise, it is good to know that you can bring someone in temporarily and immediately sit them down at a workstation with state-of-the-art analytical software, with no worries about the budget.

What is R, and what is it for?

An explanation in 140 characters

R is an open source implementation of S, a programming environment for data analysis and graphics.

As a programming language, R is similar to many others. Anyone who has written code will find much that is familiar in R. What sets R apart is the statistical philosophy it embodies.

A statistical revolution: S and exploratory data analysis

Computers have always been good at computing, once you have written and debugged a program to execute your algorithm. But in the 1960s and 1970s, computers were poor at displaying information, especially graphics. These technological limits, combined with trends in statistical theory, meant that statistical practice and the training of statisticians focused on model building and hypothesis testing. One imagined a world in which researchers set hypotheses (often agricultural ones), constructed carefully designed experiments (at an agricultural station), fitted the model, and ran the test. Spreadsheet- and menu-driven programs such as SPSS reflect this approach. Indeed, the first versions of SPSS and SAS Analytics consisted of subroutines that could be called from a (Fortran or other) program to fit and test one model from the model toolbox.

Against this prescriptive backdrop, John Tukey introduced the concept of exploratory data analysis (EDA), which landed like a stone through a glass roof. Today it is hard to imagine analyzing a data set without examining skewness and outliers with a box plot, or checking the normality of a linear model's residuals with a quantile plot. These ideas, introduced by Tukey, now appear in any introductory statistics course. But it was not always so.

From "Graphical Methods for Data Analysis"

"In any serious application, you should analyze the data in several ways, construct a number of plots, and perform multiple analyses, letting the results of each step suggest the next. Effective data analysis is iterative." (John Chambers)

EDA is less a theory than an approach, one inseparable from a few rules of thumb:

Whenever possible, you should use graphics to identify features that are of interest.

Analysis is incremental: fit one model, then use the results to suggest the next.

Check model assumptions graphically, and flag outliers.

Use robust methods to guard against violations of distributional assumptions.
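A short sketch of these rules in action, using R's built-in cars data set (stopping distance against speed), shows how little code the cycle takes:

```r
# Sketch: the EDA rules of thumb applied to the built-in cars data set.
plot(cars$speed, cars$dist)          # rule 1: look at the data graphically first

fit <- lm(dist ~ speed, data = cars) # rule 2: fit one model...
abline(fit)                          # ...and let the picture suggest the next step

res <- residuals(fit)
qqnorm(res); qqline(res)             # rule 3: check assumptions graphically
boxplot(res)                         # ...and flag outliers

median(res); mad(res)                # rule 4: robust summaries resist outliers
```

Each plot feeds the next decision, which is exactly the iteration Chambers describes above.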

Tukey's approach set off a tide of development in new graphical methods and robust estimation. It also inspired new software frameworks better suited to exploratory work.

The S language was developed by John Chambers and colleagues at Bell Labs as a platform for statistical analysis, particularly of the Tukey variety. The first version, for internal use at Bell Labs, was developed in 1976, but it did not take a form resembling the present one until 1988, at which point the language also became available to users outside Bell Labs. Every aspect of the language serves the "new model" of data analysis:

S is an interpreted language operating in a programming environment. Its syntax resembles C's, minus the hard parts: S handles memory management and variable declarations, for example, so users never have to write or debug them. The low programming overhead lets users run many analyses on the same data set quickly.
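For instance, a complete mini-analysis in R, S's open source descendant, needs no declarations or allocation calls at all (the numbers here are arbitrary):

```r
# Sketch: assign and go -- no declarations, no memory management.
x <- rnorm(1000)      # vector created on assignment; storage handled by R
summary(x)

y <- x[x > 0]         # a new vector of a different length, no allocation code
length(y)
```

Every line above would need type and size bookkeeping in C; in an interpreted environment it is one expression each.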

From the beginning, S was designed with high-level graphics in mind, and you can add to any open graphics window. You can easily highlight points of interest, query their values, smooth a scatter plot, and so on.

Object orientation was added to S in 1992. In a programming language, objects structure data and functions to match the user's intuition, and the human mind is naturally object-oriented, especially in statistical reasoning. Statisticians work with frequency tables, time series, matrices, spreadsheets holding various data types, fitted models, and so on. In each case, the raw data carry attributes and expectations: a time series, for example, consists of observations and time points. And for each data type, certain standard statistics and plots are expected. For a time series, I might plot the series over time and an autocorrelation plot; for a fitted model, I might plot fitted values against residuals. S supports objects for all of these concepts, and you can create new object classes as needed. Objects make it easy to move from conceptualizing a problem to implementing it in code.
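A brief sketch of this at work in R: the same generic function, plot(), dispatches on the class of its argument, so each data type gets its expected picture.

```r
# Sketch: one generic, many methods -- dispatch on the object's class.
tser <- ts(cumsum(rnorm(60)), start = 2020, frequency = 12)
plot(tser)            # time-series method: values against time
acf(tser)             # autocorrelation plot, as described above

fit <- lm(dist ~ speed, data = cars)
plot(fit, which = 1)  # lm method: residuals against fitted values

class(tser)           # the class attribute is what drives the dispatch
class(fit)
```

User-defined classes join the same mechanism: write plot.myclass() and plot() will find it.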

