What is the role of Python in data analysis compared with R, SAS, and SPSS?
Source: Internet
Author: User
Reply content:

I have used all of these, so let me answer:
SPSS asks very little of its users: you only need to click through menus. There is a programming (syntax) window, but it is generally not used. Most users have had some statistical training but do not need advanced analysis capabilities. It is widely used in market research, and statistics programs generally require it.
Many of SAS's well-established procedures are FDA-certified and well trusted. Its strength is authority; its weaknesses are inflexibility, slow algorithm updates, and high cost. The syntax is strange because it is not a programming language in the traditional sense, so loops and custom algorithms are inconvenient. It is also not a math language like MATLAB, so mathematical operations are troublesome unless you buy the IML module. Pharmaceutical companies use it anyway because of its authority (FDA certification is what counts), and bank risk control departments use SAS as well. Another strength is that the built-in PROC SQL handles large data well, though to be honest I prefer to use MySQL directly.
Next, R. It is open-source, so new theories are implemented quickly. Data processing is very convenient, especially with data frames and lists. Biomedicine and academic research like to use R to solve problems, because many non-IT people would otherwise face a lot of programming trouble: if we want to sort data, do we really start by writing bubble sort? At first I was told that R is a simplified MATLAB; only later did I learn that R is more lightweight and easier to learn. It is open-source and free, and it runs on Linux or Windows without many problems. R can call C, which greatly speeds up loops; for Monte Carlo work that is a godsend! I have watched colleagues struggle to connect MATLAB to C. In short, if you have your own ideas and need to program them yourself, I strongly recommend R. By the way, I learned R on a trading floor. Needless to say, R is also very popular in front-office finance right now.
Finally, Python. I only started in the last three months, and it was painful at the beginning! I have to say that data processing in pandas is not as convenient as in R, but I have gotten used to it. The advantage of Python is that it can do many things, not just statistics, so it applies more widely. Mathematical modeling is convenient too, with syntax similar to MATLAB. The win32 module works with Office. Python supports both stand-alone scripts and large-scale development.
In addition, Python has many applications in finance (more than R), and quant departments all use it. I even interviewed at a hedge fund where everyone, IT included, uses Python. After all, if you write algorithms in R they will be much slower. As for C++, most IT people do not implement the mathematics department's algorithms well, and the mathematics department's C++ is not very good either, so Python is useful.
In short, R and SAS are the more professional statistical software and are essential for statistics students.
SPSS is a more mainstream statistical package for small tasks such as questionnaire analysis and simple regression. Python is not statistical software but a language that can be used for all kinds of tasks. Stata sits between SPSS and SAS in how much programming it involves; I found it uncomfortable to use.
In addition, only R and Python are open-source, and being open-source matters more than being free. Open-source software is maintained and developed by many people, so new theories and new requirements can be put into practice very quickly. The risk is that results may be wrong (though errors are usually corrected soon). So if you need guaranteed correctness, use paid software, where you can take the vendor to court (by the way, Revolution R is a paid, guaranteed version of R). If you like more flexibility, use open-source software (by the way, there is also Octave, but I still do not like MATLAB-style syntax). If you only need very simple statistics, you may not even need SPSS.
If you only do statistics or work on your own, use R. If you are at a company that needs to build a platform for everyone to use, and your work involves statistics, use Python.
In fact, R can also connect to SQL and C++. The key is to become proficient in one tool; then you will find that everything else is secondary...
Actually, as a heavy user of both R and Python, I prefer R... but all of the company's platforms have been replaced with Python...
Sorry for rambling. I only meant to answer the question and ended up writing this much, and I was careless with punctuation, using spaces in places...

Unlike R, Python is a general-purpose language, and statistics are mostly implemented through third-party packages.
Specifically, these are the Python packages I commonly use for statistics:
1. NumPy and SciPy. These two packages are an important reason why Python has a place in data analysis. NumPy wraps basic matrix and vector operations, while SciPy builds richer functionality on top of NumPy; for example, common statistical distributions and algorithms can be found quickly in SciPy.
2. Matplotlib. This package is mainly used for data visualization. It is powerful, and the generated figures reach print quality, so it shows up often at academic conferences. Because it is built on Python, it is more customizable than other graphics libraries. Another advantage is that it supports interactive data analysis and lets you zoom charts dynamically, which makes it well suited to ad hoc analysis.
3. scikit-learn. A very useful machine learning library for quick prototyping. It wraps almost all the classic algorithms (neural networks may be the only exception, but Pylearn2 fills that gap) and is extremely easy to use.
4. The Python standard library. This mainly reflects Python's strength in string handling. With its many built-in string functions and good support for regular expressions, Python is very well suited to processing text.
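As a small illustration of item 4, the sketch below uses only the standard library (`re` and `collections.Counter`) to tokenize text and count word frequencies. The sample text and the helper name `word_counts` are made up for this example.

```python
import re
from collections import Counter


def word_counts(text):
    """Return a Counter of lowercase word frequencies in `text`."""
    # \w+ matches runs of letters, digits, and underscores.
    words = re.findall(r"\w+", text.lower())
    return Counter(words)


# Hypothetical sample text, not taken from any real dataset.
sample = "Python is a glue language. R is a statistics language."
counts = word_counts(sample)
print(counts.most_common(3))
```

`Counter.most_common` returns the most frequent words first, which is often enough for a quick look at keyword statistics before any heavier tooling is needed.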
This basically covers daily use. Symbolic computation is also supported by powerful third-party libraries such as SymPy and Theano. In summary, Python offers the most comprehensive functionality of the tools you listed. However, that functionality is scattered across third-party libraries and is not organically integrated, so the learning cost is high.

Python is faster than R. Python can directly process gigabytes of data; R cannot. To analyze data with R, you must first reduce big data to small data in a database (through GROUP BY) before handing it to R. R therefore cannot analyze detailed behavioral records directly, only aggregated results. In that sense, Python = R + SQL/Hive.
R's advantage is its all-encompassing set of ready-made statistical functions, especially in time series analysis (mainly used in financial analysis and trend forecasting), where both classic and cutting-edge methods have packages ready for direct use. Python is much poorer in this regard.
Python's advantage lies in being a glue language: algorithms written in C at the lower layer are wrapped in Python packages and perform very well. (A decision tree analysis over 500,000 users in Python's Orange Canvas data mining package returned results in 10 seconds; in R it ran for several hours with 8 GB of memory fully occupied.)
In general, Python is a well-rounded language that can be used in every area, while R stands out in statistics. But data analysis is not just statistics: it also covers data collection, data processing, data sampling, clustering, complex data mining algorithms, data modeling, and so on. Once the data reaches the MB scale and beyond, R has a hard time, while Python is basically up to the job.
Add:
Python has a dedicated data analysis package, pandas, which provides SQL-like functionality. However, pandas loads all data into memory. If the data is too large (more than 2 GB), you need to analyze it in chunks, or use PyTables/h5py to convert it to HDF5 files on disk for analysis.
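The chunked approach can be sketched roughly as follows. This assumes pandas is installed; the tiny in-memory CSV, the column names `user` and `amount`, and the helper `total_by_user` are all invented for illustration and stand in for a file too large to load at once.

```python
import io

import pandas as pd

# Hypothetical data standing in for a file too large to load at once.
csv_data = io.StringIO(
    "user,amount\n"
    "alice,10\n"
    "bob,5\n"
    "alice,7\n"
    "carol,3\n"
    "bob,1\n"
)


def total_by_user(source, chunksize):
    """Sum `amount` per `user`, reading `source` `chunksize` rows at a time."""
    totals = None
    # With chunksize set, read_csv yields DataFrames instead of one big frame.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # SQL equivalent: SELECT user, SUM(amount) ... GROUP BY user
        partial = chunk.groupby("user")["amount"].sum()
        # Merge this chunk's partial sums into the running totals.
        totals = partial if totals is None else totals.add(partial, fill_value=0)
    return totals.astype("int64").sort_index()


totals = total_by_user(csv_data, chunksize=2)
print(totals)
```

Only one chunk is held in memory at a time, so peak memory depends on `chunksize` rather than on the file size. This works for aggregations that can be combined from partial results (sums, counts); other statistics need a different strategy.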
In addition, on Windows I recommend WinPython, which bundles the packages mentioned above. Python(x,y) is even richer, but unfortunately it is 32-bit only.
SAS and SPSS are commercial data analysis software that I have never used. I have used Python and R in depth, and these are my impressions:
About R
Advantages: a wide range of packages, especially for time series analysis and some cutting-edge algorithms.
Disadvantage: slow.
======================================
PS: There is a fix for R's slowness. For more information, see: speeding up R by replacing Rblas
About Python
Advantages: an industrial-strength programming language with a wide range of uses and powerful functionality. With NumPy linked against MKL, its data analysis performance can reach the limit of a single machine. It has relatively rich implementations of classic algorithms.
Disadvantages: after all, it is a group of programmers writing statistical software, so it is neither comprehensive nor intuitive. Cutting-edge algorithms are lacking, and most time series analysis functions are not built in.
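To make the NumPy point concrete, here is a minimal sketch that computes the same mean with an explicit Python loop and with a vectorized NumPy call. Whether NumPy is linked against MKL depends on how it was installed, so this example assumes nothing about that and reports no timings. The function names are invented for the illustration.

```python
import numpy as np


def mean_loop(values):
    """Arithmetic mean computed with an explicit Python loop."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


def mean_vectorized(values):
    """Same result, but the loop runs inside NumPy's compiled code."""
    return float(np.mean(values))


# Deterministic synthetic data: 0.0, 1.0, ..., 999999.0
data = np.arange(1_000_000, dtype=np.float64)

print(mean_loop(data), mean_vectorized(data))
```

Timing the two functions on your own machine (for example with `timeit`) is the simplest way to see how much the vectorized path helps in your setup.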
========================================================
A simple introduction: Python econometrics series
Here is a small case of using Python: statistics on all sfgg blogs
Python provides a complete toolchain for data collection, cleaning, and analysis, whereas R is only an analysis tool.
========================================================================
My advice to the original poster is to learn both Python and R. Python is really comfortable to learn, and I know it very well. As for R, it is what you reach for when everyone else is stuck. The KDnuggets rankings show the situation for 2012 and 2013.
The result is self-evident; see the KDnuggets poll (Languages used for analytics/data mining/data science).
However, for a specific industry or business scenario, the situation will be different.
In general, compared with R/SAS/SPSS, Python is a (flexible) programming language: mention Python and the first things that come to mind are a computer language and scientific computing, whereas R/SAS/SPSS are more statistical. From my own experience: my work is mainly data analysis and modeling, so I use Python mainly to process data, for example word segmentation and counting on name keywords, and SPSS mainly to analyze the processed data and generate reports.

SAS still holds a large market share in medical statistics. However, the standalone edition is only available on Windows. It can be driven by writing code or by clicking buttons, though the buttons fall far short of code. SAS Studio (the online version) can be used on Mac or Windows, and it has many well-written plotting and tabulation functions that only need to be called, so it is very simple to operate. Many public health programs in the United States include SAS in their curriculum.
As far as I know, SPSS is widely used in China. Research physicians and public health professionals generally use SPSS for processing and analysis. SPSS is operated mainly by clicking icons, which is time-consuming, so it suits users who do not need to process complex data.
Stata can be operated through icons or by writing code. Different versions limit the number of rows and columns, and only one table can be worked on at a time, which is not very convenient.
As open-source software, R has many good packages to call and is strong at programming and modeling. However, its data handling, such as renaming variables and cleaning data, is not particularly good, or at least not as convenient as in SAS.
Matlab is widely used for modeling.
Python has many advantages for the early-stage processing of big data. SAS is slow on GB-scale data, while Python's NumPy package can finish batch processing within seconds. In addition, regular expressions in Python are easier to use than those in SAS.
To sum up, as the leading professional tool in pharmaceutical data analysis, SAS is user-friendly from interface design to programming language, and I recommend it! If the data is large, I recommend Python for cleaning!

I have used both R and Python's NumPy library, and I work in a data role at a big company. I now use R; here are my reasons.
1. Ease of single-machine deployment: R is strong here, and there is no need to install Eclipse for R (joking). The two are almost the same, but R is somewhat more convenient. You do not need to hunt for libraries on the internet; you just run install.packages. With Python you have to find packages yourself, which is sometimes very tiring.
2. File reading speed: R is faster. I do not know how earlier posters concluded that R reads files slowly. I tested read.csv in R on a file under 1 GB against a line-by-line reader I wrote myself in Python, and R read the file several hundred times faster or more... The reason I rarely use Python is that I often deal with GB-scale data, and reading it in Python is too slow... As for speeding up file access in R: R can save in-memory data as an .RData image file. I do not know how to speed up saving and loading in Python, but storing CSV files is bound to be slow...
3. Performance: I think R is more powerful. Some people say Python's NumPy is fast, but I have not seen it... And what about concurrency, the real killer feature? R can run in parallel with doParallel. Does Python need to build an exe and launch n batches by hand? Not to mention the disconnection and performance loss that would cause... Most importantly, the programming becomes disjointed...
4. Ease of use: clustering, BP neural networks, genetic algorithms, and many other algorithms can be called directly from R libraries, so if performance is good enough you do not have to write your own. The only weak spot is classes. Currently R6 is used to write them, and it sometimes misbehaves when combined with the parallel package.
5. Distribution: R's only defect is that it cannot be distributed as a program; only scripts can be passed around. With Python, by contrast, an exe can be generated... This may be the essential difference between a programming language and a statistical language...
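On point 3: Python's standard library does include process-based parallelism (`multiprocessing` and `concurrent.futures`), so building an exe and launching batches by hand is not required. The sketch below is a rough analogue of a parallel loop, not a claim about how its speed compares with doParallel. The Monte Carlo task and the names `estimate_pi_batch` and `estimate_pi_parallel` are invented for this example.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def estimate_pi_batch(args):
    """Count random points that land inside the unit quarter circle."""
    seed, n_points = args
    rng = random.Random(seed)  # separate, reproducible stream per batch
    hits = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi_parallel(n_batches, n_points, workers=2):
    """Run the batches in separate processes and combine the hit counts."""
    tasks = [(seed, n_points) for seed in range(n_batches)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        hits = sum(pool.map(estimate_pi_batch, tasks))
    return 4.0 * hits / (n_batches * n_points)


if __name__ == "__main__":
    # The guard is required on platforms that start workers by re-importing
    # this module (for example Windows).
    print(estimate_pi_parallel(n_batches=4, n_points=50_000))
```

Each batch runs in its own process, so CPU-bound work is spread over several cores within a single script. How much this helps depends on the task size and the cost of starting the worker processes.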
Mark. I will post any updates here so that you can avoid detours. Questions are welcome ~ For details, see url
SAS vs. R (vs. Python)
Python has the momentum to replace MATLAB for data processing and analysis;
Compared with R, Python is still quite lacking in statistics. Many open-source statistical packages and software available in R cannot yet be found in Python, such as Bioconductor for bioinformatics.
SAS is generally used by pharmaceutical manufacturers. Because of its FDA certification, other software cannot replace it there.
SPSS is generally used for common social science statistics. You only need to click and set the relevant parameters.
Performance. Python performs well and is fast to develop in. It easily handles moderately large data, such as a demo at GB scale, which is somewhat difficult in R and MATLAB. Note that I am talking about a machine with 24 cores and 96 GB of RAM, so do not expect an ordinary laptop to take on raw data larger than a GB directly with MATLAB or R. Parsing alone is very slow.
For very large datasets, the open-source Python and R each have their own solutions, such as RHadoop or GPU packages. SAS also seems to support this, MATLAB only with some difficulty, and SPSS not at all.
Finally, I strongly recommend the Julia language, which is gradually on the rise.