Python vs. R: How Should a Big Data Beginner Choose?
By Li Shuhan, big data engineer at Getui
Today, a wave of artificial intelligence is sweeping through every industry. From AlphaGo, driverless vehicles, face recognition, and voice assistants to e-commerce recommendation systems and the financial sector's risk control, quantitative trading, user insight, enterprise credit scoring, and robo-advisory, AI applications have penetrated all walks of life and left data scientists in short supply. Python and R, the mainstream languages of machine learning, are attracting more and more attention, yet newcomers to the field often do not know how to choose between them. This article compares the two languages in terms of their characteristics and typical usage scenarios.
One. Concepts and features of Python and R
Python is an object-oriented, interpreted, free, open-source high-level language. It is powerful, concise, readable, and extensible, enjoys active community support and a wide variety of libraries, and has become one of the most popular programming languages in recent years.
Advantages of Python:
1. Python is used in far more scenarios. Beyond statistical analysis, where it matches R, it is widely used in systems programming, graphics processing, text processing, database programming, network programming, web development, web crawling, and more, which makes it well suited to programmers who want to move into data analysis or apply statistical techniques.
2. The mainstream big data and machine learning frameworks, such as Hadoop, Spark, and TensorFlow, provide good support for Python. The language also has strong community backing, and with the rise of artificial intelligence in recent years, more and more developers are active in the Python community.
3. As a glue language, Python links easily with other languages: for example, you can write your statistical analysis in R and then wrap it as an extension library that Python can call.
R is an interpreted language for data exploration, statistical analysis, and plotting, though it feels more like a mathematical computing environment. It is rich in packages and offers a very convenient programming model for mathematical computation, especially matrix computation.
Advantages of the R language:
1. The R language produces many elegant and intuitive charts. Common data visualization toolkits include:
· rCharts and plotly for interactive charts, dygraphs for interactive time series, treemap for interactive tree maps
· ggplot2, a plotting system based on the grammar of graphics
· lattice, R's trellis graphics
· rbokeh, an R interface to Bokeh
· rgl, 3D visualization using OpenGL
· shiny, a framework for creating interactive applications and visualizations
· visNetwork, interactive network visualization
(Example charts: a scatter plot, a time series plot, and a word cloud.)
2. R has a large number of practical functions written specifically for statisticians and a rich set of mathematical toolkits: the base module base, the maximum likelihood estimation module mle, the time series analysis module ts, the multivariate statistics module mva, the survival analysis module survival, and so on. Users can also work flexibly with array and matrix operators and a coherent, complete set of intermediate tools for data analysis.
3. The language is simple and quick to pick up, with no need to declare variable types explicitly. For example, the following three lines of code are enough to define a univariate linear regression:
x <- 1:10
y <- x + rnorm(10, 0, 1)
fit <- lm(y ~ x)
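For comparison, the same toy regression can be written in a few lines of Python without any statistics library; the ordinary least-squares formulas below are standard, and the variable names are my own:

```python
import random

# Mirror the three-line R example: simulate y = x + noise, then fit
# a univariate linear regression by ordinary least squares.
random.seed(0)
x = list(range(1, 11))                      # x <- 1:10
y = [xi + random.gauss(0, 1) for xi in x]   # y <- x + rnorm(10, 0, 1)

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x         # fit <- lm(y ~ x)
```

With this seed the estimated slope comes out close to the true value of 1, which is all the R one-liner does under the hood for one predictor.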
At the same time, R has strong support for vectorization: vectorized operations, whose computations do not depend on one another, enable a high degree of parallelism and avoid many explicit loop structures.
Of course, R has some disadvantages compared with Python, such as memory management: in large-sample regressions, careless use of memory can make it inefficient. Spark now provides support for R, however, so developers can use SparkR for big data processing.
Two. Differences between Python and R in text mining and time series analysis
Both Python and R have very powerful code repositories: Python has PyPI and R has CRAN. But their directions differ: Python is used across a broader range of domains, while R focuses more on statistics and runs slowly when data volumes are large. Below I compare Python and R in two data analysis scenarios:
1. Text mining:
Text mining has very wide applications, for example analyzing the sentiment polarity of online purchase reviews, social network posts, or news. Here we use an example to analyze and compare the two languages.
Python has good packages to help with this analysis, such as NLTK and, specifically for Chinese, SnowNLP, which includes modules for Chinese word segmentation, part-of-speech tagging, sentiment analysis, text classification, TextRank, TF-IDF, and more.
When using Python for sentiment polarity analysis, we first need to split sentences into words. Here we can use the jieba segmenter, which is very simple to use:
words = jieba.cut(text, cut_all=False)
Then, for feature extraction, we can first remove stop words using NLTK's stopwords corpus. If necessary, the text can then be vectorized: with a bag-of-words model we can choose TF-IDF for weight-based vectorization, or use word2vec for similarity-based vectorization. Next, we reduce dimensionality with PCA from the sklearn package:
pca = PCA(n_components=1)
newdata = pca.fit_transform(data)
Besides PCA, other methods such as mutual information or information entropy can be used for feature selection.
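To make the TF-IDF weighting step concrete, here is a minimal pure-Python sketch; the toy corpus and the smoothed idf variant are my own illustrative choices, not sklearn's exact formula:

```python
import math
from collections import Counter

# Toy corpus of pre-segmented documents (invented for illustration).
corpus = [
    ["the", "service", "was", "great"],
    ["the", "service", "was", "slow"],
    ["the", "food", "was", "great"],
]

def tf_idf(doc, corpus):
    tf = Counter(doc)
    n_docs = len(corpus)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)  # document frequency
        idf = math.log(n_docs / df) + 1           # smoothed inverse df
        weights[term] = (count / len(doc)) * idf  # term frequency * idf
    return weights

w = tf_idf(corpus[0], corpus)
```

Terms that appear in fewer documents (like "service") receive higher weight than ubiquitous ones (like "the"), which is exactly the effect the weighting step above relies on.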
Then we train a classification model and evaluate it; we can use NLTK's built-in machine learning methods such as Naive Bayes or decision trees.
Using R for sentiment polarity analysis:
First, the data need to be preprocessed: install the Rwordseg and rJava packages (installing them involves many pitfalls);
After cleansing the data of useless symbols, segment the words: the segmentCN function in Rwordseg can segment Chinese text, and jiebaR can also be used;
Next, construct a word-document-label data set and remove stop words;
Create a document-term matrix; you can choose TermDocumentMatrix and use the weightTfIdf method to obtain the TF-IDF matrix;
Finally, use the Bayesian method in the e1071 package for text classification, or use other machine learning algorithms from the RTextTools package to complete the classification; it contains nine algorithms:
· BAGGING (ipred::bagging): bagging ensemble classification
· BOOSTING (caTools::LogitBoost): logit boosting ensemble classification
· GLMNET (glmnet::glmnet): generalized linear regression based on maximum likelihood
· MAXENT (maxent::maxent): maximum entropy model
· NNET (nnet::nnet): neural network
· RF (randomForest::randomForest): random forest
· SLDA (ipred::slda): scaled linear discriminant analysis
· SVM (e1071::svm): support vector machine
· TREE (tree::tree): recursive classification tree
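As a rough Python counterpart of the Bayesian classification step, here is a toy multinomial Naive Bayes on made-up sentiment data; this is a sketch of the technique, not the e1071 implementation:

```python
import math
from collections import Counter, defaultdict

# Toy Naive Bayes sentiment classifier; the training documents are invented.
train = [
    (["good", "great", "love"], "pos"),
    (["great", "nice"], "pos"),
    (["bad", "awful"], "neg"),
    (["bad", "hate", "awful"], "neg"),
]

class_words = defaultdict(list)
for words, label in train:
    class_words[label].extend(words)

vocab = {w for words, _ in train for w in words}

def predict(words):
    best, best_lp = None, float("-inf")
    for label, bag in class_words.items():
        counts = Counter(bag)
        # log prior + log likelihoods with add-one (Laplace) smoothing
        lp = math.log(sum(1 for _, l in train if l == label) / len(train))
        for w in words:
            lp += math.log((counts[w] + 1) / (len(bag) + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

On this toy data, predict(["great", "love"]) returns "pos": the class with the highest smoothed log posterior wins.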
2. Time series analysis:
Time series analysis builds mathematical models from time-ordered observations of a system through curve fitting and parameter estimation. It is commonly used in finance, weather forecasting, market analysis, and similar fields. R has a number of packages for handling both regular and irregular time series, which gives it an advantage here.
Python often uses the ARIMA(p, d, q) model for time series analysis, where d is the order of differencing and p and q are the autoregressive and moving-average orders respectively. The package most commonly used for ARIMA is statsmodels, which supports differencing tests, modeling, and model checking for time series. Here is an example of a periodic prediction:
The following data set represents a US bus company's annual passenger numbers over 50 years (say, 1950-2000):
data = [9930, 9318, 9595, 9972, 6706, 5756, 8092, 9551, 8722, 9913, 10151, 7186, 5422, 5337, 10649, 10652, 9310, 11043, 6937, 5476, 8662, 8570, 8981, 8331, 8449, 5773, 5304, 8355, 9477, 9148, 9395, 10261, 7713, 6299, 9424, 9795, 10069, 10602, 10427, 8095, 6707, 9767, 11136, 11812, 11006, 11528, 9329, 6818, 10719, 10683]
1). First, store the data with pandas:
data = pd.Series(data)
2). Then we need to test the stationarity of the data, generally with a unit root test; common methods are ADF, DFGLS, PP, and so on:
In Python, calling ADF(data) or DFGLS(data) directly (for example, from the arch package) produces p-value results.
3). Stationarity is a precondition for time series analysis. If the previous step shows the series is not stationary, it must be differenced:
diff1 = data.diff(2)
Note that the argument of pandas' diff is the lag: data.diff(2) computes a lag-2 difference, while a true second-order difference is data.diff().diff(); lags of 1, 3, 4, and so on can also be used.
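The distinction between lag and order can be seen by differencing by hand (pure Python, illustrative only):

```python
# Differencing by hand on the first few values of the series above.
data = [9930, 9318, 9595, 9972, 6706]

def diff(xs, lag=1):
    # subtract from each element the value `lag` positions earlier
    return [b - a for a, b in zip(xs, xs[lag:])]

first_order = diff(data)          # like data.diff(1), minus the leading NaN
lag2 = diff(data, lag=2)          # what pandas data.diff(2) computes
second_order = diff(diff(data))   # a true second-order difference
```

The lag-2 and second-order results differ, which is why the choice matters when setting d in the model below.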
4). White noise test:
value = acorr_ljungbox(data, lags=1)
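Behind acorr_ljungbox is the Ljung-Box Q statistic, which is simple enough to compute by hand; this is a sketch of the formula, not a replacement for the library call:

```python
# Ljung-Box Q statistic: Q = n(n+2) * sum_k rho_k^2 / (n - k).
# A large Q (relative to a chi-squared quantile) rejects the
# white-noise hypothesis.
def ljung_box_q(xs, lags=1):
    n = len(xs)
    mean = sum(xs) / n
    denom = sum((x - mean) ** 2 for x in xs)
    q = 0.0
    for k in range(1, lags + 1):
        rho_k = sum((xs[i] - mean) * (xs[i + k] - mean)
                    for i in range(n - k)) / denom  # lag-k autocorrelation
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q
```

A strongly trending series such as 1..10 yields Q well above the 5% chi-squared cutoff of 3.84 for one lag, so it is clearly not white noise.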
5). Now that d = 2 in our ARIMA(p, d, q), we move on to model selection. The first step is to determine p and q: examine the autocorrelation and partial autocorrelation plots of the stationary series via sm.graphics.tsa.plot_acf(data) and sm.graphics.tsa.plot_pacf(data), then choose among AR, MA, ARMA, and ARIMA according to the coefficients.
6). Model training: model = sm.tsa.ARMA(data, (p, q)).fit(), where p and q are the orders determined in the previous step; this trains the ARMA model.
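To make the AR part of this step concrete, here is a one-parameter toy fit: least squares for an AR(1) coefficient on invented numbers. The real statsmodels routine also estimates MA terms and a constant.

```python
# Estimate phi in x_t ≈ phi * x_{t-1} by least squares (toy AR(1) fit).
series = [1.0, 0.9, 0.85, 0.7, 0.68, 0.6, 0.5, 0.47]

num = sum(a * b for a, b in zip(series[:-1], series[1:]))
den = sum(a * a for a in series[:-1])
phi = num / den   # close to the ratio by which the series decays
```

For this slowly decaying series the estimate lands near 0.9, matching the visible decay rate.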
Using R to build a time series model:
R has a variety of toolkits for time series, for example:
library(xts), library(timeSeries), library(zoo) - basic time series packages
library(urca) - unit root tests
library(tseries) - ARMA models
library(fUnitRoots) - unit root tests
library(FinTS) - call its autoregressive test function
library(fGarch) - GARCH models
library(nlme) - call its gls function
library(fArma) - ARMA fitting and testing
library(forecast) - ARIMA modeling
Let me introduce two very powerful tools from R's forecast package: ets and auto.arima. The user does not need to do anything; these two functions automatically pick the most appropriate algorithm to analyze the data. For example, with ets:
fit <- ets(train)
accuracy(predict(fit, 12), test)
Or with auto.arima:
fit <- auto.arima(train)
accuracy(forecast(fit, h=12), test)
In addition, the forecast package contains the Holt-Winters algorithm for time series with increasing or decreasing trends and seasonal fluctuations. The idea of Holt-Winters is to decompose the data into three components: level, trend, and seasonality. In R, the simple function stl() can decompose the original data this way.
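The level and trend components of Holt-Winters can be sketched in a few lines of Python (the seasonal component is omitted for brevity, and the smoothing constants alpha and beta here are arbitrary choices):

```python
# Holt's linear method: smooth a level and a trend, then forecast one step.
def holt_linear(xs, alpha=0.5, beta=0.3):
    level, trend = xs[0], xs[1] - xs[0]   # simple initialization
    for x in xs[1:]:
        last_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
    return level + trend                  # one-step-ahead forecast
```

On a perfectly linear series the method recovers the trend exactly: holt_linear([10, 12, 14, 16, 18]) is 20 up to rounding.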
This article has analyzed the two programming languages, Python and R, through their advantages and concrete examples. It is not hard to see that the two are evenly matched in overall strength; which one to study in depth still depends on the problems you actually expect to solve, the field you will apply it in, and so on. Finally, you are welcome to discuss big data programming languages with me.