Update: Thank you for the upvotes, thanks, and comments. I am posting a data-processing example done in Python that defines a fairly complex new variable, a simple piece of feature engineering. This task would be a headache in Stata. The example is also a chance to try out the IPython Notebook (view it in a desktop browser; it renders poorly on mobile).
GitHub Link: machine-learning-mini-project/feature engineering.ipynb
Original answer: I would like to share my own experience with Python and Stata; I have rarely used R, so I will not discuss it. I should stress that I am only a beginner in both Stata and Python, so my comparison is limited by my own level and may not be entirely fair. Corrections are welcome.
First, the conclusion: for applied data analysis, going from using only Stata to using Python fairly fluently is likely to pay off, and it comes with the pleasure of things suddenly making sense. These skills are more widely applicable than Stata, and the basics are not too hard to learn as long as you are willing to put in some effort. Add the large boost in learning efficiency that communities like Stack Overflow provide, and learning Python has a high return on investment.
I work in applied micro. Most of my research projects do not involve any advanced econometric methods; they basically follow the "hard labor" route of thinking carefully about the question and then painstakingly collecting data (for the economic history projects, the data are collected from original historical records; for the more management-oriented projects, they are performance-appraisal data in which a company's employees evaluate one another). So what I need from software is mainly data cleaning, transformation, visualization, and so on.
I originally used Stata. At the time I found Stata very convenient, especially for defining new variables (syntax such as bysort: gen is very handy), running OLS/logit regressions, and then exporting the tables to LaTeX. Stata makes these basic tasks very easy. The fly in the ointment is that once I had to write my own functions, I could not get used to Stata's way of programming, so my code was hard to reuse, and as the do-files grew longer they slowly started to feel chaotic. Its matrix operations and computational functions are also not very good.
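For readers coming from Stata, here is a rough sketch of how a by-group new variable can be defined in pandas. The data and the column names (firm, wage) are made up for illustration, and the Stata commands in the comments are only approximate counterparts, not exact equivalents.

```python
import pandas as pd

# Hypothetical data: a few wage observations grouped by firm.
df = pd.DataFrame({
    "firm": ["A", "A", "B", "B", "B"],
    "wage": [10.0, 14.0, 8.0, 9.0, 13.0],
})

# Roughly what `bysort firm: egen mean_wage = mean(wage)` does in Stata:
# compute the group mean and broadcast it back to every row of the group.
df["mean_wage"] = df.groupby("firm")["wage"].transform("mean")

# Roughly `bysort firm: gen obs = _n`: a running counter within each group,
# following the current row order of the DataFrame.
df["obs"] = df.groupby("firm").cumcount() + 1

print(df)
```

Because transform returns a result aligned with the original rows, the new column can be assigned directly without a merge step.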
Later, as I became interested in data science and machine learning, I took some Python-based courses on edX, Coursera, Udacity, and other platforms. The most useful and rewarding were MIT's two Python courses on edX (6.00.1x and 6.00.2x) and Intro to Data Science on Udacity. After these courses I did some small machine learning projects. My purpose in learning was not to apply it to my own economics research. At that time, apart from one game-theory model that I could not solve analytically, for which I used Python to run a small agent-based simulation to characterize the equilibrium, I had not really used it to complete a project.
Interestingly, I started a new project a few months later. It still did not require advanced statistics or econometrics, but the data processing was more complicated than before: I needed to aggregate the data into transition matrices, do some calculations on them, and do a lot of data visualization. Having started the new project in Stata, I tried using Python's pandas for data manipulation and matplotlib for plotting. Another reason is that once I started using the IPython Notebook I could not stop: code and analysis results (charts) are integrated in a single document (each piece of code is followed by its output), which is ideal for organizing and sharing work. Those who use it know.
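To give a flavor of the "aggregate the data into a transition matrix" step, here is a minimal sketch in pandas. It is not the code from my project; the panel, its column names (unit, period, state), and the states themselves are invented for illustration.

```python
import pandas as pd

# Hypothetical panel: the state of each unit in consecutive periods.
panel = pd.DataFrame({
    "unit":   [1, 1, 1, 2, 2, 2],
    "period": [1, 2, 3, 1, 2, 3],
    "state":  ["low", "low", "high", "high", "high", "low"],
})

# Pair each observation with the same unit's state in the next period.
panel = panel.sort_values(["unit", "period"])
panel["next_state"] = panel.groupby("unit")["state"].shift(-1)

# The last period of each unit has no successor, so drop those rows.
moves = panel.dropna(subset=["next_state"])

# Count transitions and normalize each row so it sums to 1.
transition = pd.crosstab(moves["state"], moves["next_state"], normalize="index")

print(transition)
```

Each row of the result is the empirical distribution of next-period states given the current state, so further calculations can be done on transition.values as an ordinary NumPy array.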
When I first moved from Stata to Python, I was not yet used to the pandas DataFrame, especially reshaping, MultiIndex, pivot_table, and similar features, so I still missed Stata. Then I slowly came to appreciate how powerful pandas' data manipulation is.
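For anyone in the same position, here is a small sketch of the features mentioned above (reshaping, MultiIndex, pivot_table). The data and column names (region, year, output) are made up for illustration.

```python
import pandas as pd

# Hypothetical long-format data: one row per observation.
long_df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "year":   [1900, 1910, 1900, 1910, 1910],
    "output": [5.0, 7.0, 3.0, 4.0, 6.0],
})

# Long -> wide: one row per region, one column per year,
# averaging when a (region, year) cell has several observations.
wide = long_df.pivot_table(index="region", columns="year",
                           values="output", aggfunc="mean")

# The same aggregation via groupby gives a Series with a two-level MultiIndex.
by_cell = long_df.groupby(["region", "year"])["output"].mean()

# unstack moves the "year" level of the index into the columns,
# producing the same wide table as pivot_table above.
same_wide = by_cell.unstack("year")

print(wide)
```

The groupby plus unstack route is closer to how one would think in Stata (collapse, then reshape wide), while pivot_table does both steps at once.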
In short, what satisfies me most about efficiency since switching to Python is that the whole analysis is automated, from raw data to the final charts and results, with no semi-automated manual adjustments needed. The amount of reusable code has also increased significantly. In addition, thanks to the stronger data manipulation capabilities, I visualize data more often than before, and for almost every regression analysis I also do the corresponding descriptive analysis and visualization.
Finally, I have to mention the power of the Python community. When I do not know how to do something, I Google it, and the Stack Overflow questions and technical blog posts that come up basically solve the problem. With Stata, by contrast, I often felt powerless: when I got stuck I stayed stuck, struggled for a long time, and could only read the documentation and figure things out on my own.
Addendum: A friend asked me what I use to make figures. I use Matplotlib. It is not especially easy to use, but its basic functionality is more or less enough. Here are some of the figures from my economic history research. They are very basic, just to give interested friends an idea of how I use it. Please excuse the amateur work :)
You are a statistician, not a programmer. You are a statistician, not a programmer. You are a statistical biologist, not a programmer.
A programming language is a tool for implementing your ideas, but what supports your ideas is not Python or R; it is probability and statistics, it is mathematics.
I had a similar dilemma before, so I talked to a professor, and that is the answer I got.
Of course, I am not trying to make excuses for announcing every year that I will learn Python. And R is great. One advantage of R is that it is written by statisticians, and the disadvantage of R is that it is written by statisticians.
As I see it, R, Python, and MATLAB can basically replace one another; the harder it is to choose, the more it shows that any of them will do. When I was taking ML, I asked the teacher which one was most suitable, and that was the teacher's reply as well. Whether for statistics, econometrics, or time series, I have always used R and am quite satisfied. After all, it is made by statisticians for their own use; they know what they need, and it is professional enough.
As for Stata, I put it in the same class as SPSS and EViews, what I would call econometrics software; it is essentially different from a statistical language like R.
Thanks for the invitation. In this area I have only scratched the surface of bioinformatics, so this is a forced answer.
Some people I know who work in bioinformatics also specialize in data analysis. Python works for that; after all, it is convenient for data mining.
Python can do everything, but it is not the best at any one thing (inevitably, there is no best, only better).
As for replacing R, probably not; after all, R and similar tools are specialized for this. I think the situation differs between academia and industry.
In academia, as the author of the current top-voted answer says, R or Python is just a tool; what matters more is the thinking. So the arrival of Python simply gives researchers some new tools. My advisor in operations research seemed to use Python more (another research professor used C ...). Perhaps this is largely decided by a professor's own style and research direction. So as long as Python is not strong enough to crush other languages, R should still not be replaced.
Industry is different. Python has the opportunity to replace R because it is easy to get started with, readable, and so on. If you just want to do data processing, R is good.
If in the future you want to do data processing, crawl the web to mine data, and build a blog along the way, then starting with Python is more convenient.
Shamelessly forcing an answer. I focus on statistics / big data / data science. Stata is simply not comparable. Never mind Stata; even SAS is losing its comparability. Python and R each have their own strengths. Simply put, these two are the heavyweight tools of two schools, CS and statistics. Python better reflects CS thinking, while R, for statisticians, basically carries a theory through to an applied implementation. This, of course, has a lot to do with the history of the two languages. Python itself is closely related to C, and R is based on the S language. Although both can now call other low-level languages, these historical reasons have shaped their respective traits. Python is more like an all-in-one tool, and writing a UI or the like is no problem (which is not to say that R cannot do a UI, only that it is hard). In addition, in a big data setting, Python's compatibility is significantly stronger. Python's NLP tooling is also a big advantage.
R's obvious advantage is likewise tied to its statistical thinking. As another answer says, it is written by statisticians for themselves, so for analysis R's packages are extremely numerous and rich. Whenever statistics makes a small breakthrough, some developer writes a package that turns the theory into easy-to-use functions, that is, implements the theory as an algorithm. Of course, when other fields have some use for it, someone will write an R package for that too ... Without following statistics over the long term and using both Python and R in depth, this may be hard to appreciate. Python actually has this trait as well, but it cannot compare with R. This is also related to Python's traditional users having a CS background rather than a statistical analysis background.
The ease with which Python interacts with low-level languages is also a distinguishing feature, so you can see that many people in quantitative trading do not choose R and naturally choose Python (low-level languages, of course, hold an unshakable position in this field). Comparing them directly, I personally think R is better suited to strategy and model development, while Python is a better fit for implementing your model as a whole. In general, the two languages keep learning from each other. Both are likely to persist for a long time, eating into the market share of other analytic languages such as SAS and MATLAB. The strategy of commercializing an analytic language and promoting it on a large scale has had to gradually adapt to the new era of big data. Open source has become a big advantage, and sharing knowledge and results is important. After all, working behind closed doors is too slow. Another point is that the older languages, SAS especially, survive, to exaggerate a little, largely because they are propped up by traditional users and business users. However, these traditional users find it hard to create much value today, especially the enterprises typified by corporate America and the top-500 companies. Two more things to add: 1. Besides analysis, data science involves a lot of low-level development, so knowing one or two low-level languages is a great advantage. 2. I am also very optimistic about Scala's future.
For a PAT. Of course, compared with Python, R is a higher-level programming language and more user friendly, but its limitations are greater. If your programming is strong enough, you can of course use Python to do everything R can do, and faster. In the end it is just a tool: once you are proficient in one, the rest are not difficult.
Take a look at what one of Statsmodels's developers has said:
"I can see that. Much of Python stats strikes me as poor imitation of R. Like Matplotlib:Matlab, OO:MS Office"
Referring to Statsmodels
I'm not sure whether the implied criticism is on "poor" or "imitation"
I would "officially" correct this :)
Statsmodels isn't only a poor imitation of R, it's also a poor imitation of Stata. It is in some parts a poor imitation of SAS, and maybe even in some parts a poor imitation of Matlab or GAUSS or ...., and maybe in some parts it's even a good imitation.
But I think it is a good imitation of statsmodels,
although with still some very important gaps in coverage of statistics and econometrics.