Python has an extremely rich and stable ecosystem of data science tools. Unfortunately, to outsiders this ecosystem can look like a jungle (cue the snake joke). In this article, I'll guide you step by step through the PyData jungle.
You might ask: what about the many existing lists of recommended PyData packages? Offering too many choices would, I think, overwhelm a novice. So there is no long list here; instead, I'll keep the discussion narrow and focus on the 10% of tools that can do 90% of your work. Once you've mastered these essentials, you can browse the long lists of PyData tools and choose what to try next.
It's worth mentioning that the tools I introduce will let you complete most of a data scientist's day-to-day work (such as data input and output, data munging, and data analysis).
Installation
People often come to me and say, "I heard Python is good at data science, so I wanted to learn it, but installing Python and all the other modules took me two days." Installing Python itself is reasonable, since you need it to use it, but manually installing all the PyData tools when you don't yet know which ones you actually need is a huge project. So I'm strongly opposed to doing that.
Fortunately, the folks at Continuum created the Python distribution Anaconda, which contains most of the PyData toolkit. Modules that are not included by default can easily be installed through a GUI, and the distribution is available for all major platforms. This eliminates the two days of installation and lets you get straight to work.
IPython Notebook
After installing Python, most people dive straight in and start learning. That's reasonable but, sadly, a big mistake: I have almost never seen scientific computing done directly at the plain Python command line (your mileage may vary). Instead, use IPython, and especially the IPython Notebook, a particularly powerful Python shell that is widely used in the PyData world. I strongly recommend you use the IPython Notebook (IPyNB) and not bother with anything else; you won't regret it. In short, the IPyNB is a Python shell accessed through a browser. It lets you mix code, text, and graphics (even interactive objects). This article was written in an IPyNB, and at Python conferences almost every talk uses the IPython Notebook. Anaconda ships with the IPyNB preinstalled, so you can use it right away. Here's what it looks like:
In [1]:
print('Hello World')
Hello World
The IPyNB is evolving fast; every time I talk to a core developer at a conference, I'm amazed by the new features they've come up with. To get a taste of some of its advanced features, take a look at the brief tutorial below on IPython widgets. These widgets let you interactively control a plot with a slider (a minimal sketch follows the video):
In [1]:
from IPython.display import YouTubeVideo
YouTubeVideo('wxVX54ax47s')  # yes, it can embed YouTube videos too
Out[1]:
(embedded video: "6. IPython Widgets - IPython Notebook Tutorial")
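To give a taste of what such a widget looks like in code, here is a minimal sketch. It assumes the ipywidgets package (in older IPython versions the same interact function lived under IPython.html.widgets), and the plot_sine function and freq parameter are my own inventions for illustration:
In [ ]:
from ipywidgets import interact
import numpy as np
import matplotlib.pyplot as plt

def plot_sine(freq=1.0):
    # redraw a sine curve whose frequency is set by the slider
    x = np.linspace(0, 2 * np.pi, 200)
    plt.plot(x, np.sin(freq * x))
    plt.show()

# interact() turns the (min, max) tuple into a slider and re-runs plot_sine
interact(plot_sine, freq=(0.5, 5.0));
Drag the slider and the plot re-renders with the chosen frequency.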
Pandas
Usually, you'll be advised to learn NumPy (pronounced num-pie, not num-pee), a library supporting multidimensional arrays. That was certainly the right advice a few years ago, but nowadays I hardly use NumPy directly. NumPy is increasingly becoming a core library that other libraries build on, and those libraries usually have more elegant interfaces. As a result, pandas has become the main library for working with data. It can input and output data in a variety of formats (including databases), perform joins and other SQL-like operations to reshape data, handle missing values gracefully, support time series, and it comes with basic plotting and statistical functionality, among much else. There is certainly a learning curve to all its features, but I strongly recommend you at least skim most of the documentation. The time you invest will make your data munging far more efficient, which will pay off a thousandfold. Here are a few quick examples to whet your appetite:
In [18]:
import pandas as pd

df = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20130102'),
                   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                   'D': pd.Series([1, 2, 1, 2], dtype='int32'),
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'})
In [19]:
df
Out[19]:
   A          B  C  D      E    F
0  1 2013-01-02  1  1   test  foo
1  1 2013-01-02  1  2  train  foo
2  1 2013-01-02  1  1   test  foo
3  1 2013-01-02  1  2  train  foo
You can get a column by using a column name:
In [17]:
df.B
Out[17]:
0   2013-01-02
1   2013-01-02
2   2013-01-02
3   2013-01-02
Name: B, dtype: datetime64[ns]
Compute the sum of D for every category in E:
In [21]:
df.groupby('E').sum().D
Out[21]:
E
test     2
train    4
Name: D, dtype: int32
Achieving the same thing with NumPy (or clunky Matlab) would be considerably more cumbersome.
And there's much more. If you don't believe me, take a look at the tutorial "10 Minutes to Pandas"; the examples above are taken from it.
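To make a few of the capabilities listed above concrete (input/output in different formats, SQL-like joins, missing values), here is a minimal sketch; the two frames and the file name data.csv are made up for illustration:
In [ ]:
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'y': [4.0, None, 6.0]})

merged = pd.merge(left, right, on='key', how='left')  # SQL-style left join
merged['y'] = merged['y'].fillna(0.0)                 # fill in missing values
merged.to_csv('data.csv', index=False)                # write out...
print(pd.read_csv('data.csv'))                        # ...and read back in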
Seaborn
Matplotlib is Python's main plotting library. However, I don't recommend using it directly, for the same reason I don't recommend starting with NumPy. While matplotlib is very powerful, it is also complex in itself, and getting a polished figure out of it takes a great deal of tweaking. So, as a substitute, I recommend starting with Seaborn. Seaborn essentially uses matplotlib as its core library (just as pandas does with NumPy). Let me briefly describe Seaborn's advantages. Specifically, it can:
- Create pleasing charts by default (just one example: the default colormap is not jet)
- Create statistically meaningful plots
- Understand pandas DataFrames, so the two work well together
Although Anaconda comes with pandas preinstalled, Seaborn is not included. It's easy to install with conda install seaborn.
Statistically meaningful plots
In [5]:
# IPython magic to create plots within cells
%matplotlib inline
In [7]:
import seaborn as sns

# load one of the example datasets that ship with seaborn
tips = sns.load_dataset("tips")
sns.jointplot("total_bill", "tip", tips, kind='reg');
As you can see, with just one line of code we created a beautiful, complex statistical chart containing a best-fit regression line with a confidence interval, marginal plots, and the correlation coefficient. Recreating this picture with matplotlib would take quite a lot of (ugly) code, including calls to SciPy to perform the linear regression and manually drawing the line from the regression equation (and I can't even figure out how to compute the confidence interval or do the marginal plots). This example and the next are excerpted from the tutorial on quantitative linear models.
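To see the point, here is a rough matplotlib/SciPy sketch of just the central panel (scatter plus regression line; no confidence band and no marginal plots). This is my own illustration of the manual route, not what Seaborn does internally:
In [ ]:
import matplotlib.pyplot as plt
from scipy import stats

# fit the regression line by hand
slope, intercept, r, p, stderr = stats.linregress(tips.total_bill, tips.tip)

plt.scatter(tips.total_bill, tips.tip)
xs = [tips.total_bill.min(), tips.total_bill.max()]
plt.plot(xs, [slope * x + intercept for x in xs])
plt.xlabel("total_bill")
plt.ylabel("tip")
plt.title("pearsonr = %.2f" % r);
And that still gets you nowhere near the marginal histograms.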
Working well with pandas DataFrames
Data has structure. Often there are different groups or classes we're interested in (pandas' groupby function works wonders here). The tips dataset looks like this:
In [9]:
tips.head()
Out[9]:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
We might want to know whether smokers tip differently than non-smokers. Without Seaborn, this would require pandas' groupby function plus some convoluted code to draw the linear regressions. With Seaborn, we can pass a column name to the col argument and split the data as needed:
In [11]:
sns.lmplot("total_bill", "tip", tips, col="smoker");
Pretty neat, huh?
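For comparison, here is a quick sketch of the purely numeric answer via pandas' groupby (mean tip by smoker status, no plotting):
In [ ]:
# mean tip, split by smoker status
tips.groupby('smoker').tip.mean()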
As you dig deeper, you may want to control the details of these figures at a finer granularity. Since Seaborn just calls matplotlib under the hood, you'll probably want to learn that library then. However, I still prefer Seaborn for most of my work.
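As a sketch of what dropping down a level looks like, assuming the lmplot call above: seaborn's lmplot returns a FacetGrid, which exposes the underlying matplotlib axes.
In [ ]:
g = sns.lmplot("total_bill", "tip", tips, col="smoker")
g.set_axis_labels("Total bill ($)", "Tip ($)")  # FacetGrid convenience method
for ax in g.axes.flat:
    ax.set_xlim(0, 60)  # plain matplotlib Axes methods work here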
Summary
The idea of this article is to maximize a newcomer's efficiency in doing data science with Python by presenting only a small subset of the available packages.