A Simple Getting-Started Tutorial for Python Data Science


Python has an extremely rich and stable data science tool ecosystem. Unfortunately, for newcomers this ecosystem can feel like a jungle (insert snake joke here). In this article, I will guide you step by step into the PyData jungle.

You may ask: with so many PyData packages out there, which ones are recommended? I think too many choices would only overwhelm a new user, so I will not give a long recommendation list here. The scope of this article is deliberately narrow: it covers only the 10% of tools that can do 90% of your work. Once you have mastered these essentials, you can browse the long list of PyData tools and pick what to try next.

It is worth mentioning that the tools introduced here let you complete most of a data scientist's daily work (such as data input and output, data munging, and data analysis).
Install

People often come to me and say, "I heard Python is great for data science, so I wanted to learn it, but it took me two days to install Python and all the other modules." Installing Python makes sense because you need it, but manually installing every PyData tool before you know which ones you actually need is a big project, and I strongly advise against it.

Fortunately, the Continuum team created the Anaconda Python distribution, which includes most of the PyData toolkit. Modules it does not ship by default can easily be installed through its GUI. The distribution is available for all major platforms, so instead of spending two days on installation, you can get to work right away.
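If you prefer the command line to the GUI, conda can add missing packages just as easily; a minimal sketch (the package name is only an example):

```shell
# Check that conda is available, then install a package not bundled by default
conda --version
conda install -y seaborn
```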
IPython Notebook

After installing Python, most people just launch it and start learning. That sounds reasonable but is unfortunately a big mistake. I have almost never seen scientific Python work done directly at the plain Python command line (your mileage may vary). Instead, use IPython, and especially the IPython Notebook. Both are very powerful Python shells that are widely used in the PyData world. I strongly recommend going straight to the IPython Notebook (IPyNB) and not worrying about anything else; you will not regret it. In short, IPyNB is a Python shell accessed through the browser. It lets you mix and edit code, text, and graphics (even interactive objects). This article was written in IPyNB, and almost every Python conference talk uses the IPython Notebook. IPyNB comes pre-installed with Anaconda and is ready to use. Let's see what it looks like:

In [1]:

print('Hello World')
Hello World
IPyNB is developing very fast. Every time I hear one of its core developers speak at a conference, I am struck by the new features they have come up with. To get a feel for some of its advanced features, take a look at this short tutorial on IPython widgets, which let you control a plot interactively with a slider:

In [1]:
 
from IPython.display import YouTubeVideo
YouTubeVideo('wxVx54ax47s')  # yes, YouTube videos can be embedded too
Out [1]:
6. IPython Widgets – IPython Notebook Tutorial
Pandas

Usually you will be advised to learn NumPy (pronounced num-pie, not num-pee) first, the library for multi-dimensional arrays. That advice would have been right a few years ago, but today I hardly use NumPy directly: it has increasingly become a core library that other libraries build on, and those libraries usually offer a more elegant interface. As a result, Pandas has become the main library for working with data. It can read and write data in many formats (including databases), perform joins and other SQL-like operations to reshape data, handle missing values gracefully, support time series, and provide basic plotting and statistical functions, among much more. All of these features come with a learning curve, but I strongly recommend working through most of the documentation. The time you invest will make your data munging far more efficient and pay for itself many times over. Here are a few quick examples to whet your appetite:
In [18]:

import pandas as pd

df = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20130102'),
                   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                   'D': pd.Series([1, 2, 1, 2], dtype='int32'),
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'})
In [19]:

df
Out [19]:
   A          B  C  D      E    F
0  1 2013-01-02  1  1   test  foo
1  1 2013-01-02  1  2  train  foo
2  1 2013-01-02  1  1   test  foo
3  1 2013-01-02  1  2  train  foo
You can get a column by column name:
In [17]:
 
df.B
Out [17]:
 
0 2013-01-02
1 2013-01-02
2 2013-01-02
3 2013-01-02
Name: B, dtype: datetime64[ns]
 
Compute the sum of D for each category in E:
In [21]:
 
df.groupby('E').sum().D
Out [21]:
 
E
test 2
train 4
Name: D, dtype: int32
Achieving the same thing in NumPy (or the clunkier Matlab) is much more cumbersome.
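To illustrate the point, here is a rough sketch of what the same group-and-sum looks like with NumPy alone (the arrays are hand-copied from the E and D columns of the df above):

```python
import numpy as np

# The E and D columns from the DataFrame above, as plain arrays
labels = np.array(["test", "train", "test", "train"])
values = np.array([1, 2, 1, 2], dtype=np.int32)

# Group-and-sum by hand: one pass per unique label
sums = {}
for lab in np.unique(labels):
    sums[str(lab)] = int(values[labels == lab].sum())

print(sums)  # {'test': 2, 'train': 4}
```

One line of pandas replaces the explicit loop and bookkeeping.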

There is much more. If you are not convinced, take a look at the tutorial "10 Minutes to pandas", from which the example above is taken.
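Two of the features mentioned above, missing-value handling and SQL-like joins, in a minimal sketch (the frames and column names are invented for illustration):

```python
import numpy as np
import pandas as pd

# Two toy frames sharing a 'key' column (made-up data)
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1.0, np.nan, 3.0]})
right = pd.DataFrame({"key": ["a", "b", "d"], "y": [10, 20, 40]})

# Fill the missing x with the column mean: (1.0 + 3.0) / 2 = 2.0
left["x"] = left["x"].fillna(left["x"].mean())

# SQL-style inner join on 'key' keeps only keys present in both frames
merged = pd.merge(left, right, on="key", how="inner")
print(merged)
```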
Seaborn

Matplotlib is Python's main plotting library, but I do not recommend using it directly, for the same reason I did not recommend NumPy. Matplotlib is very powerful, but it is also complex, and it takes a lot of tweaking before your plots look polished. Instead, I recommend starting with Seaborn. Seaborn essentially uses Matplotlib as its core (just as Pandas does with NumPy). Briefly, the advantages of Seaborn are that it can:

    Create pleasing charts by default (one small point: the default colormap is not jet)
    Create statistically meaningful plots
    Understand Pandas DataFrames, so the two work well together
Although Pandas comes pre-installed with Anaconda, Seaborn does not; it can be installed easily with conda install seaborn.
Statistically meaningful plots
In [5]:
 
%matplotlib inline  # IPython magic to render plots inside cells
In [7]:
 
import seaborn as sns

# Load one of the example datasets that ship with seaborn
tips = sns.load_dataset("tips")

sns.jointplot("total_bill", "tip", tips, kind='reg');
As you can see, with a single line of code we have created a beautiful, sophisticated statistical plot containing a best-fit regression line with its confidence interval, marginal plots, and the correlation coefficient. Recreating this in raw matplotlib would require a lot of (ugly) code, including calls to scipy for the linear regression and manually plotting the fitted line (I cannot even think offhand of how to draw the marginal plots or compute the confidence interval). This and the following example are taken from the tutorial on quantitative linear models.
Works well with Pandas DataFrame

Data has structure. Often it contains different groups or classes we are interested in (the groupby function in pandas works wonders in such cases). For example, the tips dataset looks like this:
In [9]:
 
tips.head()
Out [9]:
 total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
We might ask whether smokers tip differently from non-smokers. Without seaborn, this would require pandas' groupby plus fairly involved code to draw a regression line for each group. With seaborn, we can pass a column name to the col keyword and split the data as we wish:
In [11]:
 
sns.lmplot("total_bill", "tip", tips, col="smoker");
Pretty neat, right?
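The numerical half of that question can also be answered with pandas alone. A small sketch using a hand-made stand-in for the tips data (the numbers are invented for illustration):

```python
import pandas as pd

# A tiny stand-in for the tips dataset (values invented for illustration)
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, 3.50, 3.31],
    "smoker": ["No", "Yes", "No", "Yes"],
})

# Average tip per smoker group
avg_tip = tips.groupby("smoker")["tip"].mean()
print(avg_tip)
```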

As you dig deeper, you may want to control the details of these charts at a finer granularity. Since seaborn just calls matplotlib under the hood, you will then want to learn that library as well. For most work, however, I still prefer seaborn.
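Because every seaborn plot lives on ordinary matplotlib axes, the usual matplotlib calls still work on it. Sketched here with a plain matplotlib plot to stay self-contained; after a seaborn call you would fetch the axes the same way, via plt.gca():

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Draw something, then fine-tune the current axes by hand
plt.plot([1, 2, 3], [2, 4, 6])
ax = plt.gca()
ax.set_title("Fine-tuned title")
ax.set_xlabel("x")
ax.set_ylim(0, 10)
print(ax.get_title())  # Fine-tuned title
```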
Summary

The goal of this article is to give newcomers a small set of packages that maximize their efficiency when doing data science in Python.
