A beginner's guide to data science work in Python

Source: Internet
Author: User
Python has an extremely rich and mature ecosystem of data science tools. Unfortunately, to those who don't know it, that ecosystem can look like a jungle (cue snake joke). In this article, I will guide you step by step into the PyData jungle.

You might ask: what about the many existing lists of recommended PyData packages? I think offering a novice too many choices is overwhelming. So this is not a recommendation list. Instead, I will cover a deliberately narrow set of tools, roughly 10% of what is out there, that can handle 90% of your work. Once you have mastered these essentials, you can browse the long lists of PyData tools and choose what to pick up next.

It is worth mentioning that the tools I introduce cover most of a data scientist's daily work (such as data input and output, data munging, and data analysis).
Installation

People often come to me and say, "I've heard that Python is good for data science, so I wanted to learn it. But it took me two days to install Python and all the other modules." Installing Python is reasonable, since you want to use it, but installing the entire PyData stack by hand is a big undertaking when you don't yet know which tools you actually need. So I strongly advise against doing so.

Fortunately, the folks at Continuum created the Anaconda Python distribution, which contains most of the PyData stack. Modules that are not included by default can be installed easily through a GUI. The distribution is available for all major platforms, so you can skip the two days of installation and start using it right away.
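Once a distribution is installed, a quick way to see whether the packages used in this article are available is to ask Python's import machinery. The following is only a minimal sketch: the helper name missing_packages is made up for this illustration and is not part of Anaconda or any library.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    # find_spec() returns None when no module with that name is installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Package names used later in this article.
needed = ["pandas", "seaborn", "IPython"]
print("Missing:", missing_packages(needed) or "nothing")
```

Anything reported as missing can then be installed individually instead of installing the whole stack by hand.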
IPython Notebook

After installing Python, most people launch it and start learning. That sounds reasonable, but sadly it's a big mistake. I have yet to see anyone run a scientific Python environment directly from the plain Python command line (though this may vary from person to person). Instead, use IPython, and in particular the IPython Notebook; these are especially powerful Python shells that are widely used in the PyData community. I strongly advise you to go straight to the IPython Notebook (IPyNB) and not bother with anything else; you will not regret it. In short, IPyNB is a Python shell that you access through a browser. It lets you mix code, text, and graphics (even interactive objects). This article was written in IPyNB. At Python conferences, almost every talk uses the IPython Notebook. IPyNB comes preinstalled with Anaconda and can be used right away. Here's what it looks like:

In [1]:

print('Hello World')

Hello World

IPyNB is evolving fast. Every time I hear a talk about IPyNB at a conference, I'm amazed by the new features its developers have come up with. To get a taste of some of its more advanced features, take a look at the following short tutorial on IPython widgets. These widgets let you control a plot interactively using sliders:

In [1]:

from IPython.display import YouTubeVideo

# Yes, it can also embed YouTube videos
YouTubeVideo('wxvx54ax47s')

Out[1]:

6. IPython Widgets – IPython Notebook Tutorial
Pandas

Usually, people will advise you to learn NumPy first (pronounced num-pie, not num-pee), the library that provides multidimensional arrays. A few years ago that was certainly the way to go, but nowadays I hardly use NumPy directly. NumPy is increasingly becoming a core library that other libraries build on, and those libraries often offer a more elegant interface. As a result, pandas has become the primary library for working with data. It can read and write data in many formats (including databases), perform joins and other SQL-like operations to reshape data, handle missing values gracefully, support time series, and provide basic plotting and statistical functionality, among much else. There is certainly a learning curve to all of its features, but I strongly recommend reading through most of the documentation first. The time you invest will pay off a thousandfold by making your data munging more efficient. Here are a few quick tricks to whet your appetite:
In [18]:

import pandas as pd

df = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20130102'),
                   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                   'D': pd.Series([1, 2, 1, 2], dtype='int32'),
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'})

In [19]:

df

Out[19]:

   A          B  C  D      E    F
0  1 2013-01-02  1  1   test  foo
1  1 2013-01-02  1  2  train  foo
2  1 2013-01-02  1  1   test  foo
3  1 2013-01-02  1  2  train  foo

You can access a column by its name:

In [17]:

df.B

Out[17]:

0   2013-01-02
1   2013-01-02
2   2013-01-02
3   2013-01-02
Name: B, dtype: datetime64[ns]

Compute the sum of D for each category in E:

In [21]:

df.groupby('E').sum().D

Out[21]:

E
test     2
train    4
Name: D, dtype: int32

Achieving the same thing with NumPy (or clunky MATLAB) would be cumbersome.
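To make the comparison concrete, here is a rough sketch of the same per-category sum done both ways. It rebuilds only columns D and E from the example. Selecting the column before summing and passing observed=True are choices made for this sketch, so that it runs on recent pandas versions; they are not taken from the original tutorial.

```python
import numpy as np
import pandas as pd

# Small frame mirroring columns D and E from the example above.
df = pd.DataFrame({
    "D": pd.Series([1, 2, 1, 2], dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
})

# pandas: one line. observed=True restricts the result to categories
# that actually occur in the data.
pandas_sums = df.groupby("E", observed=True)["D"].sum()

# NumPy: find the unique labels, then sum the matching values per label.
labels = df["E"].to_numpy()
values = df["D"].to_numpy()
unique_labels, inverse = np.unique(labels, return_inverse=True)
numpy_sums = np.bincount(inverse, weights=values).astype(values.dtype)

print(pandas_sums.to_dict())
print(dict(zip(unique_labels, numpy_sums.tolist())))
```

The NumPy version needs manual bookkeeping to map labels to positions, and it loses the labelled index that pandas returns for free.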

pandas can do much more. If you don't believe me, take a look at the tutorial "10 Minutes to pandas". The example above also comes from that tutorial.
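Two of the capabilities mentioned earlier, SQL-like joins and missing-value handling, can be sketched in a few lines. The tables and column names below (orders, customers, customer_id, amount, name) are invented for this illustration and do not come from the tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical tables, invented purely for illustration.
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [20.0, np.nan, 35.0, 12.5],  # one amount is missing
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
})

# SQL-like left join: attach the customer name to every order.
merged = orders.merge(customers, on="customer_id", how="left")

# Missing values: sum() skips NaN by default ...
total_per_customer = merged.groupby("name")["amount"].sum()

# ... while fillna() replaces them explicitly.
filled = merged["amount"].fillna(0.0)

print(merged)
print(total_per_customer)
```

Whether to skip or to fill missing values depends on what they mean in your data; pandas supports both.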
Seaborn

Matplotlib is Python's main plotting library. However, I don't recommend using it directly, for the same reason I don't recommend starting with NumPy. Matplotlib is very powerful, but it is also complex, and getting a polished plot takes a lot of tweaking. So instead, I recommend starting with Seaborn. Seaborn essentially treats matplotlib as a core library (just as pandas does with NumPy). I will briefly describe Seaborn's advantages. Specifically, it:

    1. Creates aesthetically pleasing plots by default (for one thing, the default colormap is not jet).
    2. Creates statistically meaningful plots.
    3. Understands the pandas DataFrame type, so the two work well together.

Although Anaconda came with pandas preinstalled, it did not include Seaborn. It is easy to install with conda install seaborn.
Statistically meaningful plots
In [5]:

# IPython magic to create plots within cells
%matplotlib inline

In [7]:

import seaborn as sns

# Load one of the data sets that come with seaborn
tips = sns.load_dataset("tips")

# Keyword arguments are used because newer seaborn releases no longer accept these positionally
sns.jointplot(x="total_bill", y="tip", data=tips, kind='reg');

As you can see, with just one line of code we create a fairly complex statistical plot that includes the best-fitting regression line with confidence intervals, the marginal distributions, and the correlation coefficient. Recreating this plot in matplotlib would take quite a lot of (ugly) code, including calls to SciPy to run the linear regression and manually plotting the line from the regression equation (and I can't even figure out how to draw the marginal plots or compute the confidence interval). The examples above and below are taken from the tutorial on quantitative linear models.
Works well with pandas DataFrames
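To give a sense of the manual route, here is a rough sketch that fits a line with SciPy and draws it with matplotlib. The data is synthetic, generated only so the sketch runs without downloading anything; it is not the real tips data set, and the coefficients used to generate it are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Synthetic stand-in for the bill and tip columns (NOT the real tips data).
rng = np.random.default_rng(0)
total_bill = rng.uniform(5, 50, size=100)
tip = 1.0 + 0.1 * total_bill + rng.normal(0, 0.5, size=100)

# Fit the regression by hand with SciPy.
fit = stats.linregress(total_bill, tip)

# Draw the scatter plot and the fitted line manually.
fig, ax = plt.subplots()
ax.scatter(total_bill, tip, alpha=0.6)
xs = np.linspace(total_bill.min(), total_bill.max(), 50)
ax.plot(xs, fit.intercept + fit.slope * xs, color="red")
ax.set_xlabel("total_bill")
ax.set_ylabel("tip")
ax.set_title(f"r = {fit.rvalue:.2f}")
```

Even at this length, the sketch has no marginal distributions and no confidence band, both of which the Seaborn call produces without extra code.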

Data has structure. Often there are different groups or categories we are interested in (in which case pandas's groupby functionality is amazing). The tips data set looks like this:
In [9]:

tips.head()

Out[9]:

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

We might ask whether smokers tip differently from non-smokers. Without Seaborn, this would require pandas's groupby functionality plus complex code to plot the linear regression lines. With Seaborn, we can pass a column name to the col parameter to split the data the way we want:
In [11]:

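# Keyword arguments are used because newer seaborn releases no longer accept these positionally
sns.lmplot(x="total_bill", y="tip", data=tips, col="smoker");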

Pretty neat, huh?
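For comparison, the groupby route without Seaborn might start like the sketch below, which fits one least-squares line per group. The six-row table is hypothetical (it only borrows the column names from the tips data set), and fit_line is a helper name made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data set with the same column names as tips.
data = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 10.0, 20.0, 30.0],
    "tip": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],
    "smoker": ["No", "No", "No", "Yes", "Yes", "Yes"],
})


def fit_line(group):
    """Least-squares line tip = slope * total_bill + intercept for one group."""
    slope, intercept = np.polyfit(group["total_bill"], group["tip"], deg=1)
    return slope, intercept


# One fit per value of the smoker column.
fits = {name: fit_line(group) for name, group in data.groupby("smoker")}

for name, (slope, intercept) in fits.items():
    print(f"smoker={name}: tip ~ {slope:.2f} * total_bill + {intercept:.2f}")
```

This covers only the fitting step. Laying out one panel per group and drawing each line would still be extra matplotlib code, which is what the col parameter takes care of.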

As you dig deeper, you will eventually want finer control over the details of these plots. Because Seaborn just calls matplotlib, you will probably want to learn that library at that point. Still, I prefer Seaborn for most of my work.
Summary

The idea of this article is to help newcomers become productive with Python for data science as quickly as possible by pointing them to a small subset of packages.
