Always stuck at the beginning, so what can I do? Python data analysis

Source: Internet
Author: User

Contents

Preface
Chapter 1. Preliminaries
What Is This Book About?
Why Python for Data Analysis?
Essential Python Libraries
Installation and Setup
Community and Conferences
Navigating This Book
Acknowledgements
Chapter 2. Introductory Examples
1.usa.gov Data from bit.ly
MovieLens 1M Data Set
US Baby Names 1880-2010
Conclusions and the Path Ahead
Chapter 3. IPython: An Interactive Computing and Development Environment
IPython Basics
Introspection
Using the Command History
Interacting with the Operating System
Software Development Tools
IPython HTML Notebook
Tips for Productive Code Development Using IPython
Advanced IPython Features
Acknowledgements
Chapter 4. NumPy Basics: Arrays and Vectorized Computation
The NumPy ndarray: A Multidimensional Array Object
Universal Functions: Fast Element-wise Array Functions
Data Processing Using Arrays
File Input and Output with Arrays
Linear Algebra
Random Number Generation
Example: Random Walks
Chapter 5. Getting Started with pandas
Introduction to pandas Data Structures
Essential Functionality
Summarizing and Computing Descriptive Statistics
Handling Missing Data
Hierarchical Indexing
Other pandas Topics
Chapter 6. Data Loading, Storage, and File Formats
Reading and Writing Data in Text Format
Binary Data Formats
Interacting with HTML and Web APIs
Interacting with Databases
Chapter 7. Data Wrangling: Clean, Transform, Merge, Reshape
Combining and Merging Data Sets
Reshaping and Pivoting
Data Transformation
String Manipulation
Example: USDA Food Database
Chapter 8. Plotting and Visualization
A Brief matplotlib API Primer
Plotting Functions in pandas
Plotting Maps: Visualizing Haiti Earthquake Crisis Data
Python Visualization Tool Ecosystem
Chapter 9. Data Aggregation and Group Operations
GroupBy Mechanics
Data Aggregation
Group-wise Operations and Transformations
Pivot Tables and Cross-Tabulation
Example: 2012 Federal Election Commission Database
Chapter 10. Time Series
Date and Time Data Types and Tools
Time Series Basics
Date Ranges, Frequencies, and Shifting
Time Zone Handling
Periods and Period Arithmetic
Resampling and Frequency Conversion
Time Series Plotting
Moving Window Functions
Performance and Memory Usage Notes
Chapter 11. Financial and Economic Data Applications
Data Munging Topics
Group Transforms and Analysis
More Example Applications
Chapter 12. Advanced NumPy
ndarray Object Internals
Advanced Array Manipulation
Broadcasting
Advanced ufunc Usage
Structured and Record Arrays
More About Sorting
NumPy Matrix Class
Advanced Array Input and Output
Performance Tips
Appendix A. Python Language Essentials


Preface

Objective
The open source Python library ecosystem for scientific computing has grown rapidly over the past 10 years. In late 2011, I felt strongly that Python programmers who were new to data analysis and statistical applications were struggling for lack of a central learning resource. The key projects for data analysis (especially NumPy, matplotlib, and pandas) had matured enough that a book written about them would not go out of date very quickly, so I committed to a writing project like this one. This is the book I wish had existed when I started using Python for data analysis in 2007. I hope you find it useful as well, and that you are able to apply the tools described in it effectively in your own work.
Conventions Used in This Book
This book uses the following typographical conventions:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the reader.
Constant width italic
Shows text that should be replaced with user-supplied values or with values determined by context.
Note: Signifies a tip, suggestion, or general note.
Warning: Indicates a warning or caution.
Using Code Examples
This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation without contacting us for permission, unless you are reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission, but selling or distributing a CD-ROM of examples from O'Reilly books does. Answering a question by citing this book and quoting example code does not require permission, while incorporating a significant amount of example code from this book into your product's documentation does.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN, for example: "Python for Data Analysis by William Wesley McKinney (O'Reilly). Copyright William Wesley McKinney, 978-1-449-31979-3."
If you feel your use of code examples falls outside the scope described above, or you are unsure whether you need permission, feel free to contact us at: [Email protected].
Contact Us
For any suggestions and questions about this book, you can contact us in the following ways:
United States:
O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
China:
Room 807, Building C, No. 2nd, Xizhimen South Street, Xicheng District, Beijing (100035)
O'Reilly Technology Consulting (Beijing) Co., Ltd.
We list errata, examples, and any additional information on this book's web page, which you can access at http://oreil.ly/Python_for_Data_Analysis.
To comment or ask technical questions about this book, send an email to:
[Email protected]
For more information about O'Reilly books, courses, conferences, and news, see the following websites:
http://www.oreilly.com.cn
http://www.oreilly.com
You can also follow us through the following websites:
Our home page on Facebook: http://facebook.com/oreilly
Our homepage on Twitter: http://twitter.com/oreillymedia
Our homepage on YouTube: http://www.youtube.com/oreillymedia

Excerpt

Chapter 1
Preliminaries
What Is This Book About?
This book is concerned with the nuts and bolts of manipulating, processing, cleaning, and analyzing data in Python. It is also a practical guide to scientific computing in Python, tailored for data-intensive applications. It focuses on the Python language and the libraries needed to solve a wide variety of data analysis problems effectively; it does not describe how to implement particular analytical methods using Python.
What do I mean by "data" when it appears in this book? Primarily structured data, a deliberately vague term that encompasses many common forms of data, such as:
Multidimensional arrays (matrices).
Tabular data in which each column may be a different type (string, numeric, date, and so on), such as data stored in relational databases or in tab- or comma-delimited text files.
Multiple tables of data interrelated by key columns (which would be primary or foreign keys to a SQL user).
Evenly or unevenly spaced time series.
This is by no means a complete list. Most data sets can be transformed into a structured form that is better suited to analysis and modeling, even if this is not always obvious at first. If not, it may be possible to extract features from the data set into a structured form. For example, a collection of news articles could be processed into a word frequency table, which could in turn be used for sentiment analysis.
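As a minimal sketch of the forms listed above (all of the values, column names, and "articles" here are made up for illustration), a few lines of Python can build a small mixed-type table with pandas and reduce raw text to a word frequency table:

import pandas as pd
from collections import Counter

# Tabular data: each column has its own type (string, float, datetime).
frame = pd.DataFrame({
    "city": ["Austin", "Dallas", "Austin"],
    "temperature": [35.2, 36.1, 33.8],
    "date": pd.to_datetime(["2012-07-01", "2012-07-01", "2012-07-02"]),
})

# Unstructured text reduced to a structured word frequency table,
# which could then feed something like a sentiment analysis step.
articles = [
    "python makes data analysis easier",
    "data analysis with python and pandas",
]
counts = Counter(word for doc in articles for word in doc.split())
word_freq = pd.Series(counts).sort_values(ascending=False)

print(frame)
print(word_freq)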
Finally, these kinds of data should already be familiar to users of spreadsheet programs such as Microsoft Excel, perhaps the most widely used data analysis tool in the world.
Why use Python for data analysis
Many people (myself included) find it easy to fall in love with the Python language. Since its first appearance in 1991, Python has become one of the most popular dynamic programming languages, along with Perl, Ruby, and others. Python and Ruby have become especially popular in recent years for building websites using their numerous web frameworks, such as Rails (Ruby) and Django (Python). These languages are often called scripting languages because they can be used to write quick-and-dirty little programs, or scripts. I personally dislike the term "scripting language," as it carries the connotation that such languages cannot be used to build serious software. Among interpreted languages, Python's greatest strength is its large and active scientific computing community. Since the start of the 21st century, adoption of Python for scientific computing in both industry applications and academic research has grown steadily.
For data analysis, interactive and exploratory computing, and data visualization, Python inevitably draws comparisons with other open source and commercial domain-specific languages and tools such as R, MATLAB, SAS, and Stata. In recent years, Python's steadily improving libraries (chiefly pandas) have made it a strong alternative for data manipulation tasks. Combined with Python's strength in general-purpose programming, it is possible to build data-centric applications using Python alone.
Python as Glue
Part of Python's success as a scientific computing platform is the ease of integrating C, C++, and Fortran code. Most modern computing environments draw on a common set of Fortran and C libraries for linear algebra, optimization, integration, fast Fourier transforms, and other such algorithms. Many companies and national laboratories have also used Python to glue together legacy software systems that have been in use for more than 30 years.
Most software consists of two kinds of code: a small amount that accounts for most of the execution time, and a large amount of "glue code" that runs infrequently. The execution time of the glue code is usually negligible, so developer effort is almost always better spent optimizing the computational bottlenecks, sometimes by moving that code to a lower-level language such as C.
In recent years, the Cython project (http://cython.org) has become a popular way to create compiled extensions for Python and to interface with C and C++ code.
Solving the "two languages" problem
Many organizations use a domain-specific computing language such as MATLAB or R to research, prototype, and test new ideas, and then migrate those ideas to a larger production system written in, say, Java, C#, or C++. It is increasingly recognized that Python is suitable not only for research and prototyping but also for building production systems. I believe more and more companies will take this view, as having researchers and engineers use the same programming tools brings significant organizational benefits.
Why not choose Python
While Python is an excellent environment for building compute-intensive scientific applications and general-purpose systems of almost every kind, there are still scenarios for which it is a poor fit.
Because Python is an interpreted programming language, most Python code runs substantially slower than code written in a compiled language such as Java or C++. Since programmer time is usually more valuable than CPU time, many people are happy to make this tradeoff. However, in applications with very low latency requirements (such as high-frequency trading systems), it can be worth the time spent programming in a lower-level, lower-productivity language such as C++ to squeeze out the maximum possible performance.
Python is not an ideal language for highly concurrent, multithreaded applications, particularly those with many compute-intensive threads. The reason is that Python has what is known as the global interpreter lock (GIL), a mechanism that prevents the interpreter from executing more than one Python bytecode instruction at a time. The technical reasons for the GIL's existence are beyond the scope of this book, and it is unlikely to go away any time soon. Although many big data processing applications must run on clusters of machines to process their data sets in a reasonable amount of time, there are still situations that call for a single-process, multithreaded system.
This is not to say that Python cannot execute truly multithreaded, parallel code; that code simply cannot run within a single Python process. For example, the Cython project can integrate OpenMP (a C framework for parallel computing) to parallelize loops and thereby significantly speed up numerical algorithms.
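As a rough sketch of that last point (the module name is hypothetical, and it assumes Cython is installed and the extension is compiled with OpenMP flags such as -fopenmp), a statically typed loop can release the GIL and spread its iterations across threads with cython.parallel.prange:

# parallel_demo.pyx -- a hypothetical Cython module illustrating prange
# cython: boundscheck=False
# cython: wraparound=False
from cython.parallel import prange

def squared_sum(double[:] x):
    cdef double total = 0.0
    cdef Py_ssize_t i
    # prange releases the GIL and splits iterations across OpenMP threads;
    # Cython treats the in-place += on total as a parallel reduction.
    for i in prange(x.shape[0], nogil=True):
        total += x[i] * x[i]
    return total

Once compiled into an extension module, the hot loop operates on plain C doubles, so the GIL limitation described above no longer constrains it; a NumPy float64 array can be passed in directly as the memoryview argument.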
