Table of Contents
Preface 1
Chapter 1. Preliminaries 5
What Is This Book About? 5
Why Python for Data Analysis? 6
Essential Python Libraries 7
Installation and Setup 10
Community and Conferences 16
Navigating This Book 16
Acknowledgements 18
Chapter 2. Introductory Examples 20
1.usa.gov Data from bit.ly 21
MovieLens 1M Data Set 29
US Baby Names 1880-2010 35
Conclusions and the Path Ahead 47
Chapter 3. IPython: An Interactive Computing and Development Environment 48
IPython Basics 49
Introspection 51
Using the Command History 60
Interacting with the Operating System 63
Software Development Tools 66
IPython HTML Notebook 75
Tips for Productive Code Development Using IPython 77
Advanced IPython Features 79
Acknowledgements 81
Chapter 4. NumPy Basics: Arrays and Vectorized Computation 82
The NumPy ndarray: A Multidimensional Array Object 83
Universal Functions: Fast Element-wise Array Functions 98
Data Processing Using Arrays 100
File Input and Output with Arrays 107
Linear Algebra 109
Random Number Generation 111
Example: Random Walks 112
Chapter 5. Getting Started with pandas 115
Introduction to pandas Data Structures 116
Essential Functionality 126
Summarizing and Computing Descriptive Statistics 142
Handling Missing Data 148
Hierarchical Indexing 153
Other pandas Topics 158
Chapter 6. Data Loading, Storage, and File Formats 162
Reading and Writing Data in Text Format 162
Binary Data Formats 179
Interacting with HTML and Web APIs 181
Interacting with Databases 182
Chapter 7. Data Wrangling: Clean, Transform, Merge, Reshape 186
Combining and Merging Data Sets 186
Reshaping and Pivoting 200
Data Transformation 204
String Manipulation 217
Example: USDA Food Database 224
Chapter 8. Plotting and Visualization 231
A Brief matplotlib API Primer 231
Plotting Functions in pandas 244
Plotting Maps: Visualizing Haiti Earthquake Crisis Data 254
Python Visualization Tool Ecosystem 260
Chapter 9. Data Aggregation and Group Operations 263
GroupBy Mechanics 264
Data Aggregation 271
Group-wise Operations and Transformations 276
Pivot Tables and Cross-Tabulation 288
Example: 2012 Federal Election Commission Database 291
Chapter 10. Time Series 302
Date and Time Data Types and Tools 303
Time Series Basics 307
Date Ranges, Frequencies, and Shifting 311
Time Zone Handling 317
Periods and Period Arithmetic 322
Resampling and Frequency Conversion 327
Time Series Plotting 334
Moving Window Functions 337
Performance and Memory Usage Notes 342
Chapter 11. Financial and Economic Data Applications 344
Data Munging Topics 344
Group Transforms and Analysis 355
More Example Applications 361
Chapter 12. Advanced NumPy 368
ndarray Object Internals 368
Advanced Array Manipulation 370
Broadcasting 378
Advanced ufunc Usage 383
Structured and Record Arrays 386
More About Sorting 388
NumPy Matrix Class 393
Advanced Array Input and Output 395
Performance Tips 397
Appendix A. Python Language Essentials 401
Preface
Objective
The Python open source library ecosystem for scientific computing has grown rapidly over the past ten years. In late 2011, I felt strongly that Python programmers who were just getting started with data analysis and statistical applications were struggling because of the lack of a centralized learning resource. The key projects for data analysis, especially NumPy, matplotlib, and pandas, had matured enough that a book written about them would not soon go out of date, so I committed to such a writing project. This is the book I wish had existed when I first started using Python for data analysis in 2007. I hope you find it useful as well, and that you are able to apply the tools described here effectively in your own work.
Conventions Used in This Book
This book uses the following typographic conventions:
Italic
Used for new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as elements referenced within paragraphs, such as variables, function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Used for commands, or other text that should be typed literally by the reader.
Constant width italic
Used for text that should be replaced with user-supplied values or with values determined by context.
Note: Indicates a tip, suggestion, or general note.
Warning: Indicates a warning or caution.
Using Code Examples
This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation without asking for our permission, unless you are reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. However, selling or distributing a CD-ROM of examples from O'Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product's documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN, for example: "Python for Data Analysis by William Wesley McKinney (O'Reilly). Copyright William Wesley McKinney, 978-1-449-31979-3."
If you feel your use of code examples falls outside the scope given above, or if you are unsure whether you need permission, feel free to contact us at: [Email protected].
Contact Us
For any suggestions and questions about this book, you can contact us in the following ways:
United States:
O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
China:
Room 807, Building C, No. 2nd, Xizhimen South Street, Xicheng District, Beijing (100035)
O'Reilly Technology Consulting (Beijing) Co., Ltd.
We will list errata, examples, and any additional information on this book's web page, which can be accessed at http://oreil.ly/Python_for_Data_Analysis.
To comment or ask technical questions about this book, please send an email to:
[Email protected]
For more information about O'Reilly books, courses, conferences, and news, visit the following websites:
http://www.oreilly.com.cn
http://www.oreilly.com
You can also follow us through the following websites:
Our home page on Facebook: http://facebook.com/oreilly
Our homepage on Twitter: http://twitter.com/oreillymedia
Our home page on YouTube: http://www.youtube.com/oreillymedia
Chapter 1
Preliminaries
What Is This Book About?
This book is concerned with the nuts and bolts of manipulating, processing, cleaning, and analyzing data in Python. It is also a practical guide to scientific computing in Python, tailored specifically for data-intensive applications. It focuses on the Python language and the libraries needed to solve data analysis problems effectively; it is not an exposition of how to implement particular analytical methods in Python.
What do I mean when I say "data" in this book? Primarily structured data, a deliberately vague term that encompasses many common forms of data, such as:
Multidimensional arrays (matrices).
Tabular data in which each column may be a different type (string, numeric, date, and so on), such as data stored in relational databases or in tab- or comma-delimited text files.
Multiple tables of data interrelated by key columns (which would be primary or foreign keys for a SQL user).
Evenly or unevenly spaced time series.
This is by no means a complete list. Even though it may not always be obvious, most data sets can be transformed into a structured form that is more suitable for analysis and modeling. If not, it may be possible to extract features from a data set into a structured form. As an example, a collection of news articles could be processed into a word frequency table, which could then be used for sentiment analysis.
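As a minimal sketch of that last transformation, here is one way to turn raw text into a word frequency table (the two article strings and the use of `collections.Counter` are illustrative assumptions, not examples from the book):

```python
from collections import Counter

# A toy corpus standing in for a collection of news articles
articles = [
    "markets rally as tech stocks surge",
    "tech earnings beat expectations as markets rally",
]

# Lowercase and split each article into words, then tally all words
# across the corpus into one frequency table
freq = Counter(
    word
    for text in articles
    for word in text.lower().split()
)

print(freq["markets"])  # each word maps to its total count across articles
```

A table like this (or, more realistically, one row of counts per article) is the kind of structured form that downstream analysis such as sentiment scoring can work with.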
Such data will already be familiar to users of spreadsheet software like Microsoft Excel, perhaps the most widely used data analysis tool in the world.
Why use Python for data analysis
Many people (myself included) easily fall in love with the Python language. Since its first appearance in 1991, Python has become one of the most popular dynamic programming languages, along with Perl, Ruby, and others. Python and Ruby have become especially popular in recent years for building websites using their numerous web frameworks, like Rails (Ruby) and Django (Python). Such languages are often called scripting languages because they can be used to write quick-and-dirty small programs, or scripts. I personally dislike the term "scripting language," as it carries the connotation that these languages cannot be used to build serious software. Among interpreted languages, Python is distinguished by its large and active scientific computing community. Adoption of Python for scientific computing in both industry applications and academic research has increased significantly since the early 2000s.
For data analysis and interactive, exploratory computing and data visualization, Python will inevitably draw comparisons with other open source and commercial domain-specific programming languages and tools, such as R, MATLAB, SAS, and Stata. In recent years, Python's steadily improving libraries (primarily pandas) have made it a strong alternative for data manipulation tasks. Combined with Python's strength in general-purpose programming, it is an excellent choice as a single language for building data-centric applications.
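As a small taste of the kind of data manipulation pandas enables (the table below is made-up sample data, not an example from the book), a group-and-aggregate operation takes a single line:

```python
import pandas as pd

# A made-up table in which each column has its own type
df = pd.DataFrame({
    "city": ["Austin", "Dallas", "Austin"],
    "year": [2011, 2011, 2012],
    "sales": [12.5, 9.0, 14.1],
})

# Group rows by city and sum the sales within each group
totals = df.groupby("city")["sales"].sum()
print(totals)
```

Operations like this, which would take explicit loops and bookkeeping in plain Python, are the bread and butter of pandas.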
Python as Glue
Part of Python's success as a scientific computing platform is the ease with which it integrates C, C++, and Fortran code. Most modern computing environments rely on a set of Fortran and C libraries for linear algebra, optimization, integration, fast Fourier transforms, and other such algorithms. Many companies and national labs have also used Python to "glue together" legacy software systems that have been in use for more than 30 years.
Most software consists of two kinds of code: a small amount of code in which most of the execution time is spent, and a large amount of "glue code" that runs infrequently. The execution time of glue code is usually insignificant; effort is most fruitfully invested in optimizing the computational bottlenecks, sometimes by moving that code to a lower-level language such as C.
In recent years, the Cython project (http://cython.org) has become a popular way both to create compiled extensions for Python and to interface with C and C++ code.
Solving the "Two-Language" Problem
Many organizations commonly research, prototype, and test new ideas using a domain-specific computing language like MATLAB or R, and then later port those ideas to a larger production system written in, say, Java, C#, or C++. What people are increasingly finding is that Python is suitable not only for research and prototyping but also for building production systems. I believe more and more companies will go down this path, as there are often significant organizational benefits to having researchers and engineers use the same programming tools.
Why Not Python?
While Python is an excellent environment for building compute-intensive scientific applications and general-purpose systems of almost every kind, there are still a number of uses for which it is less suitable.
Because Python is an interpreted programming language, most Python code runs substantially slower than code written in a compiled language such as Java or C++. Since programmer time is usually more valuable than CPU time, many people are happy to make this tradeoff. However, in applications with very low latency requirements (such as high-frequency trading systems), the time spent programming in a lower-level, lower-productivity language like C++ to achieve the maximum possible performance may be time well spent.
Python is not an ideal language for highly concurrent, multithreaded applications, particularly those with many compute-intensive threads. The reason is that Python has what is known as the global interpreter lock (GIL), a mechanism that prevents the interpreter from executing more than one Python bytecode instruction at a time. The technical reasons for the GIL's existence are beyond the scope of this book, but for now it is unlikely to disappear any time soon. While many large-scale data processing applications must run on clusters of computers to process data sets in a reasonable amount of time, there are still situations in which a single-process, multithreaded system is desirable.
This is not to say that Python cannot execute truly multithreaded, parallel code; it simply cannot do so within a single Python process. As an example, the Cython project features easy integration with OpenMP (a C framework for parallel computing) to parallelize loops and thus significantly speed up numerical algorithms.
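For instance, a common way to sidestep the GIL without leaving Python is the standard library's `multiprocessing` module, which runs tasks in separate processes, each with its own interpreter and its own GIL. This sketch is illustrative (the `work` function is a made-up CPU-bound task, not from the book):

```python
from multiprocessing import Pool

def work(n):
    # A made-up CPU-bound task: sum of squares below n
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Each task runs in its own process, so one interpreter's GIL
    # does not serialize the others
    with Pool(processes=4) as pool:
        results = pool.map(work, [100_000] * 4)
    print(len(results))
```

This buys process-level parallelism at the cost of inter-process communication: arguments and results must be pickled between processes, so it pays off only when each task does substantial work.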