Data Charm: Data Analysis Based on Open-Source Tools
Basic Information
Author: (US) Philipp K. Janert
Translators: Huang Quan, Lu Changhui, Xu Xuemei, Fei Liufeng
Press: Tsinghua University Press
ISBN: 9787302290988
Listing date:
Published on: July 2012
Format: 16mo
Page number: 1
Edition: 1-1
Category: Computer > Computer science theory and fundamentals > Numerical computing
Introduction
Drawing on the author's many years of data analysis experience, Data Charm: Data Analysis Based on Open-Source Tools describes the concepts and methods involved in data analysis. The book consists of four parts and 19 chapters. It covers how to examine data through graphics, how to analyze data with various modeling approaches, and how to mine data, and it closes with practical applications of data analysis in business and finance. The book contains many simulations and their results, and it uses examples to show how to carry out data analysis with open-source tools. After reading it, you should have a clear understanding of how these methods are used in practice and what they are for.
Data Charm: Data Analysis Based on Open-Source Tools is well organized and easy to understand. It is suitable for data analysis enthusiasts and practitioners, and as a reference for researchers who use scientific computing as a tool. It can also serve as a textbook or reference for data analysis courses taken by undergraduate or graduate students in computer science, mathematics, engineering, and related majors.
Table of Contents
Data Charm: Data Analysis Based on Open-Source Tools
Chapter 1 Introduction
Data Analysis
What's in This Book
What's with the Workshops?
What's with the Math?
What You'll Need
What's Missing
Part I Graphics: Looking at Data
Chapter 2 A Single Variable: Shape and Distribution
Dot and Jitter Plots
Histograms and Kernel Density Estimates
Histograms
Kernel Density Estimates
(Optional) Optimal Bandwidth Selection
The Cumulative Distribution Function
(Optional) Comparing Distributions with Probability Plots and QQ Plots
Rank-Order Plots and Lift Charts
Only When Appropriate: Summary Statistics and Box Plots
Summary Statistics
Box-and-Whisker Plots
(Workshop) NumPy
NumPy in Action
NumPy in Detail
Further Reading
Chapter 3 Two Variables: Establishing Relationships
Scatter Plots
Conquering Noise: Smoothing
Splines
LOESS
Examples
Residuals
Additional Ideas and Warnings
Logarithmic Plots
Banking
Linear Regression and All That
Showing What's Important
Graphical Analysis and Presentation Graphics
(Workshop) matplotlib
Using matplotlib Interactively
Case Study: LOESS with matplotlib
Managing Properties
The matplotlib Object Model and Architecture
Odds and Ends
Further Reading
Chapter 4 Time as a Variable: Time-Series Analysis
Examples
The Task
Requirements and the Real World
Smoothing
Running Averages
Exponential Smoothing
Don't Overlook the Obvious!
The Correlation Function
Examples
Implementation Issues
(Optional) Filters and Convolutions
(Workshop) scipy.signal
Further Reading
Chapter 5 More Than Two Variables: Graphical Multivariate Analysis
False-Color Plots
A Lot at a Glance: Multiplots
The Scatter-Plot Matrix
The Co-Plot
Variations
Composition Problems
Changes in Composition
Multidimensional Composition: Tree and Mosaic Plots
Novel Plot Types
Glyphs
Parallel Coordinate Plots
Interactive Explorations
Querying and Zooming
Linking and Brushing
Grand Tours and Projection Pursuits
Tools
(Workshop) Tools for Multivariate Graphics
R
Experimental Tools
Python Chaco Library
Further Reading
Chapter 6 Intermezzo: A Data Analysis Session
A Data Analysis Session
(Workshop) gnuplot
Further Reading
Part II Analytics: Modeling Data
Chapter 7 Guesstimation and the Back of the Envelope
Principles of Guesstimation
Estimating Sizes
Establishing Relationships
Working with Numbers
Powers of Ten
Small Perturbations
Logarithms
More Examples
Things I Know
How Good Are Those Numbers?
Before You Get Started: Feasibility and Cost
After You Finish: Quoting and Displaying Numbers
(Optional) A Closer Look at Perturbation Theory and Error Propagation
Error Propagation
(Workshop) The GNU Scientific Library (GSL)
Further Reading
Chapter 8 Models from Scaling Arguments
Models
Modeling
Using and Misusing Models
Arguments from Scale
Scaling Arguments
Example: A Dimensional Argument
Example: An Optimization Problem
Example: A Cost Model
(Optional) Scaling Arguments Versus Dimensional Analysis
Other Arguments
Mean-Field Approximations
Background and Further Examples
Common Time-Evolution Scenarios
Unconstrained Growth and Decay Phenomena
Constrained Growth: The Logistic Equation
Oscillations
Case Study: How Many Servers Are Best?
Why Modeling?
(Workshop) Sage
Further Reading
Chapter 9 Arguments from Probability Models
The Binomial Distribution and Bernoulli Trials
Exact Results
Using Bernoulli Trials to Develop Mean-Field Models
The Gaussian Distribution and the Central Limit Theorem
The Central Limit Theorem
The Central Term and the Tails
Why Is the Gaussian So Useful?
(Optional) Gaussian Integrals
Power-Law Distributions and Non-Normal Statistics
Working with Power-Law Distributions
(Optional) Distributions with Infinite Expectation Values
Where to Go from Here
Other Distributions
Geometric Distribution
Poisson Distribution
Log-Normal Distribution
Special-Purpose Distributions
(Optional) Case Study: Unique Visitors over Time
(Workshop) Power-Law Distributions
Further Reading
Chapter 10 What You Really Need to Know About Classical Statistics
Genesis
Statistics Defined
Statistics Explained
Example: Formal Tests Versus Graphical Methods
Controlled Experiments Versus Observational Studies
Design of Experiments
Perspective
(Optional) Bayesian Statistics: The Other Point of View
The Frequentist Interpretation of Probability
The Bayesian Interpretation of Probability
Bayesian Data Analysis: A Worked Example
Bayesian Inference: Summary and Discussion
(Workshop) R
Further Reading
Chapter 11 Mythbusting: Bigfoot, Least Squares, and All That
How to Average Averages
Simpson's Paradox
The Standard Deviation
How to Calculate
(Optional) One over What?
(Optional) The Standard Error
Least Squares
Statistical Parameter Estimation
Function Approximation
Further Reading
Part III Computation: Mining Data
Chapter 12 Simulations
A Warm-Up Question
Monte Carlo Simulations
Combinatorial Problems
Obtaining Outcome Distributions
Pro and Con
Resampling Methods
The Bootstrap
When Does Bootstrapping Work?
Bootstrap Variants
(Workshop) Discrete Event Simulations with SimPy
Introducing SimPy
The Simplest Queueing Process
(Optional) Queueing Theory
Running SimPy Simulations
Summary
Further Reading
Chapter 13 Finding Clusters
What Constitutes a Cluster?
A Different Point of View
Distance and Similarity Measures
Common Distance and Similarity Measures
Clustering Methods
Center Seekers
Tree Builders
Neighborhood Growers
Pre- and Postprocessing
Scale Normalization
Cluster Properties and Evaluation
Other Thoughts
A Special Case: Market Basket Analysis
A Word of Warning
(Workshop) Pycluster and the C Clustering Library
Further Reading
Chapter 14 Seeing the Forest for the Trees: Finding Important Attributes
Principal Component Analysis
Motivation
(Optional) Theory
Interpretation
Computation
Practical Points
Biplots
Visual Techniques
Multidimensional Scaling
Network Graphs
Kohonen Maps
(Workshop) PCA with R
Further Reading
Linear Algebra
Chapter 15 Intermezzo: When More Is Different
A Horror Story
Some Suggestions
What About Map/Reduce?
(Workshop) Generating Permutations
Further Reading
Part IV Applications: Using Data
Chapter 16 Reporting, Business Intelligence, and Dashboards
Business Intelligence
Reporting
Corporate Metrics and Dashboards
Recommendations for a Metrics Program
Data Quality Issues
Data Availability
Data Consistency
(Workshop) Berkeley DB and SQLite
Berkeley DB
SQLite
Further Reading
Chapter 17 Financial Calculations and Modeling
The Time Value of Money
A Single Payment: Future and Present Value
Multiple Payments: Compounding
Calculational Tricks with Compounding
The Whole Picture: Cash-Flow Analysis and Net Present Value
Uncertainty in Planning and Opportunity Costs
Using Expectation Values to Account for Uncertainty
Opportunity Costs
Cost Concepts and Depreciation
Direct and Indirect Costs
Fixed and Variable Costs
Capital Expenditure and Operating Cost
Should You Care?
Is This All That Matters?
(Workshop) The Newsvendor Problem
(Optional) Exact Solution
Further Reading
The Newsvendor Problem
Chapter 18 Predictive Analytics
Topics in Predictive Analytics
Some Classification Terminology
Algorithms for Classification
Instance-Based Classifiers and Nearest-Neighbor Methods
Bayesian Classifiers
Regression
Support Vector Machines
Decision Trees and Rule-Based Classifiers
Other Classifiers
The Process
Ensemble Methods: Bagging and Boosting
Estimating Prediction Error
Class Imbalance Problems
The Secret Sauce
The Nature of Statistical Learning
(Workshop) Two Do-It-Yourself Classifiers
Further Reading
Chapter 19 Epilogue: Facts Are Not Reality
Appendix A Programming Environments for Scientific Computation and Data Analysis
Appendix B Results from Calculus
Appendix C Working with Data
Index
Source of this book: China Interactive Publishing Network