PyData Ecosystem: a data analysis ecosystem based on Python
0.
Agenda
Data Science Ecosystem
Data wrangling
Data Analysis
Data Visualization
3 Real Case Demo
Bigger Data Considerations
Spark Data Frame Demo
1.
Data Science Process
Data Collection
Databases
Applications
Third-party data
Data wrangling
Enrichment
ETL/blending
Data integration
Data Analysis
Insights
Statistics
Visualization
Modeling
2.
Data wrangling
Data scientists spend 80% of their time converting data into a usable form.
Clean data: handle messy or missing data
Transform and extract data
Merge, join, and reshape data
Time series resampling
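The wrangling steps listed above can be sketched with pandas. This is a minimal illustration, not a demo from the talk: the `sales` and `stores` tables, their column names, and the choice to fill missing amounts with 0 are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records with one missing amount (illustrative data only).
sales = pd.DataFrame({
    "date": pd.to_datetime(
        ["2015-01-01", "2015-01-02", "2015-01-08", "2015-01-09"]
    ),
    "store": ["A", "B", "A", "B"],
    "amount": [10.0, np.nan, 30.0, 40.0],
})
# Hypothetical lookup table mapping each store to a city.
stores = pd.DataFrame({"store": ["A", "B"], "city": ["Beijing", "Shanghai"]})

# Clean: replace the missing amount with 0 (an assumption; dropping the row
# or interpolating are equally valid choices depending on the data).
clean = sales.fillna({"amount": 0.0})

# Transform and extract: derive a weekday name from the date column.
clean["weekday"] = clean["date"].dt.day_name()

# Merge/join: attach the city of each store.
merged = clean.merge(stores, on="store", how="left")

# Reshape: one row per date, one column per city.
wide = merged.pivot(index="date", columns="city", values="amount")

# Time series resampling: weekly totals (weeks end on Sunday by default).
weekly = merged.set_index("date")["amount"].resample("W").sum()

print(wide)
print(weekly)
```

Each step returns a new object, so intermediate results such as `merged` can be inspected interactively before moving on.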
3.Data Analysis
Interactive Data Exploration
Rich visualization
Statistical Modeling
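As a rough illustration of interactive exploration followed by a simple statistical model, the sketch below summarizes a small synthetic dataset and fits a straight line by least squares. The data, the true slope and intercept (2.0 and 1.0), and the noise level are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
df = pd.DataFrame({"x": x, "y": y})

# Exploration: summary statistics and the correlation between the columns.
summary = df.describe()
corr = df["x"].corr(df["y"])

# Modeling: ordinary least-squares fit of a degree-1 polynomial (a line).
slope, intercept = np.polyfit(df["x"].to_numpy(), df["y"].to_numpy(), deg=1)

print(summary)
print(f"correlation={corr:.3f}, slope={slope:.3f}, intercept={intercept:.3f}")
```

In an IPython session, `df.plot.scatter(x="x", y="y")` would be a natural next step if matplotlib is installed.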
4.Python vs R
TIOBE Index
5.Pros and Cons
R+visualization = Perfect Match
R, the lingua franca of statistics (developed by statisticians)
R is slow
Python is multi-purpose language
Python is a challenger to R, both in visualization and as a replacement for essential R packages
6.PyData Ecosystem
Fundamental Libs
NumPy\SciPy
Advanced Libs
pandas\SymPy\scikit-learn\xray\Blaze
7.Numpy
High-Performance N-dimensional Array Operation Lib
High-performance multidimensional arrays
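A short sketch of what array operations look like in NumPy: whole-array arithmetic, reductions along an axis, broadcasting, and matrix multiplication, all without explicit Python loops. The array values are arbitrary and chosen only for illustration.

```python
import numpy as np

# A 2x3 array: [[0, 1, 2], [3, 4, 5]].
a = np.arange(6).reshape(2, 3)

# Element-wise arithmetic applies to the whole array at once.
scaled = a * 10

# Reduction along axis 0 gives one mean per column.
col_means = a.mean(axis=0)

# Broadcasting: the 1-D column means are subtracted from every row.
centered = a - col_means

# Matrix product of a (2x3) with its transpose (3x2) gives a 2x2 array.
product = a @ a.T

print(scaled)
print(centered)
print(product)
```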
8.pandas
Packaged
9.Blaze
High-level user interface for databases and array computing systems
10.Spark
11.DataFrame
12.matplotlib
13.seaborn
14.Bokeh
15.IPython
PyconChina2015 as Tangerine Strong Pydata Ecosystem