Advanced NumPy for Python Data Analysis

Last Update:2016-10-07 Source: Internet

Author: User

Tags rand scalar

Summary

NumPy is a foundation you must master to do data analysis in Python. It is the base package for high-performance scientific computing and data analysis. With NumPy we can apply standard mathematical functions to whole arrays quickly, without writing loops, and we can perform linear algebra, random number generation, Fourier transforms, and so on. For data analysis, however, its more important uses are data cleaning, filtering, subset construction, conversion, sorting, and descriptive statistics.

Creating multidimensional arrays

1. Use array to create a basic array, such as:

>>> import numpy as np

>>> a = np.array([1, 2, 3, 4])

>>> b = np.array([[1, 2, 3], [4, 5, 6]])

2. Use shape to view the array dimensions, such as:

>>> a.shape

(4,)

>>> b.shape

(2, 3)

a is a one-dimensional array with 4 elements, and b is a two-dimensional array with 2 rows and 3 columns.

3. Use zeros, ones, and empty to create an array of all 0s, an array of all 1s, or an array whose values are uninitialized (whatever happened to be in memory), such as:

>>> np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

>>> np.ones(5)

array([1., 1., 1., 1., 1.])

>>> np.empty(8)

array([1.42988904e-307, 1.42990941e-307, 1.42987885e-307,

       1.42991960e-307, 1.42988904e-307, 1.42992978e-307,

       1.42991960e-307, 1.42946125e-307])

>>> np.zeros((3, 2))

array([[0., 0.],

       [0., 0.],

       [0., 0.]])

4. Use eye to create an identity matrix (1s on the diagonal, 0s elsewhere), such as:

>>> np.eye(4)

array([[1., 0., 0., 0.],

       [0., 1., 0., 0.],

       [0., 0., 1., 0.],

       [0., 0., 0., 1.]])

5. Use arange to create an array of consecutive values, such as:

>>> np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

>>> np.arange(32).reshape((8, 4))

array([[ 0,  1,  2,  3],

       [ 4,  5,  6,  7],

       [ 8,  9, 10, 11],

       [12, 13, 14, 15],

       [16, 17, 18, 19],

       [20, 21, 22, 23],

       [24, 25, 26, 27],

       [28, 29, 30, 31]])
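
The creation functions above accept a few optional arguments that are often useful. The following is a minimal sketch (the variable names are illustrative and not part of the original article) showing the dtype argument, filling an empty buffer before reading it, arange with a start, stop, and step, and letting reshape infer one dimension with -1:

```python
import numpy as np

# zeros/ones default to float64; dtype selects another element type
zeros_int = np.zeros((3, 2), dtype=int)

# empty() only allocates memory, so assign values before reading them
buf = np.empty(4)
buf[:] = 7.0
sevens = np.full(4, 7.0)            # same result in a single call

# arange(start, stop, step) excludes the stop value
evens = np.arange(0, 10, 2)         # [0 2 4 6 8]

# -1 asks reshape to infer that dimension from the array size
grid = np.arange(32).reshape(8, -1)

print(zeros_int.dtype.kind)         # i (an integer type)
print(np.array_equal(buf, sevens))  # True
print(evens)
print(grid.shape)                   # (8, 4)
```

Note that reshape requires the total number of elements to stay the same; 32 elements fit an 8 by 4 grid, but not, for example, a 5 by 5 one.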

Operations between an array and a scalar

An operation between an array and a scalar is applied to every element of the array, such as:

>>> a

array([1, 2, 3, 4])

>>> a*2

array([2, 4, 6, 8])

>>> a**2

array([ 1,  4,  9, 16])

>>> a**0.5

array([1.        , 1.41421356, 1.73205081, 2.        ])
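
To make the "no loop" point concrete, here is a small sketch comparing the scalar operation with an explicit Python loop, and extending the same idea to two arrays of compatible shapes (NumPy calls this broadcasting). The names m and offset are made up for the illustration:

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# The scalar is applied to every element, equivalent to a Python loop
scaled = a * 2
looped = np.array([x * 2 for x in a])
print(np.array_equal(scaled, looped))   # True

# The same rule extends to arrays: a length-2 row is applied to each row of m
m = np.arange(6).reshape(3, 2)          # [[0 1] [2 3] [4 5]]
offset = np.array([10, 20])
shifted = m + offset
print(shifted.tolist())                 # [[10, 21], [12, 23], [14, 25]]
```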

Array access

1. Subscripts start from 0, such as:

>>> a

array([1, 2, 3, 4])

>>> a[0]

1

2. Use a colon to select a range of subscripts. The value to the left of the colon is the starting subscript and the value to the right is the end subscript (not included). If the left value is omitted, the range starts from 0; if the right value is omitted, it runs to the last element, such as:

>>> a[1:3]

array([2, 3])

>>> a[:]

array([1, 2, 3, 4])

>>> c = np.array([[1, 2], [3, 4], [5, 6]])

>>> c

array([[1, 2],

       [3, 4],

       [5, 6]])

>>> c[0,1]

2

>>> c[:,:]

array([[1, 2],

       [3, 4],

       [5, 6]])

>>> c[1:,:]

array([[3, 4],

       [5, 6]])
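
Since filtering and subset construction are the main uses mentioned in the summary, the sketch below shows a few more access patterns on the same 3 by 2 array: selecting a column, negative subscripts, and boolean masks. It also illustrates that a slice is a view onto the original data, so writing through it changes the original; the variable names are illustrative.

```python
import numpy as np

c = np.array([[1, 2], [3, 4], [5, 6]])

first_col = c[:, 0]        # all rows, column 0
last_row = c[-1]           # negative subscripts count from the end
big = c[c > 3]             # boolean mask keeps elements greater than 3

print(first_col.tolist())  # [1, 3, 5]
print(last_row.tolist())   # [5, 6]
print(big.tolist())        # [4, 5, 6]

# A slice is a view: modifying it modifies c as well
safe = c[1:, :].copy()     # take a copy first to keep independent data
view = c[1:, :]
view[0, 0] = 99
print(c[1, 0])             # 99
print(safe[0, 0])          # 3
```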

Array function operations

1. Unary functions, which take an array, apply the operation to each element, and return an array of results, such as:

>>> np.sqrt(a)

array([1.        , 1.41421356, 1.73205081, 2.        ])

>>> np.exp(a)

array([ 2.71828183,  7.3890561 , 20.08553692, 54.59815003])

>>> np.log(a)

array([0.        , 0.69314718, 1.09861229, 1.38629436])

>>> np.square(a)

array([ 1,  4,  9, 16])
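
The element-wise functions above can be chained, and NumPy also provides binary versions that take two arrays. A brief sketch, with a second array b invented for the example:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([4, 1, 5, 2])   # hypothetical second array for the example

# log undoes exp element by element (up to floating-point rounding)
roundtrip = np.log(np.exp(a))
print(np.allclose(roundtrip, a))   # True

# Binary element-wise functions combine two arrays of the same shape
larger = np.maximum(a, b)
total = np.add(a, b)
print(larger.tolist())             # [4, 2, 5, 4]
print(total.tolist())              # [5, 3, 8, 6]
```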

2. Mathematical and statistical methods

Statistics can be computed over a set of data in an array (a row, a column, or all elements), such as sum for the total, mean for the average, and std for the standard deviation:

>>> np.mean(c)

3.5

>>> np.sum(c)

21

>>> np.std(c)

1.707825127659933

These functions can also operate along rows or along columns, controlled by the parameter axis=1 (one result per row) or axis=0 (one result per column), such as:

>>> c.mean(1)

array([1.5, 3.5, 5.5])

>>> c.mean(0)

array([3., 4.])
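
The axis parameter is easy to mix up, so here is a small sketch that names it explicitly and uses the column means to center the data, a step that reappears in the PCA example later in this article. The keepdims argument is shown as one way to keep the result two-dimensional:

```python
import numpy as np

c = np.array([[1, 2], [3, 4], [5, 6]])

row_means = c.mean(axis=1)      # one value per row
col_means = c.mean(axis=0)      # one value per column
col_sums = c.sum(axis=0)
print(row_means.tolist())       # [1.5, 3.5, 5.5]
print(col_means.tolist())       # [3.0, 4.0]
print(col_sums.tolist())        # [9, 12]

# Subtracting the column means centers every column on 0
centered = c - col_means
print(centered.mean(axis=0).tolist())   # [0.0, 0.0]

# keepdims=True keeps a (3, 1) column instead of a flat length-3 array
row_means_2d = c.mean(axis=1, keepdims=True)
print(row_means_2d.shape)       # (3, 1)
```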

Linear algebra

1. Use dot to multiply matrices, such as:

>>> a = np.array([[5, 7, 2], [1, 4, 3]])

>>> a

array([[5, 7, 2],

       [1, 4, 3]])

>>> b = np.ones(3)

>>> b

array([1., 1., 1.])

>>> a.dot(b)

array([14.,  8.])

Or:

>>> np.dot(a, b)

array([14.,  8.])

a is a 2*3 array and b is a one-dimensional array of length 3, so a.dot(b) is a one-dimensional array of length 2.
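
As a sketch of how the shapes work out, the example below repeats the product with the @ operator (available in Python 3.5 and later, and equivalent to dot for these inputs), then reshapes b into an explicit 3*1 column to get a 2*1 result. The name gram is illustrative.

```python
import numpy as np

a = np.array([[5, 7, 2], [1, 4, 3]])
b = np.ones(3)

v = a @ b                    # (2, 3) @ (3,) gives shape (2,)
col = a @ b.reshape(3, 1)    # (2, 3) @ (3, 1) gives shape (2, 1)
gram = a @ a.T               # (2, 3) @ (3, 2) gives shape (2, 2)

print(v.tolist())            # [14.0, 8.0]
print(col.shape)             # (2, 1)
print(gram.tolist())         # [[78, 39], [39, 26]]
```

The inner dimensions must match: the number of columns of the left operand has to equal the number of rows (or the length) of the right operand.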

2. Other common operations, such as:

diag: returns the diagonal elements of a matrix as a one-dimensional array, such as:

>>> np.diag(a)

array([5, 4])

trace: calculates the sum of the diagonal elements, such as:

>>> np.trace(a)

9

eig (np.linalg.eig): calculates the eigenvalues and eigenvectors of a square matrix (very useful when computing PCA, principal component analysis)

svd (np.linalg.svd): computes the singular value decomposition (SVD)
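
The article does not show eig and svd in action at this point, so here is a minimal sketch of both. The 2*2 symmetric matrix m is made up for the illustration, and the checks simply verify the defining properties (m v = lambda v for each eigenpair, and U S Vt reconstructing the original matrix):

```python
import numpy as np

# eig: each column of vecs is an eigenvector for the matching entry of vals
m = np.array([[2.0, 1.0], [1.0, 2.0]])   # hypothetical example matrix
vals, vecs = np.linalg.eig(m)
print(np.allclose(m @ vecs, vecs * vals))       # True
print(np.allclose(sorted(vals), [1.0, 3.0]))    # True

# svd works on any 2-D array, not only square ones
a = np.array([[5.0, 7.0, 2.0], [1.0, 4.0, 3.0]])
u, s, vt = np.linalg.svd(a, full_matrices=False)
print(u.shape, s.shape, vt.shape)               # (2, 2) (2,) (2, 3)
print(np.allclose(u @ np.diag(s) @ vt, a))      # True
```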

Random number functions

1. normal produces samples from a normal (Gaussian) distribution, such as:

>>> nor = np.random.normal(size=(4, 4))

>>> nor

array([[ 1.82509434, -0.08174943, -0.03192186, -1.32022539],

       [ 0.5635118 , -0.01755259, -0.6218383 , -0.47245589],

       [ 0.65491108, -0.07561601, -0.77738699, -1.0271891 ],

       [ 0.00750912, -0.28588276,  0.04140614, -0.0730934 ]])

2. rand produces uniformly distributed samples, such as:

>>> ran = np.random.rand(10)

>>> ran

array([0.05615543, 0.30253678, 0.05719663, 0.93391993, 0.56396041,

       0.88799492, 0.90171215, 0.99980605, 0.4308874 , 0.75317069])

or create a 4*4 matrix:

>>> np.random.rand(4, 4)

array([[0.6606665 , 0.61180694, 0.80557148, 0.29191235],

       [0.45824131, 0.71035683, 0.64597049, 0.53813232],

       [0.19844871, 0.99582822, 0.66510914, 0.38786658],

       [0.22661631, 0.24502371, 0.29560581, 0.65864835]])

3. uniform produces uniformly distributed values, by default in the interval [0, 1), such as:

>>> np.random.uniform(size=(4*4))

array([0.08978688, 0.69810777, 0.60858528, 0.88008121, 0.42380056,

       0.6660461 , 0.38487761, 0.89294656, 0.8344627 , 0.33255587,

       0.15196568, 0.38325999, 0.76401535, 0.30862096, 0.83909417,

       0.88435482])
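
Because these functions return different values on every run, the outputs above cannot be reproduced exactly. One way to get repeatable results is a seeded generator; the sketch below uses np.random.default_rng, which is available in NumPy 1.17 and later (newer than the version this article was written for). The seed 42 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)            # seeded generator
normal_4x4 = rng.normal(size=(4, 4))       # standard normal samples
unit = rng.random(10)                      # uniform on [0, 1)
scaled = rng.uniform(-1.0, 1.0, size=16)   # uniform on [-1, 1)

# A new generator with the same seed reproduces the same first draw
again = np.random.default_rng(42).normal(size=(4, 4))
print(np.array_equal(normal_4x4, again))   # True
print(normal_4x4.shape, unit.shape, scaled.shape)   # (4, 4) (10,) (16,)
```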

Matrix (mat)

There are two different data types in the NumPy library, matrix and array, both of which can be used to work with numeric elements arranged in rows and columns. However, performing the same mathematical operation on the two types may give different results. In general, if you need matrix operations such as inversion, matrix multiplication, or transposition, you can convert the array to a mat matrix first.

1. Use mat() to convert an array to a matrix, such as:

>>> np.mat(a)

matrix([[5, 7, 2],

        [1, 4, 3]])

2. After conversion to mat, two matrices can be used directly in matrix calculations, such as:

Multiplication:

>>> mat_a = np.mat(a)

>>> mat_a

matrix([[5, 7, 2],

        [1, 4, 3]])

>>> mat_b = np.mat(b)

>>> mat_b

matrix([[1., 1., 1.]])

>>> mat_b.T

matrix([[1.],

        [1.],

        [1.]])

>>> mat_b = mat_b.T

>>> mat_a*mat_b

matrix([[14.],

        [ 8.]])

3. General functions return the same values on a mat as on an array, but the results of row-wise or column-wise operations stay two-dimensional, such as:

>>> mat_a

matrix([[5, 7, 2],

        [1, 4, 3]])

>>> mat_a.sum()

22

>>> mat_a.sum(1)

matrix([[14],

        [ 8]])

>>> mat_a.mean(0)

matrix([[3. , 5.5, 2.5]])

>>> mat_a.mean(0).shape

(1, 3)

>>> mat_a.sum(1).shape

(2, 1)
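
To illustrate the "different results" mentioned at the start of this section, the sketch below applies * to an array and to a matrix built from the same data. It uses np.asmatrix rather than np.mat, on the assumption that recent NumPy releases may no longer provide the np.mat alias; the matrix type itself is also no longer the recommended way to do linear algebra in current NumPy, where plain arrays with @ cover the same ground.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])
mat = np.asmatrix(arr)        # same data, matrix semantics

elementwise = arr * arr       # array: * multiplies element by element
matmul_arr = arr @ arr        # array: @ is the matrix product
matmul_mat = mat * mat        # matrix: * is already the matrix product

print(elementwise.tolist())   # [[1, 4], [9, 16]]
print(matmul_arr.tolist())    # [[7, 10], [15, 22]]
print(np.array_equal(matmul_arr, np.asarray(matmul_mat)))   # True

# Axis-wise results: flat for arrays, still two-dimensional for matrices
print(arr.sum(axis=1).shape)  # (2,)
print(mat.sum(axis=1).shape)  # (2, 1)
```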

Extended Application Examples

1. Compute the Euclidean distance between a vector inx and every row of a dataset matrix (as used in the kNN algorithm)

>>> dataset = np.array([[2,3,4,7,6], [4,3,4,5,7], [4,6,6,8,9], [2,3,6,1,6]])

>>> inx = np.array([2,3,4,5,6])

>>> rowsize = dataset.shape[0]  # number of rows

>>> diffmat = np.tile(inx, (rowsize, 1))  # use tile to repeat inx to the same shape as dataset, so the two can be subtracted

>>> diffmat2 = diffmat - dataset

>>> diffmat3 = diffmat2**2  # squared differences

>>> diffmat4 = diffmat3.sum(axis=1)**0.5  # sum each row and take the square root: the Euclidean distance
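
The same distances can be obtained without tile, because subtracting a length-5 vector from a 4*5 array is broadcast across the rows automatically. The following sketch, which is an alternative formulation and not the original author's code, also cross-checks against np.linalg.norm and picks the k nearest rows (k = 2 is an arbitrary choice for the example):

```python
import numpy as np

dataset = np.array([[2, 3, 4, 7, 6],
                    [4, 3, 4, 5, 7],
                    [4, 6, 6, 8, 9],
                    [2, 3, 6, 1, 6]])
inx = np.array([2, 3, 4, 5, 6])

# Broadcasting subtracts inx from every row; sum the squares per row
distances = np.sqrt(((dataset - inx) ** 2).sum(axis=1))

# Equivalent result using the built-in vector norm along each row
same = np.linalg.norm(dataset - inx, axis=1)
print(np.allclose(distances, same))   # True

# Indices of the k rows closest to inx
k = 2
nearest = np.argsort(distances)[:k]
print(nearest.tolist())               # [0, 1]
```

In a kNN classifier, the labels of the rows at these indices would then be used to vote on the label of inx.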

2. PCA (principal component analysis) in NumPy

Suppose we have a matrix dataset:

>>> dataset

array([[2, 3, 4, 7, 6],

       [4, 3, 4, 5, 7],

       [4, 6, 6, 8, 9],

       [2, 3, 6, 1, 6]])

To compute its PCA, proceed in the following steps:

1) First find the mean of each column, that is, of each feature; axis=0 means by column, axis=1 means by row

>>> meanvals = np.mean(dataset, axis=0)

2) Subtract the means from the original matrix

>>> meanremove = dataset - meanvals

3) Compute the covariance matrix of the mean-centered matrix

>>> covmat = np.cov(meanremove, rowvar=0)

4) Compute the eigenvalues and eigenvectors of the covariance matrix, using the eig function in NumPy's linalg module.

>>> eigvals, eigvects = np.linalg.eig(np.mat(covmat))

>>> eigvals

array([ 1.20374494e+01,  3.44539806e+00,  1.01715252e+00,

       -1.59662646e-16,  1.21625562e-16])

>>> eigvects

matrix([[ 0.20502268,  0.21893499, -0.80686681,  0.45018645, -0.41478476],

        [ 0.32626948,  0.5145318 ,  0.23557446, -0.33467377, -0.45542834],

        [-0.03039502,  0.57264251,  0.43491946,  0.64725395, -0.0675817 ],

        [ 0.86497081, -0.39326712,  0.20883181,  0.21575132, -0.02252723],

        [ 0.32002433,  0.4524887 , -0.24638376, -0.46887027,  0.78451505]])

5) Find the top N (here N is assumed to be 3) largest eigenvalues and their corresponding eigenvectors

>>> eigvalind = np.argsort(eigvals)

>>> eigvalind

array([3, 4, 2, 1, 0])

>>> eigvalind = eigvalind[:-(3+1):-1]

>>> eigvalind

array([0, 1, 2])  # subscripts of the top N eigenvalues

>>> redeigvects = eigvects[:, eigvalind]  # matrix made up of the eigenvectors that correspond to the top N eigenvalues

>>> redeigvects

matrix([[ 0.20502268,  0.21893499, -0.80686681],

        [ 0.32626948,  0.5145318 ,  0.23557446],

        [-0.03039502,  0.57264251,  0.43491946],

        [ 0.86497081, -0.39326712,  0.20883181],

        [ 0.32002433,  0.4524887 , -0.24638376]])

The redeigvects matrix generated above is the PCA result we need: it turns 5 features into 3, achieving dimensionality reduction. Assuming that the original features are x1, x2, x3, x4, x5, then after the PCA transformation the three new variables are:

y1 = 0.20502268*x1 + 0.32626948*x2 - 0.03039502*x3 .....

y2 = 0.21893499*x1 + .....

y3 = -0.80686681*x1 + ...
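
The steps above can be collected into a single function. The following is a sketch under a few assumptions, not the original author's code: the function name pca and its return values are made up, it works on plain arrays instead of mat, and it swaps np.linalg.eig for np.linalg.eigh, which is intended for symmetric matrices such as a covariance matrix and returns real eigenvalues in ascending order. It also performs the final projection, which is applied to the mean-centered data.

```python
import numpy as np

def pca(data, top_n):
    """Project data onto its top_n principal components (illustrative helper)."""
    mean_vals = data.mean(axis=0)               # 1) column means
    centered = data - mean_vals                 # 2) remove the means
    cov = np.cov(centered, rowvar=False)        # 3) covariance of the features
    eig_vals, eig_vecs = np.linalg.eigh(cov)    # 4) eigen-decomposition
    order = np.argsort(eig_vals)[::-1][:top_n]  # 5) largest eigenvalues first
    components = eig_vecs[:, order]
    projected = centered @ components           # new variables y1..y_top_n
    return projected, components, eig_vals[order]

dataset = np.array([[2, 3, 4, 7, 6],
                    [4, 3, 4, 5, 7],
                    [4, 6, 6, 8, 9],
                    [2, 3, 6, 1, 6]])

projected, components, kept_vals = pca(dataset, top_n=3)
print(projected.shape)    # (4, 3)
print(components.shape)   # (5, 3)
print(kept_vals)          # the three largest eigenvalues, in descending order
```

The signs of the eigenvector columns are not unique, so individual coefficients may differ in sign from the redeigvects output shown above while describing the same components.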


This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
