"Python for Data Analysis" Reading Notes, Chapter 4: NumPy Basics: Arrays and Vectorized Computation

http://www.cnblogs.com/batteryhp/p/5000104.html

Chapter 4: NumPy basics: arrays and vectorized computation

Part I: NumPy's ndarray: a multidimensional array object

To be honest, the main reason to use NumPy is its vectorized operations. NumPy itself does not offer much high-level data analysis functionality, but understanding NumPy and array-oriented computing makes it easier to understand pandas, which is built on top of it. According to the book, the author's focus is mainly on:

    • Fast vectorized operations for data munging and cleaning, subsetting and filtering, transformation, and more
    • Common array algorithms such as sorting, unique, and set operations
    • Efficient descriptive statistics and data aggregation/summarization
    • Data alignment and relational data manipulations for merging and joining heterogeneous datasets
    • Expressing conditional logic as array expressions (rather than loops with if-elif-else branches)
    • Group-wise data manipulations (aggregation, transformation, function application, etc.)

The author notes that pandas may be the better tool for much of this. I agree that pandas is clearly higher level: its functions are really convenient, and the DataFrame is the best data structure. Still, the NumPy functions are the foundation and are worth knowing well.

NumPy ndarray: a multidimensional array object

The ndarray object is the most important object in NumPy, and its defining feature is vectorization. Every element of an ndarray must have the same data type, and every array has two attributes: shape and dtype.

# -*- encoding: utf-8 -*-
import numpy as np

data = [[1, 2, 5.6], [21, 4, 2]]
data = np.array(data)
print(data.shape)
print(data.dtype)
print(data.ndim)
>>>
(2, 3)
float64
2

The array function accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data, automatically inferring a suitable data type. Another attribute is ndim, the number of dimensions of the data; the example above is two-dimensional. zeros and ones create arrays of a given length or shape filled with 0s or 1s. empty creates an array without initializing its values, and arange is the array version of Python's built-in range function.

# -*- encoding: utf-8 -*-
import numpy as np

data = [[1, 2, 5.6], [21, 4, 2], [2, 5, 3]]
data1 = [[2, 3, 4], [5, 6, 7, 3]]
data = np.array(data)
# Rows of unequal length: recent NumPy versions require dtype=object here
data1 = np.array(data1, dtype=object)
arr1 = np.zeros(10)
arr2 = np.ones((2, 3))
arr3 = np.empty((2, 3, 4))
print(arr1)
print(arr2)
# The contents of an empty array are whatever happened to be in memory
print(arr3)
print(arr3.ndim)
>>>
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[[1. 1. 1.]
 [1. 1. 1.]]
[[[3.83889007e-321 0.00000000e+000 0.00000000e+000 0.00000000e+000]
  [0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000]
  [0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000]]

 [[0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000]
  [0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000]
  [0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000]]]
3

These are the common functions for creating arrays.

Data types for ndarray
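As a small supplementary sketch of the creation functions, the block below shows arange next to zeros_like, ones_like, and eye. The variable names (evens, template, ident) are illustrative and do not come from the original notes.

```python
import numpy as np

# arange works like Python's range but returns an ndarray
evens = np.arange(0, 10, 2)
print(evens)

# zeros_like / ones_like take their shape and dtype from an existing array
template = np.array([[1.5, 2.5], [3.5, 4.5]])
z = np.zeros_like(template)
o = np.ones_like(template)
print(z.shape, z.dtype)
print(o)

# eye builds a square identity matrix
ident = np.eye(3)
print(ident)
```

The _like variants are handy when a result array must match an input array without repeating its shape by hand.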

The dtype (data type) is a special object. It contains the information an ndarray needs to interpret a chunk of memory as a particular type of data. It is one of the reasons NumPy is so flexible and powerful. In most cases dtypes map directly onto an underlying machine representation, which makes it easy to "read and write binary data streams on disk" and to "integrate code written in low-level languages (C/Fortran)". A dtype is named by a type name followed by a number giving the bit length of each element. A standard double-precision floating-point value takes 8 bytes (64 bits), so it is written float64. Common data types, with their type codes, include int32 (i4), int64 (i8), float32 (f4), float64 (f8), complex128 (c16), bool (?), object (O), and the fixed-length string (S) and Unicode (U) types.

I finally found out what f4, f8, and so on mean. The type code for Boolean data (?) is rather idiosyncratic. The astype method casts an array to another data type.

# -*- encoding: utf-8 -*-
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
print(arr.dtype)
float_arr = arr.astype(np.float64)
print(float_arr.dtype)
arr1 = np.array([2.3, 4.2, 32.3, 4.5])
# Casting floats to integers truncates the fractional part
print(arr1.astype(np.int32))
# An array of strings that all represent numbers can also be converted to a numeric type
arr2 = np.array(['2323.2', '23'])  # second value is illustrative
print(arr2.astype(float))
# The dtype of another array can also be passed to astype
int_array = np.arange(10)
calibers = np.array([.22, .270, .357, .44, .50], dtype=np.float64)
print(int_array.astype(calibers.dtype))
# Type codes work as dtypes too (the length 8 is illustrative)
print(np.empty(8, dtype='u4'))

Calling astype always creates a new array (a copy of the original data), even if the new dtype is the same as the old one. Warning: floating-point numbers can only represent values approximately, so take care when comparing decimals.

Operations between arrays and scalars
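Both points can be checked with a short sketch. The arrays here are made up for illustration, and np.isclose is one reasonable way to compare floats with a tolerance rather than the only one.

```python
import numpy as np

arr = np.array([1.0, 2.0, 3.0])

# astype returns a new array even when the dtype is unchanged
same_type = arr.astype(np.float64)
same_type[0] = 99.0
print(arr[0])                            # the original is untouched
print(np.shares_memory(arr, same_type))  # False

# Floating-point values are approximations, so == can be surprising
total = np.array([0.1]) + np.array([0.2])
print(total == 0.3)            # exact comparison fails
print(np.isclose(total, 0.3))  # comparison with a tolerance succeeds
```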

Vectorization is the most important feature of arrays: it lets you avoid explicit loops. Note that arithmetic such as subtraction is vectorized as well. Operations between arrays of different sizes are called broadcasting.
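A minimal sketch of element-wise arithmetic and broadcasting follows. The array values and the name row_offsets are assumptions made for illustration.

```python
import numpy as np

arr = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])

# Arithmetic between equal-shaped arrays is applied element by element
squared = arr * arr
zeros = arr - arr
print(squared)
print(zeros)

# A scalar is broadcast to every element
print(1 / arr)
print(arr ** 0.5)

# A 1-D array of length 3 is broadcast across both rows
row_offsets = np.array([10.0, 20.0, 30.0])
shifted = arr + row_offsets
print(shifted)
```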

Indexing and slicing are not repeated here. Note that thanks to broadcasting, assigning a single value to a slice propagates that value to every element of the slice, much like vector recycling in R. The following property is a bit of a headache: the most important difference from lists is that an array slice is a view on the original array, so any change to the view is reflected in the source data, even in a case like this:

# -*- encoding: utf-8 -*-
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
arr1 = arr[1:2]
arr1[0] = 10
print(arr)
# If you want a copy, you have to copy explicitly
arr2 = arr[3:4].copy()
arr2[0] = 10
print(arr)
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# The following two indexing forms are equivalent
print(arr2d[0][2])
print(arr2d[0, 2])
print(arr2d[:, 1])   # note the difference between this form and the next one
print(arr2d[:, :1])
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr3d[(1, 0)])
>>>
[ 1 10  3  4  5  6  7  8  9]
[ 1 10  3  4  5  6  7  8  9]
3
3
[2 5 8]
[[1]
 [4]
 [7]]
[7 8 9]

Boolean indexing

Boolean indexing selects elements with an array of True/False values. The available operators are ==, !=, ~ (negation; older NumPy versions also accepted -), & (and), and | (or). Note that Boolean indexing always creates a copy of the selected data. The Python keywords and and or do not work with Boolean arrays.
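The following sketch illustrates these rules. The names and data arrays are invented for the example, and ~ is used for negation because current NumPy versions reject - on Boolean arrays.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(28).reshape(7, 4)

# Comparing an array with a scalar yields a Boolean array
is_bob = names == 'Bob'
print(is_bob)

# Select the rows where the mask is True (this creates a copy)
bob_rows = data[is_bob]
print(bob_rows)

# Combine conditions with & and |, negate with ~
bob_or_will = (names == 'Bob') | (names == 'Will')
others = data[~bob_or_will]
print(others)

# Modifying the selected copy does not touch the original array
bob_rows[0, 0] = -1
print(data[0, 0])

# Assigning through a Boolean mask does modify the original
data[data < 5] = 0
print(data[:2])
```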

Fancy indexing

Fancy indexing means indexing with arrays of integers.

# -*- encoding: utf-8 -*-
import numpy as np

arr = np.arange(32).reshape(8, 4)
print(arr)
# Note the element-wise pairing of row and column indices here
print(arr[[1, 5, 7, 2], [0, 3, 1, 2]])
print(arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]])
# np.ix_ can also be used: it converts two 1-D integer arrays
# into an indexer that selects a rectangular region
print(arr[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])])
>>>
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]
 [24 25 26 27]
 [28 29 30 31]]
[ 4 23 29 10]
[[ 4  7  5  6]
 [20 23 21 22]
 [28 31 29 30]
 [ 8 11  9 10]]
[[ 4  7  5  6]
 [20 23 21 22]
 [28 31 29 30]
 [ 8 11  9 10]]

Unlike slicing, fancy indexing always copies the data into a new array. Be sure to note the following difference:

# -*- encoding: utf-8 -*-
import numpy as np

arr = np.arange(32).reshape(8, 4)
arr1 = np.arange(32).reshape(8, 4)
# Note that the two selections below give the same result
arr3 = arr[[1, 2, 3]][:, [0, 1, 2, 3]]
arr3_1 = arr1[1:4][:]
# This is where they differ (the assigned value 100 is illustrative)
arr3[0, 1] = 100    # fancy indexing returned a copy, so arr does not change
arr3_1[0, 1] = 100  # slicing returned a view, so arr1 changes
print(arr3)
print(arr3_1)
print(arr)
print(arr1)
>>>
[[  4 100   6   7]
 [  8   9  10  11]
 [ 12  13  14  15]]
[[  4 100   6   7]
 [  8   9  10  11]
 [ 12  13  14  15]]
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]
 [24 25 26 27]
 [28 29 30 31]]
[[  0   1   2   3]
 [  4 100   6   7]
 [  8   9  10  11]
 [ 12  13  14  15]
 [ 16  17  18  19]
 [ 20  21  22  23]
 [ 24  25  26  27]
 [ 28  29  30  31]]

Array transposition and axis swapping

Transposing returns a view on the source data without copying anything. Use the T attribute to do so. The matrix product function in NumPy is np.dot.

Higher-dimensional arrays are the more complex case:
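A short sketch of both statements, using made-up arrays: the first part checks that T shares memory with the source, and the second computes the product of a matrix's transpose with the matrix itself via np.dot.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

# .T returns a view with the axes reversed; no data is copied
arr_t = arr.T
print(arr_t.shape)
print(np.shares_memory(arr, arr_t))

# Writing through the view changes the source array
arr_t[0, 1] = 100
print(arr[1, 0])

# np.dot computes the matrix product, here x transposed times x
x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
gram = np.dot(x.T, x)
print(gram)
```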

# -*- encoding: utf-8 -*-
import numpy as np

arr = np.arange(24).reshape(2, 3, 4)
# A note on transpose:
# (1, 0, 2) turns the shape given to reshape, (2, 3, 4), into (3, 2, 4).
# Because this is a transpose, the index of every element changes accordingly;
# the element 12, for example, moves from index (1, 0, 0) to (0, 1, 0)
arr1 = arr.transpose((1, 0, 2))
arr2 = arr.T  # using T directly gives shape (4, 3, 2)
# arr3 = np.arange(120).reshape((2, 3, 4, 5))
# arr4 = arr3.T  # using T directly gives shape (5, 4, 3, 2)
# ndarray also has a swapaxes method, which takes a pair of axis numbers
arr5 = arr.swapaxes(1, 2)
# print(arr)
# print(arr1)
# print(arr2)
# print(arr3)
# print(arr4)
print(arr5)
>>>
[[[ 0  4  8]
  [ 1  5  9]
  [ 2  6 10]
  [ 3  7 11]]

 [[12 16 20]
  [13 17 21]
  [14 18 22]
  [15 19 23]]]

Part II is about element-wise functions (ufuncs): functions that operate on each element of an array. Having used R, I do not find them remarkable.

Common vectorized functions (let's call them that) include unary ones such as abs, sqrt, exp, ceil, and modf, and binary ones such as add, multiply, maximum, and greater.

Here are a few examples:

# -*- encoding: utf-8 -*-
import numpy as np
import numpy.random as npr

# A function that takes two arrays and returns the element-wise maximum
x = npr.randn(8)
y = npr.randn(8)
# Note that this is np.maximum, not max
z = np.maximum(x, y)
print(x, y, z)
# Although it is uncommon, some ufuncs do return multiple arrays.
# modf is one example: it splits each value into its fractional and integer parts,
# a vectorized version of Python's divmod
arr = npr.randn(8)
print(np.modf(arr))
# ceil returns the ceiling: the smallest integer not less than each value
print(np.ceil(arr))
# concatenate joins NumPy arrays; note that the arrays are passed together as a tuple
# arr = np.concatenate((arr, np.array([0, 0])))
# logical_not is the element-wise negation
# print(np.logical_not(arr))
print(np.greater(x, y))
print(np.multiply(x, y))

Part III: Using arrays for data processing

The author says that vectorized array operations are one to two orders of magnitude (or more) faster than their pure Python equivalents, and stresses once again that broadcasting is very powerful.
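One rough way to check the speed claim on your own machine is sketched below. The two helper functions and the array size are assumptions chosen for illustration, and the measured times depend on hardware, so no particular ratio is asserted.

```python
import timeit
import numpy as np

values = np.arange(100_000, dtype=np.float64)

def python_sum_of_squares(seq):
    """Sum of squares with an explicit Python loop."""
    total = 0.0
    for v in seq:
        total += v * v
    return total

def numpy_sum_of_squares(arr):
    """Sum of squares as a single vectorized dot product."""
    return float(np.dot(arr, arr))

# Both versions compute the same quantity
loop_result = python_sum_of_squares(values)
vec_result = numpy_sum_of_squares(values)
print(np.isclose(loop_result, vec_result))

# Time three runs of each version
loop_time = timeit.timeit(lambda: python_sum_of_squares(values), number=3)
vec_time = timeit.timeit(lambda: numpy_sum_of_squares(values), number=3)
print("loop: %.4fs, vectorized: %.4fs" % (loop_time, vec_time))
```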

# -*- encoding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt

# Suppose we want to evaluate sqrt(x^2 + y^2) on a two-dimensional grid
# Generate grid points from -5 to 5, spaced 0.01 apart
points = np.arange(-5, 5, 0.01)
# meshgrid returns two 2-D matrices describing every point in (-5, 5) x (-5, 5)
xs, ys = np.meshgrid(points, points)
z = np.sqrt(xs ** 2 + ys ** 2)
# print(xs)
# print(ys)
# It would be a shame not to draw a picture
# imshow displays the matrix z as an image; cmap is the colormap, worth studying when needed
plt.imshow(z, cmap=plt.cm.gray)
plt.colorbar()
plt.title(r"Image plot of $\sqrt{x^2 + y^2}$ for a grid of values")
plt.show()

The plotting calls above deserve closer study when you actually need them.

The following example shows the np.where function, a concise vectorized version of if-else.

import numpy as np
import numpy.random as npr

# np.where is typically used to build a new array from an existing one
arr = npr.randn(4, 4)  # shape chosen for illustration
# Set positive values to 2 and negative values to -2
print(np.where(arr > 0, 2, -2))
# Note this usage: only the positive values are replaced with 2
print(np.where(arr > 0, 2, arr))
# where can express more complex logic
# Two Boolean arrays cond1 and cond2; each of the 4 combinations gets a different value
cond1 = np.array([True, True, False, False])  # illustrative values
cond2 = np.array([True, False, True, False])  # illustrative values
# Note: the conditions take priority from left to right, like if/elif,
# even though Python evaluates the innermost np.where call first
print(np.where(cond1 & cond2, 0, np.where(cond1, 1, np.where(cond2, 2, 3))))
# I cannot think of a better way to write this.
# The book's "trick" formula relies on True = 1 and False = 0
# (written with & and ~ because newer NumPy rejects - on Boolean arrays)
result = 1 * (cond1 & ~cond2) + 2 * (cond2 & ~cond1) + 3 * ~(cond1 | cond2)
print(result)

# -*- encoding: utf-8 -*-
import numpy as np

# Note that functions such as mean and sum take an axis parameter
# that says along which dimension to compute
arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
# cumsum is not an aggregation, so the number of dimensions does not shrink
print(arr.cumsum(0))

Common mathematical and statistical methods include sum, mean, std, var, min, max, argmin, argmax, cumsum, and cumprod.

Methods for Boolean arrays

sum is often used to count the True values, while any and all test whether at least one value is True and whether every value is True, respectively.
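To make the axis parameter concrete, here is a small sketch on a 3x3 array. The variable names are illustrative.

```python
import numpy as np

arr = np.array([[0, 1, 2],
                [3, 4, 5],
                [6, 7, 8]])

# Without axis, the aggregation runs over the whole array
grand_total = arr.sum()
print(grand_total)

# axis=0 collapses the rows, giving one value per column
col_sums = arr.sum(axis=0)
print(col_sums)

# axis=1 collapses the columns, giving one value per row
row_means = arr.mean(axis=1)
print(row_means)

# cumsum keeps the shape and accumulates along the chosen axis
running = arr.cumsum(axis=1)
print(running)
```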
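A brief sketch of these three methods, with arrays invented for the example:

```python
import numpy as np

arr = np.array([-1.5, 0.3, 2.0, -0.7, 1.1])

# True counts as 1 and False as 0, so sum counts the positive values
n_positive = (arr > 0).sum()
print(n_positive)

bools = np.array([False, False, True, False])
has_any = bools.any()   # is at least one value True?
has_all = bools.all()   # is every value True?
print(has_any, has_all)
```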

Sorting and uniqueness

# -*- encoding: utf-8 -*-
import numpy as np
import numpy.random as npr

# The sort method sorts in place
arr = npr.randn(8)  # length chosen for illustration
print(arr)
arr.sort()
print(arr)
# Multidimensional arrays can be sorted along any axis by passing the axis number to sort
arr = npr.randn(5, 3)
print(arr)
# Passing 1 sorts along axis 1, that is, across the columns within each row
arr.sort(1)
print(arr)
# np.sort returns a sorted copy instead of sorting in place
# Print the 5% quantile
arr_npr = npr.randn(1000)  # length chosen for illustration
arr_npr.sort()
print(arr_npr[int(0.05 * len(arr_npr))])
# pandas has more sorting and quantile functions and can compute quantiles directly;
# the example in Chapter 2 used them
# NumPy has the unique function for uniqueness, which R also has
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will'])
print(sorted(set(names)))
print(np.unique(names))
values = np.array([6, 0, 0, 3, 2, 5, 6])
# in1d tests whether each element of one array is present in another (an interesting name);
# the result has the same length as the first array. Newer NumPy versions spell it np.isin
print(np.isin(values, [6, 2, 3]))

Common set operations include unique, intersect1d, union1d, in1d, setdiff1d, and setxor1d.
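The set functions can be sketched on two small integer arrays. The arrays a and b are illustrative.

```python
import numpy as np

a = np.array([6, 0, 0, 3, 2, 5, 6])
b = np.array([2, 3, 6, 7])

uniq = np.unique(a)            # sorted unique values of a
common = np.intersect1d(a, b)  # values present in both arrays
either = np.union1d(a, b)      # values present in at least one array
only_a = np.setdiff1d(a, b)    # values in a that are not in b
one_side = np.setxor1d(a, b)   # values in exactly one of the arrays

print(uniq, common, either, only_a, one_side)
```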

File input and output for arrays

NumPy can read and write text or binary data on disk. Later chapters cover the pandas tools for reading tabular data into memory.

np.save and np.load are the two main functions for writing and reading array data on disk. By default, arrays are saved in an uncompressed raw binary format in files with the .npy extension.

# -*- encoding: utf-8 -*-
import numpy as np

arr = np.arange(10)  # length chosen for illustration
np.save('some_array', arr)
np.savez('array_archive.npz', a=arr, b=arr)
arr1 = np.load('some_array.npy')
arch = np.load('array_archive.npz')
print(arr1)
print(arch['a'])
# Next, reading and writing text files. pandas' read_csv is the best choice,
# but sometimes you need to load data into a plain NumPy array with np.loadtxt or np.genfromtxt
# These functions have many options: various delimiters, converter functions
# for specific columns, the number of rows to skip, and so on
# np.savetxt does the opposite: it writes an array to a delimited text file
# genfromtxt is similar to loadtxt, but is geared towards structured arrays and missing data
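For the text-file functions, a minimal round-trip sketch follows. The file name table.txt and the comma delimiter are assumptions; the file is written to a temporary directory so nothing is left behind.

```python
import os
import tempfile
import numpy as np

table = np.array([[1.0, 2.5, 3.0],
                  [4.0, 5.5, 6.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'table.txt')
    # Write the array as comma-separated text
    np.savetxt(path, table, delimiter=',')
    # Read it back into a plain ndarray
    loaded = np.loadtxt(path, delimiter=',')

print(loaded)
print(np.allclose(table, loaded))
```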

Linear algebra

As for linear algebra, numpy.linalg contains many matrix functions. They rely on the same industry-standard Fortran libraries (such as BLAS and LAPACK) that MATLAB and R use.
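A few of the numpy.linalg functions in a short sketch; the matrix a and vector b are made up for the example.

```python
import numpy as np

a = np.array([[4.0, 1.0],
              [2.0, 3.0]])
b = np.array([1.0, 2.0])

# Inverse: a times its inverse should be the identity matrix
a_inv = np.linalg.inv(a)
print(np.allclose(np.dot(a, a_inv), np.eye(2)))

# Solve the linear system a x = b
x = np.linalg.solve(a, b)
print(x)

# Determinant
det = np.linalg.det(a)
print(det)

# QR decomposition: q times r reconstructs a
q, r = np.linalg.qr(a)
print(np.allclose(np.dot(q, r), a))
```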

Random number generation

The numpy.random module supplements Python's built-in random module with functions for efficiently generating sample values from many probability distributions.
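Beyond normal samples, a few other numpy.random functions are sketched here. The seed value and array sizes are arbitrary choices for illustration; seeding simply makes the draws repeatable.

```python
import numpy as np

# Seeding the generator makes the sequence of draws reproducible
np.random.seed(42)
first = np.random.uniform(size=3)
np.random.seed(42)
second = np.random.uniform(size=3)
print(np.array_equal(first, second))

# Uniform samples on [0, 1)
uniform_samples = np.random.uniform(0.0, 1.0, size=5)
# Random integers 0 or 1 (the upper bound is exclusive)
coin_flips = np.random.randint(0, 2, size=10)
# A random permutation of 0..4
shuffled = np.random.permutation(5)
print(uniform_samples, coin_flips, shuffled)
```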

# -*- encoding: utf-8 -*-
import numpy as np
import numpy.random as npr
from random import normalvariate

# Generate a 4x4 array of samples from the standard normal distribution
samples = npr.normal(size=(4, 4))
print(samples)
# As the example below shows, numpy.random is more than an order of magnitude
# faster when generating a large number of samples
n = 1000000
# In Python 2, xrange is a built-in lazy sequence type that avoids building a full list,
# which makes it much better than range for large loops; Python 3's range behaves the same way
# The two timing lines below are IPython magics, so run them in IPython:
# %timeit samples = [normalvariate(0, 1) for _ in range(n)]
# %timeit npr.normal(size=n)

Example: Random Walk

# -*- encoding: utf-8 -*-
import numpy as np
import random  # this is Python's built-in random module
import matplotlib.pyplot as plt

position = 0
walk = [position]
steps = 1000
for i in range(steps):
    step = 1 if random.randint(0, 1) else -1
    position += step
    walk.append(position)
plt.plot(walk)
plt.show()

# Now the simpler, vectorized version
nsteps = 1000
draws = np.random.randint(0, 2, size=nsteps)
steps = np.where(draws > 0, 1, -1)
walk = steps.cumsum()
plt.plot(walk)
plt.show()
# argmax returns the index of the first maximum value, but it is not very efficient here
# because it scans the entire array (the threshold 10 is illustrative)
print((np.abs(walk) >= 10).argmax())

nwalks = 5000
nsteps = 1000
draws = np.random.randint(0, 2, size=(nwalks, nsteps))
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(1)
print(walks.max())
print(walks.min())
# The argument 1 to any checks each row (axis 1) for a True value
hist30 = (np.abs(walks) >= 30).any(1)
print(hist30)
print(hist30.sum())  # the number of walks that reached 30
# argmax also takes axis 1 here
crossing_time = (np.abs(walks[hist30]) >= 30).argmax(1)
print(crossing_time.mean())
x = range(1000)
plt.plot(x, walks.T)
plt.show()

That wraps up the NumPy notes; pandas is next. The NumPy write-up went fairly smoothly.