Using MXNet's NDArray to Process Data
2018-03-06 14:29 by ronny
Ndarray.ipynb
NDArray Introduction
The object of machine learning is data. Data is usually collected by external sensors, digitized, and stored in a computer; it may take many forms, such as text, sound, images, or video.
This digitized data is eventually loaded into memory for various kinds of cleaning and computation.
Almost all machine learning algorithms involve a variety of mathematical operations on data, such as addition and subtraction, dot products, and matrix multiplication. We therefore need an easy-to-use, efficient, and powerful tool to represent these arrays of data and to support all kinds of complex mathematical operations on them.
Many efficient algorithms have been developed for vector and matrix computation, implemented in libraries such as OpenBLAS, ATLAS, and MKL.
For Python, NumPy is undoubtedly a powerful toolkit for data science: it provides a powerful array representation for high-dimensional data, supports broadcasting operations, and offers rich functionality such as linear algebra, Fourier transforms, and random number generation.
MXNet's NDArray is very similar to NumPy's ndarray: NDArray provides the core data structure for all mathematical computation in MXNet. An NDArray represents a multidimensional, fixed-size array and supports heterogeneous computing. So why not just use NumPy? MXNet's NDArray offers two additional benefits:
- Heterogeneous computing: data can be computed efficiently on CPUs, on GPUs, and in multi-GPU hardware environments.
- Lazy evaluation: for complex operations, NDArray can automatically parallelize work across devices with multiple compute units.
Important attributes of NDArray
Every NDArray has the following important attributes, which we can access through the corresponding API:
ndarray.shape
: The dimensions of the array. It returns an integer tuple whose length equals the number of dimensions of the array, and each element of the tuple gives the length of the array along that dimension. For example, for a matrix with n rows and m columns, its shape is (n, m).
ndarray.dtype
: The type of the elements in the array. It returns a numpy.dtype, which can be int32, float32, float64, and so on; the default is float32.
ndarray.size
: The number of elements in the array, which equals the product of all the elements of ndarray.shape.
ndarray.context
: The device on which the array is stored, for example cpu() or gpu(1).
    import mxnet as mx
    import mxnet.ndarray as nd

    a = nd.ones(shape=(2, 3), dtype='int32', ctx=mx.gpu(1))
    print(a.shape, a.dtype, a.size, a.context)
Creating an NDArray
There are generally three ways to create an NDArray:
- Use ndarray.array to convert a Python list or a numpy.ndarray directly into an NDArray.
- Use built-in functions such as zeros and ones, or the random-number module ndarray.random, to create an NDArray pre-filled with data.
- Reshape a one-dimensional NDArray into the desired shape.
    import numpy as np

    l = [[1, 2], [3, 4]]
    print(nd.array(l))            # from list to NDArray
    print(nd.array(np.array(l)))  # from np.array to NDArray

    # create an NDArray directly with functions
    print(nd.zeros((3, 4), dtype='float32'))
    print(nd.ones((3, 4), ctx=mx.gpu()))

    # generate an NDArray of a given shape from a normal random-number
    # engine; distribution parameters such as the mean and standard
    # deviation can also be specified
    print(nd.random.normal(shape=(3, 4)))
    print(nd.arange(18).reshape((3, 2, 3)))
Viewing an NDArray
In general, we can view the contents of an NDArray with print directly, or use the asnumpy() method to convert an NDArray into a numpy.ndarray.
    a = nd.random.normal(0, 2, shape=(3, 3))
    print(a)
    print(a.asnumpy())
Basic mathematical operations
NDArrays support a range of mathematical operations such as addition and subtraction, and most operations are carried out element-wise.
    shape = (3, 4)
    x = nd.ones(shape)
    y = nd.random_normal(0, 1, shape=shape)
    x + y             # element-wise addition
    x * y             # element-wise multiplication
    nd.exp(y)         # element-wise exponential
    nd.sin(y**2).T    # square y element-wise, take the sine, then transpose the result
    nd.maximum(x, y)  # element-wise maximum of x and y
It is important to note that the * operator performs element-wise multiplication between two NDArrays; to multiply matrices, use the ndarray.dot function:
    nd.dot(x, y.T)
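To make the difference concrete, here is a minimal sketch (reusing x and y from the block above) contrasting the shapes the two operations produce:

    print((x * y).shape)         # (3, 4): the element-wise product keeps the shape
    print(nd.dot(x, y.T).shape)  # (3, 3): matrix product of (3, 4) and (4, 3)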
Indexing and slicing
MXNet NDArray offers a variety of slicing methods, which are largely consistent with slicing Python lists and with the slicing operations of numpy.ndarray.
    x = nd.arange(0, 9).reshape((3, 3))
    x[1:3]       # take rows 1 and 2 of x along axis 0
    x[1:2, 1:3]  # take row 1 along axis 0 and columns 1 and 2 along axis 1
Memory and in-place operations
When NDArrays are used in arithmetic operations, each operation allocates new memory to store its result. For example, if we write y = x + y, the name y is rebound from its current NDArray to a newly created one. We can view this operation as two steps: z = x + y; y = z.
We can verify this with Python's built-in function id(). id() returns an identifier for an object that is guaranteed to be unique while the object exists; in CPython this identifier is in fact the object's memory address.
    x = nd.ones((3, 4))
    y = nd.ones((3, 4))
    before = id(y)
    y = x + y
    print(before, id(y))
In many cases we want to operate on an array in place, which we can do with statements like the following:
    y += x
    print(id(y))
    nd.elemwise_add(x, y, out=y)
    print(id(y))
    y[:] = x + y
    print(id(y))
In NDArray, an ordinary assignment statement such as y = x makes y just an alias of x: x and y share the same underlying data storage.
    x = nd.ones((2, 2))
    y = x
    print(id(x))
    print(id(y))
If we want a genuine copy of x, we can use the copy() method:
    y = x.copy()
    print(id(y))
Broadcasting
Broadcasting is a powerful mechanism that allows NDArrays of different shapes to take part in mathematical operations together. We often have a small array and a large array, and want to apply the small array repeatedly in some computation over the large array.
For example, suppose we want to add a vector to each row of a matrix. One way is an explicit loop:
    # add v to each row of x and store the result in y
    x = nd.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
    v = nd.array([1, 0, 1])
    y = nd.zeros_like(x)  # create an empty matrix with the same shape as x
    for i in range(4):
        y[i, :] = x[i, :] + v
    print(y)
This works, but when the matrix x is very large, computing the result with a Python loop is slow. We can take a different approach:
    x = nd.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
    v = nd.array([1, 0, 1])
    vv = nd.tile(v, (4, 1))  # stack 4 copies of v on top of each other
    y = x + vv               # add x and vv element-wise
    print(y)

    # broadcast_to achieves the same thing
    vv = v.broadcast_to((4, 3))
    print(vv)
NDArray's broadcasting mechanism lets us skip constructing vv as above and perform the operation directly:
    x = nd.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
    v = nd.array([1, 0, 1])
    y = x + v
    print(y)
Broadcasting two arrays follows the rules below (illustrated in the sketch after the list):
- If the arrays have different ranks, the shape of the lower-rank array is extended with dimensions of length 1 until the two shapes have the same length.
- Two arrays are compatible in a dimension if they have the same length in that dimension, or if one of them has length 1 in that dimension.
- Two arrays can be broadcast together if they are compatible in all dimensions.
- If the two input arrays have different sizes, the result has the larger size: after broadcasting, each array behaves as if its shape were the element-wise maximum of the two input shapes.
- In any dimension where one array has length 1 and the other has length greater than 1, the first array behaves as if it had been copied along that dimension.
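As a quick illustration of these rules, here is a minimal sketch (the shapes are made up for this example):

    a = nd.ones((4, 3))               # shape (4, 3)
    b = nd.arange(3)                  # shape (3,): extended to (1, 3), then stretched to (4, 3)
    print((a + b).shape)              # (4, 3)

    c = nd.arange(4).reshape((4, 1))  # shape (4, 1)
    d = nd.arange(3).reshape((1, 3))  # shape (1, 3)
    print((c + d).shape)              # both stretched to (4, 3)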
Computing on the GPU
NDArray supports arrays stored on GPU devices, which is the biggest difference between MXNet's NDArray and NumPy's ndarray. By default, all NDArray operations run on the CPU, and we can query the device an array lives on through ndarray.context. In an environment with GPU support, we can place an NDArray on a GPU device.
    gpu_device = mx.gpu(0)

    def f():
        a = mx.nd.ones((100, 100))
        b = mx.nd.ones((100, 100), ctx=mx.cpu())
        c = a + b.as_in_context(a.context)
        print(c)

    f()  # runs on the CPU

    # run on the GPU
    with mx.Context(gpu_device):
        f()
The with statement above establishes a GPU context: every statement inside the block that does not explicitly specify a context uses the context given by the with statement.
The current version of NDArray requires that all arrays participating in an operation share the same context. We can use as_in_context to move an NDArray from one context to another.
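For instance, a minimal sketch of such a context switch (assuming a GPU is available as mx.gpu(0)):

    a = mx.nd.ones((2, 3), ctx=mx.cpu())
    b = mx.nd.ones((2, 3), ctx=mx.gpu(0))
    # a + b would fail because the operands live in different contexts
    c = a.as_in_context(b.context) + b  # copy a to the GPU first
    print(c.context)                    # gpu(0)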
Serialization of NDArray
There are two ways to serialize and save an NDArray object. The first is to use pickle, just as we would serialize any other Python object:
    import pickle

    a = nd.ones((2, 3))
    data = pickle.dumps(a)  # serialize an NDArray to bytes in memory
    b = pickle.loads(data)  # deserialize an NDArray from in-memory bytes

    pickle.dump(a, open('tmp.pickle', 'wb'))   # serialize an NDArray directly to a file
    b = pickle.load(open('tmp.pickle', 'rb'))  # deserialize an NDArray from the file
The ndarray module also provides a more convenient interface for moving data between arrays and disk files (or distributed storage systems):
    a = mx.nd.ones((2, 3))
    b = mx.nd.ones((5, 6))
    nd.save("temp.ndarray", [a, b])  # the read/write paths also support Amazon S3, Hadoop HDFS, etc.
    c = nd.load("temp.ndarray")
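nd.save can also take a dict that maps names to NDArrays, in which case nd.load returns a dict; a small sketch (temp2.ndarray is an arbitrary file name):

    nd.save("temp2.ndarray", {"a": a, "b": b})
    d = nd.load("temp2.ndarray")
    print(d["a"], d["b"])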
Lazy evaluation and automatic parallelization
MXNet uses lazy evaluation in pursuit of the best performance. When we run a = b + 1 in Python, the Python thread simply pushes the operation to the backend execution engine and then returns. This has two benefits:
- Once an operation has been pushed to the backend, the main Python thread can continue executing the statements that follow, which is especially helpful for an interpreted language like Python when it performs heavy computation.
- The backend engine can optimize the pushed statements, for example by parallelizing them automatically.
The backend engine has to resolve data dependencies and schedule operations sensibly, but all of this is completely transparent to frontend users. We can call wait_to_read to wait until the backend has finished computing an NDArray. Operations that copy data out to other modules, such as asnumpy(), already call wait_to_read internally.
    import time

    def do(x, n):
        """Push computation into the backend engine."""
        return [mx.nd.dot(x, x) for i in range(n)]

    def wait(x):
        """Wait until all results are available."""
        for y in x:
            y.wait_to_read()

    tic = time.time()
    a = mx.nd.ones((1000, 1000))
    b = do(a, 50)
    print('time for all computations to be pushed into the backend engine:\n %f sec' % (time.time() - tic))
    wait(b)
    print('time for all computations to finish:\n %f sec' % (time.time() - tic))
Besides analyzing the read and write dependencies of the data, the backend engine can also parallelize statements that do not depend on each other. For example, in the following code the second and third lines can be executed in parallel:
    a = mx.nd.ones((2, 3))
    b = a + 1
    c = a + 2
    d = b * c
The following code shows parallel scheduling across different devices:
    n = 10
    a = mx.nd.ones((1000, 1000))
    b = mx.nd.ones((6000, 6000), gpu_device)

    tic = time.time()
    c = do(a, n)
    wait(c)
    print('Time to finish the CPU workload: %f sec' % (time.time() - tic))
    d = do(b, n)
    wait(d)
    print('Time to finish both CPU/GPU workloads: %f sec' % (time.time() - tic))
    tic = time.time()
    c = do(a, n)
    d = do(b, n)
    # the two statements above can run at the same time,
    # one computing on the CPU and the other on the GPU
    wait(c)
    wait(d)
    print('Both are finished in: %f sec' % (time.time() - tic))
Reference Resources
- MXNet NDArray API
- Hands-on Deep Learning: using NDArray to process data
- NDArray: imperative tensor operations on CPU/GPU