Python numpy array expansion efficiency
NumPy's ndarray makes it convenient to process multi-dimensional data.
But its biggest drawback is that it cannot be dynamically resized: "NumPy arrays have no facility for dynamically changing their size; the numpy.append() function re-allocates the entire array on every call and copies the original array into the new one." (Reference: http://blog.chinaunix.net/uid-23100982-id-3164530.html)
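To see what that quote means in practice, here is a minimal sketch of my own (not from the referenced post): np.append never grows an array in place, it always allocates a new one.

import numpy as np

a = np.arange(5)
b = np.append(a, 99)
print(b is a)   # False: a brand-new array was allocated
print(a)        # [0 1 2 3 4]: the original array is untouched
# Appending in a loop therefore re-copies every existing element each
# iteration, so n appends cost O(n^2) element copies in total.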
Scenario:
Today, while using ndarray to process 42000 data records, I ran into an array expansion efficiency problem.
File name: train.csv (download link at the end of this post)
Size: 73.2 MB
File format: .csv
File data: 42001*785 (42001 rows by 785 columns: 42000 samples plus the header row; 784 pixel values plus the label column)
File description: the first row is the header and is ignored; the first column is the sample label column.
Objective: read all the data, storing the sample values in one matrix and the sample labels in another.
Method 1:
Use ndarray's vstack() to merge two matrices. The idea: create an ndarray holding just the first row, then read the file line by line, merging each newly read line into the existing matrix with vstack. Eventually this builds up the 42000-row data matrix.
Code:
from numpy import *
import time

DIM = 28

def img2matrix(filename):
    start_time = time.time()
    fr = open(filename)
    # drop the header row
    fr.readline()
    # seed the matrix and the label vector with the first data line
    first = fr.readline().strip().split(',')
    return_mat = array(first[1:])
    labels = array([first[0]])
    training_num = 1
    for line in fr.readlines():
        vector = line.strip().split(',')
        labels = hstack((labels, array([vector[0]])))
        return_mat = vstack((return_mat, vector[1:]))
        training_num += 1
        print(training_num)
    end_time = time.time()
    print(end_time - start_time)
    return return_mat, labels, training_num
Result:
1096.56099987 # About 18 minutes
Cause analysis:
After investigation, ndarray.vstack() turned out to be the performance bottleneck of this program. Every call copies all of the data in return_mat plus the new vector into a freshly allocated matrix, so it is quite time-consuming. While the program runs, you can easily see it slowing down as return_mat grows: printing training_num, the counter jumps by thousands at first, then by hundreds, then by dozens, then by just a few ...
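To illustrate the effect in isolation, here is a small benchmark sketch of my own (not the author's code): the cost of a single vstack call grows linearly with the number of rows already in the matrix, so n appends in a loop cost O(n^2) overall.

import time
import numpy as np

extra = np.zeros((1, 784))
for rows in (1000, 2000, 4000, 8000):
    mat = np.zeros((rows, 784))
    t0 = time.time()
    for _ in range(100):
        np.vstack((mat, extra))  # copies all `rows` existing rows plus the new one
    print(rows, (time.time() - t0) / 100)
# Doubling `rows` roughly doubles the time of a single append.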
Below is the source of vstack from GitHub. I could not find the source of _nx.concatenate -- if any expert has located it, please tell me ~ (As far as I can tell, _nx is an alias for numpy.core.numeric, and concatenate itself is implemented in C inside NumPy's multiarray module, which would explain why there is no Python source for it.)
def vstack(tup):
    """
    Stack arrays in sequence vertically (row wise).

    Take a sequence of arrays and stack them vertically to make
    a single array. Rebuild arrays divided by `vsplit`.

    Parameters
    ----------
    tup : sequence of ndarrays
        Tuple containing arrays to be stacked. The arrays must have
        the same shape along all but the first axis.

    Returns
    -------
    stacked : ndarray
        The array formed by stacking the given arrays.

    See Also
    --------
    hstack : Stack arrays in sequence horizontally (column wise).
    dstack : Stack arrays in sequence depth wise (along third dimension).
    concatenate : Join a sequence of arrays together.
    vsplit : Split array into a list of multiple sub-arrays vertically.

    Notes
    -----
    Equivalent to ``np.concatenate(tup, axis=0)`` if `tup` contains arrays
    that are at least 2-dimensional.

    Examples
    --------
    >>> a = np.array([1, 2, 3])
    >>> b = np.array([2, 3, 4])
    >>> np.vstack((a,b))
    array([[1, 2, 3],
           [2, 3, 4]])

    >>> a = np.array([[1], [2], [3]])
    >>> b = np.array([[2], [3], [4]])
    >>> np.vstack((a,b))
    array([[1],
           [2],
           [3],
           [2],
           [3],
           [4]])

    """
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
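As the Notes section says, vstack is just concatenate along axis 0 after promoting everything to at least 2-D. A quick check (my own snippet, not part of the original post) confirms the equivalence:

import numpy as np

a = np.array([1, 2, 3])
b = np.array([2, 3, 4])
v = np.vstack((a, b))
c = np.concatenate([np.atleast_2d(m) for m in (a, b)], axis=0)
print(np.array_equal(v, c))  # True: identical results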
Method 2:
Since growing the matrix one row at a time is too slow, let's simply be a bit brute-force about it: allocate return_mat at its full size in one go, then fill in the data.
Code:
from numpy import *
import time

DIM = 28

def img2matrix(filename):
    start_time = time.time()
    fr = open(filename)
    # count the data lines, then rewind to the start of the file
    training_num = len(fr.readlines()) - 1
    return_mat = zeros((training_num, DIM * DIM))
    labels = []
    index = 0
    fr.seek(0, 0)
    # drop the header row
    fr.readline()
    for line in fr.readlines():
        vector = line.strip().split(',')
        labels.append(vector[0])
        return_mat[index, :] = vector[1:]
        index += 1
    end_time = time.time()
    print(end_time - start_time)
    return return_mat, array(labels), training_num
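For reference, a hypothetical call might look like this (assuming train.csv sits in the current working directory):

return_mat, labels, training_num = img2matrix('train.csv')
print(return_mat.shape)   # expected: (42000, 784)
print(labels[:5])         # the first five sample labels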
Result:
7.63100004196 # about 7.6 seconds
Cause analysis:
As you can see, ndarray's "stomach" is quite large: creating a 42000*784 matrix in one go causes no strain at all. How large can an array get before it exceeds memory? For details, see this StackOverflow post: Very large matrices using Python and NumPy.
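A quick back-of-the-envelope check (my own addition) shows why this matrix poses no problem: at NumPy's default float64, 42000 * 784 elements come to roughly 250 MB.

import numpy as np

mat = np.zeros((42000, 784))           # default dtype is float64, 8 bytes per element
print(mat.nbytes)                      # 263424000 bytes
print(round(mat.nbytes / 2.0**20, 1))  # ~251.2 MB, easy for most machines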
Test data: https://www.kaggle.com/c/digit-recognizer/data. Comrades are welcome to join my Kaggle project team :)