Python numpy array expansion efficiency

NumPy's ndarray makes it convenient to work with multi-dimensional data.
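For example, a single block of data can be viewed, indexed, and sliced along several axes (a minimal illustrative sketch, not from the original post):

import numpy as np

a = np.arange(12).reshape(3, 4)  # one flat buffer viewed as a 3x4 matrix
print(a[1, 2])      # single element: 6
print(a[:, 0])      # first column: [0 4 8]
print(a.T.shape)    # transposed view, no copy: (4, 3)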


But its biggest drawback is that it cannot be dynamically expanded: "NumPy arrays have no facility for dynamically changing their size; the numpy.append() function re-allocates the entire array on every call and copies the original array into the new one." (Reference: http://blog.chinaunix.net/uid-23100982-id-3164530.html)
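You can see the copy-on-append behaviour directly (a quick illustrative sketch):

import numpy as np

a = np.arange(3)
b = np.append(a, 4)   # allocates a new array and copies a into it
print(b)              # [0 1 2 4]
print(a)              # [0 1 2], the original array is unchanged
print(np.shares_memory(a, b))  # False: b is a fresh copy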


Scenario:

Today, while using ndarray to process 42000 data records, we ran into an array-expansion efficiency problem.

File name: train.csv (download link at the end of this post)

Size: 73.2 MB

File format: .csv

File data: 42001 rows × 785 columns

File description: the first row is the header and is ignored; the first column is the sample label column.

Objective: read all the data, storing the sample values in one matrix and the sample labels in another.


Method 1:

Use numpy.vstack() to merge two matrices. The idea: create an ndarray with a single row, then read the file line by line, merging each newly read row into the existing matrix until it grows into the full 42000-row matrix.


Code:

from numpy import *
import time

DIM = 28

def img2matrix(filename):
    start_time = time.time()
    fr = open(filename)
    # drop the header line
    fr.readline()
    # seed the matrix and the label array from the first data line
    first = fr.readline().strip().split(',')
    return_mat = array(first[1:])
    labels = array([first[0]])
    training_num = 1  # the first data line is already stored
    for line in fr.readlines():
        vector = line.strip().split(',')
        labels = hstack((labels, array([vector[0]])))
        # vstack re-allocates and copies the whole matrix on every call
        return_mat = vstack((return_mat, vector[1:]))
        training_num += 1
        print(training_num)
    end_time = time.time()
    print(end_time - start_time)
    return return_mat, labels, training_num


Result:

1096.56099987 # about 18 minutes


Cause analysis:

After investigation, numpy.vstack() turns out to be the program's bottleneck. Every call copies all of return_mat plus the new vector into a freshly allocated matrix, so appending the i-th row costs a copy of i rows; the total work is on the order of 42000²/2 ≈ 8.8 × 10⁸ row copies, i.e. quadratic in the number of rows. You can watch the slowdown while the program runs: as return_mat grows, the printed training_num climbs by thousands at first, then by hundreds, then by dozens, then just a few at a time...
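A small benchmark makes the quadratic behaviour visible (an illustrative sketch with synthetic data and a helper I'm naming grow_by_vstack, not the original measurement):

import numpy as np
import time

def grow_by_vstack(n, width=784):
    mat = np.zeros((1, width))
    t0 = time.time()
    for _ in range(n):
        # each call copies the entire matrix built so far
        mat = np.vstack((mat, np.zeros(width)))
    return time.time() - t0

for n in (1000, 2000, 4000):
    print(n, grow_by_vstack(n))  # time roughly quadruples as n doubles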


Below is the source of vstack from GitHub. I could not find the Python source of _nx.concatenate; if any expert has located it, please let me know. (As far as I can tell, _nx is numpy.core.numeric, and concatenate itself is implemented in C inside NumPy's core, which is why there is no Python source to find.)

def vstack(tup):
    """
    Stack arrays in sequence vertically (row wise).

    Take a sequence of arrays and stack them vertically to make a single
    array. Rebuild arrays divided by `vsplit`.

    Parameters
    ----------
    tup : sequence of ndarrays
        Tuple containing arrays to be stacked. The arrays must have the same
        shape along all but the first axis.

    Returns
    -------
    stacked : ndarray
        The array formed by stacking the given arrays.

    See Also
    --------
    hstack : Stack arrays in sequence horizontally (column wise).
    dstack : Stack arrays in sequence depth wise (along third dimension).
    concatenate : Join a sequence of arrays together.
    vsplit : Split array into a list of multiple sub-arrays vertically.

    Notes
    -----
    Equivalent to ``np.concatenate(tup, axis=0)`` if `tup` contains arrays that
    are at least 2-dimensional.

    Examples
    --------
    >>> a = np.array([1, 2, 3])
    >>> b = np.array([2, 3, 4])
    >>> np.vstack((a, b))
    array([[1, 2, 3],
           [2, 3, 4]])

    >>> a = np.array([[1], [2], [3]])
    >>> b = np.array([[2], [3], [4]])
    >>> np.vstack((a, b))
    array([[1],
           [2],
           [3],
           [2],
           [3],
           [4]])

    """
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
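As the last line shows, vstack is just a thin wrapper over concatenate; a quick check (illustrative sketch):

import numpy as np

a, b = np.array([1, 2, 3]), np.array([2, 3, 4])
v = np.vstack((a, b))
c = np.concatenate([np.atleast_2d(a), np.atleast_2d(b)], axis=0)
print(np.array_equal(v, c))  # True: identical result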


Method 2:

Since growing the matrix one row at a time is too slow, we take a more brute-force approach: allocate return_mat at its full size in one go, then simply fill in the data.


Code:

from numpy import *
import time

DIM = 28

def img2matrix(filename):
    start_time = time.time()
    fr = open(filename)
    # count the data lines (minus the header) to size the matrix up front
    training_num = len(fr.readlines()) - 1
    return_mat = zeros((training_num, DIM * DIM))
    labels = array([], dtype='<U10')
    index = 0
    fr.seek(0, 0)
    # drop the header line
    fr.readline()
    for line in fr.readlines():
        vector = line.strip().split(',')
        labels = hstack((labels, array([vector[0]])))
        # write into the pre-allocated matrix instead of re-allocating it
        return_mat[index, :] = vector[1:]
        index += 1
    end_time = time.time()
    print(end_time - start_time)
    return return_mat, labels, training_num


Result:

7.63100004196 # about 7.6 seconds


Cause analysis:

As you can see, ndarray's "belly" is quite large: allocating a 42000×784 matrix in one shot poses no pressure at all. Just how big can an array get, and what happens when you exceed that limit? For details, see the StackOverflow post: Very large matrices using Python and NumPy.
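If the row count isn't known in advance, a common middle ground (a sketch I'm adding for comparison, with a hypothetical name img2matrix_list, not one of the measurements above) is to collect rows in a Python list, which grows cheaply, and convert to an ndarray once at the end:

import numpy as np

def img2matrix_list(filename):
    labels, rows = [], []
    with open(filename) as fr:
        fr.readline()  # drop the header line
        for line in fr:
            vector = line.strip().split(',')
            labels.append(vector[0])
            rows.append(vector[1:])
    # one allocation and one copy, instead of one per row
    return np.array(rows, dtype=np.float64), np.array(labels), len(rows)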



Test data: https://www.kaggle.com/c/digit-recognizer/data. You are welcome to join my Kaggle project team :)
