Python numpy array expansion efficiency

NumPy's ndarray makes it convenient to work with multi-dimensional data.
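For example, a single block of data can be viewed, indexed, and sliced along several axes (a minimal illustrative sketch, not from the original post):

import numpy as np

a = np.arange(12).reshape(3, 4)  # one flat buffer viewed as a 3x4 matrix
print(a[1, 2])      # single element: 6
print(a[:, 0])      # first column: [0 4 8]
print(a.T.shape)    # transposed view, no copy: (4, 3)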


But its biggest drawback is that it cannot be dynamically expanded: "NumPy arrays have no facility for dynamically changing their size; the numpy.append() function re-allocates the entire array on every call and copies the original array into the new one." (Reference: http://blog.chinaunix.net/uid-23100982-id-3164530.html)
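You can see the copy-on-append behaviour directly (a quick illustrative sketch):

import numpy as np

a = np.arange(3)
b = np.append(a, 4)   # allocates a new array and copies a into it
print(b)              # [0 1 2 4]
print(a)              # [0 1 2], the original array is unchanged
print(np.shares_memory(a, b))  # False: b is a fresh copy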


Scenario:

Today, while using ndarray to process 42000 data records, we ran into an array-expansion efficiency problem.

File name: train.csv (download link at the end of this post)

Size: 73.2 MB

File format: .csv

File data: 42001 rows × 785 columns

File description: the first row is the header and is ignored; the first column is the sample label column.

Objective: read all the data, storing the sample values in one matrix and the sample labels in another.


Method 1:

Use numpy.vstack() to merge two matrices. The idea: create an ndarray with a single row, then read the file line by line, merging each newly read row into the existing matrix until it grows into the full 42000-row matrix.


Code:

from numpy import *
import time

DIM = 28

def img2matrix(filename):
    start_time = time.time()
    fr = open(filename)
    # drop the header line
    fr.readline()
    # seed the matrix and the label array from the first data line
    first = fr.readline().strip().split(',')
    return_mat = array(first[1:])
    labels = array([first[0]])
    training_num = 1  # the first data line is already stored
    for line in fr.readlines():
        vector = line.strip().split(',')
        labels = hstack((labels, array([vector[0]])))
        # vstack re-allocates and copies the whole matrix on every call
        return_mat = vstack((return_mat, vector[1:]))
        training_num += 1
        print(training_num)
    end_time = time.time()
    print(end_time - start_time)
    return return_mat, labels, training_num


Result:

1096.56099987 # about 18 minutes


Cause analysis:

After investigation, numpy.vstack() turns out to be the program's bottleneck. Every call copies all of return_mat plus the new vector into a freshly allocated matrix, so appending the i-th row costs a copy of i rows; the total work is on the order of 42000²/2 ≈ 8.8 × 10⁸ row copies, i.e. quadratic in the number of rows. You can watch the slowdown while the program runs: as return_mat grows, the printed training_num climbs by thousands at first, then by hundreds, then by dozens, then just a few at a time...
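A small benchmark makes the quadratic behaviour visible (an illustrative sketch with synthetic data and a helper I'm naming grow_by_vstack, not the original measurement):

import numpy as np
import time

def grow_by_vstack(n, width=784):
    mat = np.zeros((1, width))
    t0 = time.time()
    for _ in range(n):
        # each call copies the entire matrix built so far
        mat = np.vstack((mat, np.zeros(width)))
    return time.time() - t0

for n in (1000, 2000, 4000):
    print(n, grow_by_vstack(n))  # time roughly quadruples as n doubles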


Below is the source of vstack from GitHub. I could not find the Python source of _nx.concatenate; if any expert has located it, please let me know. (As far as I can tell, _nx is numpy.core.numeric, and concatenate itself is implemented in C inside NumPy's core, which is why there is no Python source to find.)

def vstack(tup):
    """
    Stack arrays in sequence vertically (row wise).

    Take a sequence of arrays and stack them vertically to make a single
    array. Rebuild arrays divided by `vsplit`.

    Parameters
    ----------
    tup : sequence of ndarrays
        Tuple containing arrays to be stacked. The arrays must have the same
        shape along all but the first axis.

    Returns
    -------
    stacked : ndarray
        The array formed by stacking the given arrays.

    See Also
    --------
    hstack : Stack arrays in sequence horizontally (column wise).
    dstack : Stack arrays in sequence depth wise (along third dimension).
    concatenate : Join a sequence of arrays together.
    vsplit : Split array into a list of multiple sub-arrays vertically.

    Notes
    -----
    Equivalent to ``np.concatenate(tup, axis=0)`` if `tup` contains arrays that
    are at least 2-dimensional.

    Examples
    --------
    >>> a = np.array([1, 2, 3])
    >>> b = np.array([2, 3, 4])
    >>> np.vstack((a, b))
    array([[1, 2, 3],
           [2, 3, 4]])

    >>> a = np.array([[1], [2], [3]])
    >>> b = np.array([[2], [3], [4]])
    >>> np.vstack((a, b))
    array([[1],
           [2],
           [3],
           [2],
           [3],
           [4]])

    """
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
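As the last line shows, vstack is just a thin wrapper over concatenate; a quick check (illustrative sketch):

import numpy as np

a, b = np.array([1, 2, 3]), np.array([2, 3, 4])
v = np.vstack((a, b))
c = np.concatenate([np.atleast_2d(a), np.atleast_2d(b)], axis=0)
print(np.array_equal(v, c))  # True: identical result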


Method 2:

Since growing the matrix one row at a time is too slow, we take a more brute-force approach: allocate return_mat at its full size in one go, then simply fill in the data.


Code:

from numpy import *
import time

DIM = 28

def img2matrix(filename):
    start_time = time.time()
    fr = open(filename)
    # count the data lines (minus the header) to size the matrix up front
    training_num = len(fr.readlines()) - 1
    return_mat = zeros((training_num, DIM * DIM))
    labels = array([], dtype='<U10')
    index = 0
    fr.seek(0, 0)
    # drop the header line
    fr.readline()
    for line in fr.readlines():
        vector = line.strip().split(',')
        labels = hstack((labels, array([vector[0]])))
        # write into the pre-allocated matrix instead of re-allocating it
        return_mat[index, :] = vector[1:]
        index += 1
    end_time = time.time()
    print(end_time - start_time)
    return return_mat, labels, training_num


Result:

7.63100004196 # about 7.6 seconds


Cause analysis:

As you can see, ndarray's "belly" is quite large: allocating a 42000×784 matrix in one shot poses no pressure at all. Just how big can an array get, and what happens when you exceed that limit? For details, see the StackOverflow post: Very large matrices using Python and NumPy.
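If the row count isn't known in advance, a common middle ground (a sketch I'm adding for comparison, with a hypothetical name img2matrix_list, not one of the measurements above) is to collect rows in a Python list, which grows cheaply, and convert to an ndarray once at the end:

import numpy as np

def img2matrix_list(filename):
    labels, rows = [], []
    with open(filename) as fr:
        fr.readline()  # drop the header line
        for line in fr:
            vector = line.strip().split(',')
            labels.append(vector[0])
            rows.append(vector[1:])
    # one allocation and one copy, instead of one per row
    return np.array(rows, dtype=np.float64), np.array(labels), len(rows)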



Test data: https://www.kaggle.com/c/digit-recognizer/data. You are welcome to join my Kaggle project team :)
