Data storage in Python: h5py recommended

Source: Internet
Author: User

Recently, in a CNN project, I had a folder of 200,000 images that needed to be read in and saved to a single data file (otherwise, reading 200,000 separate files every time is too much trouble).

After an afternoon of tinkering, I found a very useful package, h5py, which stores data in HDF5 files.

How good is this thing?

In speed, memory footprint, and compression it beats cPickle + gzip.

By comparison, that combination is simply not in the same league...

I put all the images into one ndarray and saved it as a file:

8,190 images as .mat: 16 GB. 81,900 images as .pkl.gz: could not be generated at all. 81,900 images as .h5: 15 GB.

Not only can it store large data, but the file is also about 10 times smaller per image than the .mat file!

You can see why I am so excited...

Now for the drawbacks of the alternatives to h5py:

1, numpy.save, numpy.savez, scipy.io.savemat

These are the storage methods provided by NumPy and SciPy. savez is often described as a compressed version of save, but in practice it compresses nothing: numpy.savez writes an uncompressed .npz archive, and the variant that actually compresses is numpy.savez_compressed.

All three of these methods produce files of the same size... very large.

8,000 images of 256*256*3 come to a 16 GB file, which is simply unbearable. Calling these methods is also cumbersome.
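To see how the NumPy writers compare on your own data, a minimal sketch along these lines can help. The array contents, sizes, and file names are made up for illustration, and how much compression helps depends heavily on the data (this fake image stack is mostly zeros, so it compresses unusually well).

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for an image stack: 64 images of 32x32x3, mostly zeros,
# so that compression has something to work with.
images = np.zeros((64, 32, 32, 3), dtype=np.uint8)
images[:, ::4, ::4, :] = 255

tmp_dir = tempfile.mkdtemp()
paths = {
    "save": os.path.join(tmp_dir, "images.npy"),
    "savez": os.path.join(tmp_dir, "images.npz"),
    "savez_compressed": os.path.join(tmp_dir, "images_compressed.npz"),
}

np.save(paths["save"], images)                                  # raw array, no compression
np.savez(paths["savez"], images=images)                         # zip archive, stored uncompressed
np.savez_compressed(paths["savez_compressed"], images=images)   # zip archive, compressed

sizes = {name: os.path.getsize(path) for name, path in paths.items()}
for name, size in sizes.items():
    print(f"{name:>16}: {size} bytes")

# Loading an .npz file returns a dict-like object keyed by the names used on save.
with np.load(paths["savez_compressed"]) as archive:
    restored = archive["images"]
```

On real photographs the gap between the compressed and uncompressed files will usually be much smaller than on this synthetic array.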

2, cPickle + gzip

Plain pickle is ignored here, since cPickle simply outclasses it.

.pkl.gz is the extension of the widely distributed MNIST data file, so it looks like it should be very useful.

However, two problems are hard to avoid in actual use:

      • Slow, with high memory consumption (poor performance)
      • Unable to store large matrices

I will not dwell on the former. The latter is a bug in Python itself: if you encounter "SystemError: error return without exception set" at cPickle.dump(), then congratulations, you have hit it.

The official Python explanation for the problem: http://bugs.python.org/issue11564

What, it's fixed? Not quite! It is fixed in Python 3, but the bug remains in 2.7, so if the Python bundled with your Linux distribution (Ubuntu, for example) is 2.7, you are out of luck.

Although cPickle + gzip already performs well, it still falls short of h5py; see this article:

http://www.shocksolution.com/2010/01/storing-large-numpy-arrays-on-disk-python-pickle-vs-hdf5adsf/
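For reference, the cPickle + gzip approach looks roughly like the sketch below when written for Python 3, where the C implementation is used automatically by the standard pickle module. The file name and array are illustrative, and protocol 4 is chosen here on the assumption that large objects may need to be stored. It is a sketch of the technique, not a benchmark.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np

# Illustrative data; a real image stack would be far larger.
data = np.arange(12, dtype=np.float32).reshape(3, 4)

path = os.path.join(tempfile.mkdtemp(), "train_set.pkl.gz")

# Write: pickle the array into a gzip-compressed file.
# Protocol 4 (Python 3.4+) adds support for very large objects.
with gzip.open(path, "wb") as f:
    pickle.dump(data, f, protocol=4)

# Read: the whole object is decompressed and rebuilt in memory at once,
# which is one reason this approach is memory-hungry for big arrays.
with gzip.open(path, "rb") as f:
    restored = pickle.load(f)
```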

3, h5py

Sorry, I could not find any shortcomings; the only drawback is that it is difficult to install. So here is an h5py installation tutorial.

h5py installation:

Official tutorial: http://docs.h5py.org/en/latest/build.html#install

A word of warning: the official tutorial is misleading. Where no package source is available it tells you to apt-get, and where you are given a prebuilt binary it tells you to run make. So here are the steps that actually worked for me:

1, Make sure the system has python, numpy, libhdf5-serial-dev, and HDF5. The first three are generally already available, so what remains is installing HDF5.

2, Go to the official HDF5 website and download the precompiled binaries (yes, although the tutorial tells you to compile, what the site gives users is already compiled, which cost a beginner like me half a day of trying to build it):

http://www.hdfgroup.org/HDF5/

3, Unzip, rename the folder to hdf5, and move it to /usr/local/hdf5

4, Add the environment variable:

export HDF5_DIR=/usr/local/hdf5

HDF5 is now installed; h5py can only be installed smoothly once HDF5 is in place.

5, pip install h5py
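Once the install finishes, a quick check like the following can confirm that h5py imports and show which HDF5 library it was built against. It relies on the version and hdf5_version attributes of the h5py.version module, which current h5py releases provide; very old releases may differ.

```python
import h5py

# Version of the h5py package itself.
h5py_version = h5py.version.version
# Version of the HDF5 C library that h5py was built against.
hdf5_version = h5py.version.hdf5_version

print("h5py:", h5py_version)
print("HDF5:", hdf5_version)
```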

Simple routines:

Write:

    import h5py
    import numpy as np

    data = np.array([222, 333, 444])
    label = np.array([0, 1, 0])
    img_num = np.array([0, 1, 2])

    # Create the HDF5 file
    file = h5py.File('TrainSet_rotate.h5', 'w')
    # Write the datasets
    file.create_dataset('train_set_x', data=data)
    file.create_dataset('train_set_y', data=label)
    file.create_dataset('train_set_num', data=img_num)
    # ...
    file.close()
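For an image set too large to hold in memory at once, one option is to create a resizable, compressed dataset and fill it batch by batch. The sketch below is one way to do that under stated assumptions: load_batch is a hypothetical stand-in that fabricates small images instead of reading files, and the file name, image size, and batch size are arbitrary.

```python
import os
import tempfile

import h5py
import numpy as np


def load_batch(start, stop):
    """Hypothetical loader: returns fake 8x8 RGB images instead of reading files."""
    return np.full((stop - start, 8, 8, 3), fill_value=start % 256, dtype=np.uint8)


path = os.path.join(tempfile.mkdtemp(), "TrainSet_batched.h5")
total_images, batch_size = 10, 4

with h5py.File(path, "w") as f:
    # maxshape=(None, ...) makes the first axis resizable; gzip compresses each chunk.
    dset = f.create_dataset(
        "train_set_x",
        shape=(0, 8, 8, 3),
        maxshape=(None, 8, 8, 3),
        dtype=np.uint8,
        chunks=(batch_size, 8, 8, 3),
        compression="gzip",
    )
    for start in range(0, total_images, batch_size):
        batch = load_batch(start, min(start + batch_size, total_images))
        # Grow the dataset, then write the new batch into the added rows.
        old_len = dset.shape[0]
        dset.resize(old_len + batch.shape[0], axis=0)
        dset[old_len:old_len + batch.shape[0]] = batch

with h5py.File(path, "r") as f:
    stored_shape = f["train_set_x"].shape
    last_image_value = int(f["train_set_x"][total_images - 1, 0, 0, 0])
```

Only one batch is ever held in memory, so the same loop scales to hundreds of thousands of images as long as each batch fits.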

Read:

    import numpy as np
    import h5py

    # Open the file in read mode
    file = h5py.File('TrainSet_rotate.h5', 'r')
    # Despite the trailing '[:]', the array comes out exactly as it went in;
    # it is not flattened (a leftover MATLAB worry)
    train_set_data = file['train_set_x'][:]
    train_set_y = file['train_set_y'][:]
    train_set_img_num = file['train_set_num'][:]
    # ...
    file.close()
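One practical advantage of HDF5 is that a dataset can be sliced on disk, so only the requested rows are read into memory. The sketch below writes a small file first so that it is self-contained; the file name and contents are placeholders, and the with-block is used so the file is closed even if an error occurs.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "TrainSet_demo.h5")

# Write a small demo file so the example is self-contained.
with h5py.File(path, "w") as f:
    f.create_dataset("train_set_x", data=np.arange(20).reshape(10, 2))

with h5py.File(path, "r") as f:
    dataset_names = list(f.keys())  # names of the datasets stored in the file
    dset = f["train_set_x"]         # a handle only; no array data is read yet
    shape = dset.shape              # shape comes from the file's metadata
    first_rows = dset[:3]           # reads only the first three rows
    everything = dset[:]            # reads the whole dataset, shape preserved
```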

Well, now you know how to use h5py, so go and enjoy the thrill of h5py!

Bonus tips: How to output on the same line

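1,

    import sys

    for i in range(10):
        # The repeated character was lost from the original listing; "." is assumed
        print("Loading" + "." * i)
        # Cursor up one line (the argument was also lost; the ANSI code "\033[F" is assumed)
        sys.stdout.write("\033[F")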

2,

    for x in range(0, 5):
        # The repeated character was lost from the original listing; "." is assumed
        b = "Loading" + "." * x
        print(b, end="\r")

The previous method will be useful.
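The same-line idea can also be wrapped in a small helper that rewrites a progress counter with a carriage return. This is only a sketch: show_progress and run are names invented here, and the output is captured in a buffer purely to show exactly which characters get written.

```python
import io
import sys
import time


def show_progress(done, total, stream=sys.stdout):
    """Redraw a one-line progress message in place using a carriage return."""
    stream.write("\rLoading {}/{}".format(done, total))
    stream.flush()  # make the update visible immediately


def run(total, stream=sys.stdout, delay=0.0):
    for i in range(1, total + 1):
        time.sleep(delay)  # stand-in for real work such as reading one image
        show_progress(i, total, stream)
    stream.write("\n")  # move to the next line when finished


# Capture the output in a buffer to show exactly what gets written.
buffer = io.StringIO()
run(3, stream=buffer)
captured = buffer.getvalue()
```

In a terminal, calling run(3, delay=0.5) without the stream argument shows the counter updating on a single line.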
