Recently, in a CNN project, I had 200,000 images in a folder that needed to be read in and saved to a single data file (otherwise re-reading 200,000 files every time is too much trouble).
After an afternoon of tinkering, I found a very useful package, h5py, which stores data in an HDF5 file.
How good is this thing?
Its speed, memory footprint, and compression are all better than cPickle + gzip.
By comparison, those two are a joke ...
I put all the pictures into one ndarray and saved it as a file:
8,190 pictures as .mat: 16 GB. 81,900 pictures as .pkl.gz: could not be generated at all. 81,900 pictures as .h5: 15 GB.
Not only can large data be saved, but the compression ratio is 10 times that of .mat!
See why I am so excited?
First, the drawbacks of the alternatives to h5py:
1, numpy.save, numpy.savez, scipy.io.savemat
These are the data storage methods provided by NumPy and SciPy. savez is often described as a compressed version of save, although in practice it compresses nothing (plain savez only bundles arrays into an uncompressed archive; compression is the job of numpy.savez_compressed).
All three of these methods also produce files of the same size: very large.
8,000 photos of 256*256*3 come to a 16 GB file, which is simply unbearable. Calling these methods is cumbersome too.
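To check the file sizes yourself, here is a minimal sketch that writes the same array with numpy.save, numpy.savez and numpy.savez_compressed and compares the results. The helper name saved_sizes, the file names and the array of zeros are assumptions for illustration only; real photos compress far less well than zeros.

```python
import os
import tempfile

import numpy as np


def saved_sizes(array):
    """Write `array` three ways and return each file's size in bytes."""
    sizes = {}
    with tempfile.TemporaryDirectory() as tmp:
        npy_path = os.path.join(tmp, "images.npy")
        np.save(npy_path, array)
        sizes["save"] = os.path.getsize(npy_path)

        npz_path = os.path.join(tmp, "images.npz")
        np.savez(npz_path, images=array)  # archive, not compressed
        sizes["savez"] = os.path.getsize(npz_path)

        npz_c_path = os.path.join(tmp, "images_compressed.npz")
        np.savez_compressed(npz_c_path, images=array)
        sizes["savez_compressed"] = os.path.getsize(npz_c_path)
    return sizes


# Hypothetical stand-in for a stack of 8 RGB images of 256*256 pixels.
images = np.zeros((8, 256, 256, 3), dtype=np.uint8)
sizes = saved_sizes(images)
for name, size in sizes.items():
    print(name, size)
```

On this toy input, save and savez both come out slightly larger than the raw array, and only savez_compressed shrinks it.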
2, cPickle + gzip
I will skip plain pickle here, since cPickle beats it outright.
.pkl.gz is the suffix of the official MNIST file, so it looks like it should be very useful.
However, there are two problems that are hard to avoid in actual use:
- Slow, with high memory consumption (bad performance)
- Unable to store large matrices
I will not go into the former. As for the latter, it is an official Python bug: if you hit "SystemError: error return without exception set" at cPickle.dump(), then congratulations, you have hit the jackpot.
The official Python explanation for the problem: http://bugs.python.org/issue11564
Hm? It's fixed? Not quite! It is fixed in Python 3, but 2.7 still has the bug, so if the Python that ships with your Linux or Ubuntu is 2.7, you are out of luck.
cPickle + gzip performance is decent enough, but for how it compares with h5py, see this article:
http://www.shocksolution.com/2010/01/storing-large-numpy-arrays-on-disk-python-pickle-vs-hdf5adsf/
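For reference, the .pkl.gz approach looks roughly like the sketch below. It is written for Python 3, where the built-in pickle module takes the place of cPickle; the helper names dump_pkl_gz and load_pkl_gz, the file name and the tiny array are assumptions for illustration.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np


def dump_pkl_gz(obj, path):
    """Pickle `obj` into a gzip-compressed file."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_pkl_gz(path):
    """Load an object back from a gzip-compressed pickle file."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


array = np.arange(12, dtype=np.int64).reshape(3, 4)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_set.pkl.gz")
    dump_pkl_gz(array, path)
    restored = load_pkl_gz(path)

print(np.array_equal(array, restored))
```

Note that the whole object is serialized and loaded in one go, which is part of why memory use grows with the size of the matrix.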
3, h5py
Sorry, I could not find any shortcomings; the only drawback is that it is hard to install. So here is an h5py installation tutorial.
h5py installation:
Official tutorial: http://docs.h5py.org/en/latest/build.html#install
A word of warning: the official tutorial is a trap. Where there is no source it tells you to apt-get, and where you are handed a binary it tells you to make. So here is the route that worked for me:
1, Make sure the system has python, numpy, libhdf5-serial-dev, and HDF5. The first three are usually already available, so what remains is installing HDF5.
2, Go to the official HDF5 website and download the precompiled binaries (yes, although the tutorial tells you to compile, what users actually get is a prebuilt binary, which had a novice like me trying to compile for half a day):
http://www.hdfgroup.org/HDF5/
3, Unzip, rename the folder to hdf5, and move it to /usr/local/hdf5
4, Add the environment variable:
export HDF5_DIR=/usr/local/hdf5
HDF5 is now installed; only with HDF5 in place will h5py install smoothly.
5, pip install h5py
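After these steps, a quick sanity check can confirm that the environment variable is visible and that Python can find h5py. This is only a sketch: the helper name check_h5py_setup is hypothetical, and it checks that the package can be located, not that it was built correctly.

```python
import importlib.util
import os


def check_h5py_setup(env=None):
    """Report whether HDF5_DIR is set and whether h5py can be found."""
    env = os.environ if env is None else env
    return {
        "HDF5_DIR": env.get("HDF5_DIR"),  # None when the variable is unset
        "h5py_found": importlib.util.find_spec("h5py") is not None,
    }


report = check_h5py_setup()
print(report)
```

If HDF5_DIR shows up as None, the export line was probably run in a different shell session.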
Simple routines:
Write:
    import h5py
    import numpy as np

    data = np.array([222, 333, 444])
    label = np.array([0, 1, 0])
    img_num = np.array([0, 1, 2])

    # Create the HDF5 file
    file = h5py.File('TrainSet_rotate.h5', 'w')
    # Write the datasets
    file.create_dataset('train_set_x', data=data)
    file.create_dataset('train_set_y', data=label)
    file.create_dataset('train_set_num', data=img_num)
    # ...
    file.close()
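Compression is opt-in when a dataset is created. The sketch below, under assumed names and data, writes a dataset with gzip compression through create_dataset and reads it back; the file name, the blank images and the compression level 4 are illustrative choices, not the settings behind the numbers quoted earlier.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in data: 16 blank RGB images of 64*64 pixels.
images = np.zeros((16, 64, 64, 3), dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "TrainSet_compressed.h5")
    # The with-block closes the file even if an error is raised.
    with h5py.File(path, "w") as f:
        f.create_dataset("train_set_x", data=images,
                         compression="gzip", compression_opts=4)
    file_size = os.path.getsize(path)

    with h5py.File(path, "r") as f:
        restored = f["train_set_x"][:]
        used_compression = f["train_set_x"].compression

print(file_size, used_compression)
```

Decompression happens transparently on read, so the reading code does not change.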
Read:
    import numpy as np
    import h5py

    # Open the file in read mode
    file = h5py.File('TrainSet_rotate.h5', 'r')
    # Despite the trailing '[:]', a matrix comes out exactly as it went in;
    # it is not flattened (a leftover MATLAB worry)
    train_set_data = file['train_set_x'][:]
    train_set_y = file['train_set_y'][:]
    train_set_img_num = file['train_set_num'][:]
    # ...
    file.close()
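The '[:]' copies a whole dataset into memory. With a large file it can help to slice instead, so only the requested rows are read from disk. A minimal sketch, assuming a small made-up dataset stored under the same 'train_set_x' name:

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "TrainSet_rotate.h5")
    # Build a tiny example file first (10 rows of 2 values each).
    with h5py.File(path, "w") as f:
        f.create_dataset("train_set_x", data=np.arange(20).reshape(10, 2))

    with h5py.File(path, "r") as f:
        names = list(f.keys())       # dataset names stored in the file
        dataset = f["train_set_x"]   # a handle; no data is loaded yet
        shape = dataset.shape
        first_rows = dataset[0:3]    # reads only rows 0, 1 and 2

print(names, shape)
print(first_rows)
```

The sliced result is an ordinary NumPy array and stays usable after the file is closed.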
Well, now you have used h5py; go and enjoy the thrill of it!
Bonus tip: how to keep printing on the same line
1,
    import sys

    for i in range(10):
        print("Loading" + "." * i)  # the repeated character is arbitrary
        sys.stdout.write("\033[F")  # Cursor up one line
2,
    for x in range(0, 5):
        b = "Loading" + "." * x
        print(b, end="\r")
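Both tricks can be wrapped in a small helper. The sketch below uses the carriage-return approach and flushes after every write so the line updates immediately; the function name show_progress and its parameters are hypothetical.

```python
import sys
import time


def show_progress(steps, delay=0.0, stream=None):
    """Redraw one 'Loading...' line `steps` times on the same terminal row."""
    stream = sys.stdout if stream is None else stream
    for i in range(steps):
        # "\r" moves the cursor back to the start of the current line.
        stream.write("\rLoading" + "." * i)
        stream.flush()
        time.sleep(delay)
    stream.write("\n")  # finish by moving to a fresh line


show_progress(5)
```

Passing a delay such as 0.2 makes the animation visible in a real terminal.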
Of the two, the first method is the handier one.
Data storage in Python: h5py recommended