Blog post 1: pandas.read_csv -- reading a large file in chunks
http://blog.csdn.net/zm714981790/article/details/51375475
Today I ran into a problem reading a large CSV file: I tried Office first, then a plain pandas.read_csv in Python, and got a MemoryError. Searching the read_csv documentation, I found that the file can be read in chunks: read_csv has a chunksize parameter, and when a chunk size is specified the file is read chunk by chunk and an iterable TextFileReader object is returned. The IO Tools docs give an example:
In [138]: reader = pd.read_table('tmp.sv', sep='|', chunksize=4)

In [139]: reader
Out[139]: <pandas.io.parsers.TextFileReader at 0x120d2f290>

In [140]: for chunk in reader:
   .....:     print(chunk)
   .....:
   Unnamed: 0         0         1         2         3
0           0  0.469112 -0.282863 -1.509059 -1.135632
1           1  1.212112 -0.173215  0.119209 -1.044236
2           2 -0.861849 -2.104569 -0.494929  1.071804
3           3  0.721555 -0.706771 -1.039575  0.271860
   Unnamed: 0         0         1         2         3
0           4 -0.424972  0.567020  0.276232 -1.087401
1           5 -0.673690  0.113648 -1.478427  0.524988
2           6  0.404705  0.577046 -1.715002 -1.039268
3           7 -0.370647 -1.157892 -1.344312  0.844885
   Unnamed: 0         0         1         2         3
0           8  1.075770 -0.10905   1.643563 -1.469388
1           9  0.357021 -0.67460  -1.776904 -0.968914
Specifying iterator=True also returns an iterable TextFileReader object:
In [141]: reader = pd.read_table('tmp.sv', sep='|', iterator=True)

In [142]: reader.get_chunk(5)
Out[142]:
   Unnamed: 0         0         1         2         3
0           0  0.469112 -0.282863 -1.509059 -1.135632
1           1  1.212112 -0.173215  0.119209 -1.044236
2           2 -0.861849 -2.104569 -0.494929  1.071804
3           3  0.721555 -0.706771 -1.039575  0.271860
4           4 -0.424972  0.567020  0.276232 -1.087401
I need to open a dataset, a 3.7 GB CSV file that I know nothing about, so I start by reading the first 5 rows to look at the data types, the column labels, and so on:
chunks = pd.read_csv('train.csv', iterator=True)
chunk = chunks.get_chunk(5)
print(chunk)
"'
date_ Time site_name posa_continent user_location_country
0 2014-08-11 07:46:59 2 3
1 2014-08-11 08:22:12 2 3
2 2014-08-11 08:24:33 2 3
3 2014-08-09 18:05:16 2 3
4 2014-08-09 18:08:18 2 3 ""
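After this first look, the whole file can be processed chunk by chunk. The following is a minimal sketch (not from the original post) that reads the 3.7 GB train.csv with chunksize and accumulates simple statistics; the chunk size and the assumption that a site_name column exists (as in the preview above) are mine:

import pandas as pd

# Minimal sketch: count total rows and the distribution of site_name
# without ever loading the whole file into memory.
total_rows = 0
site_counts = None
for chunk in pd.read_csv('train.csv', chunksize=100000):
    total_rows += len(chunk.index)
    counts = chunk['site_name'].value_counts()
    site_counts = counts if site_counts is None else site_counts.add(counts, fill_value=0)

print(total_rows)
print(site_counts.sort_index())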
Blog post 2: Python techniques for handling large data: common data operations
http://blog.csdn.net/asdfg4381/article/details/51689344
When the data to be read runs into gigabytes, Python code can no longer be written as casually as for a quick validation script; the effect of each implementation choice on efficiency has to be considered. For example, counting the rows of a pandas object can be implemented in several ways whose efficiency differs greatly. A single call may look trivial, but once it runs millions of times the runtime is no longer negligible:
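The original post illustrated this point with a benchmark screenshot that is not reproduced here. The sketch below, with test data of an assumed size, shows one way to time a few common row-count idioms yourself:

import time
import numpy as np
import pandas as pd

# Assumed test data: a 1,000,000-row DataFrame of random numbers.
df = pd.DataFrame(np.random.randn(1000000, 5))

def bench(label, fn, repeat=10000):
    # time `repeat` calls of fn and print the total in seconds
    start = time.time()
    for _ in range(repeat):
        fn()
    print(label, time.time() - start)

bench("len(df.index)", lambda: len(df.index))
bench("df.shape[0]", lambda: df.shape[0])
bench("len(df)", lambda: len(df))
bench("df[0].count()", lambda: df[0].count())  # counts non-NA values in column 0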
The next few posts collect problems the author ran into while practicing on large-scale data. The techniques summarized here are based on pandas; where there are mistakes, corrections are welcome.

1. Reading and writing external CSV files

Reading a large CSV into memory. The idea: when the amount of data is very large, for example a bank's transaction log for one month, there may be tens of millions of records. On a computer of ordinary performance, reading all of it into memory, or into a special data structure, may be impossible. Given how the data is actually used, there is no need to pull all of it into memory at once; loading it into a database is of course the wiser approach. If a database is not used, the large file can be split into small chunks and read chunk by chunk, which reduces the pressure on memory and compute resources.

Things to note with open(file.csv) and pandas pd.read_csv(file.csv): 32-bit Python limits the available memory, so data that is too large raises a MemoryError; the fix is to install 64-bit Python. If installing the various Python packages is too much trouble, the 64-bit Anaconda2 distribution can be installed directly.

Usage:
chunker = pd.read_csv(path_load, chunksize=chunk_size)
Read only the required columns:
Columns = ("Date_time", "user_id")
Chunks_train = pd.read_csv (filename, usecols = columns, chunksize = 100000)
The chunker object points to a sequence of chunks but does not read any actual data up front; the data is only loaded as each chunk is extracted. Processing and cleaning the data chunk by chunk greatly reduces memory usage, but iterating over the rows inside each chunk is relatively time-consuming (a vectorized alternative is sketched after the loop below):
for rawPiece in chunker_rawData:
    current_chunk_size = len(rawPiece.index)  # rawPiece is a DataFrame
    for i in range(current_chunk_size):
        timeFlag = timeShape(rawPiece.ix[i])  # get the data of row i
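Row-by-row access as above is slow. A vectorized per-chunk alternative is sketched below; the source of chunker_rawData, the date_time column name, and the idea that counting rows per hour stands in for whatever timeShape extracted are all assumptions made for illustration:

import pandas as pd

# Sketch: operate on whole columns per chunk instead of looping over rows.
chunker_rawData = pd.read_csv('train.csv', usecols=['date_time'], chunksize=100000)

hour_counts = None
for rawPiece in chunker_rawData:
    hours = pd.to_datetime(rawPiece['date_time']).dt.hour  # whole column at once
    counts = hours.value_counts()
    hour_counts = counts if hour_counts is None else hour_counts.add(counts, fill_value=0)

print(hour_counts.sort_index())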
Saving data to disk. Writing directly to disk:
data.to_csv(path_save, index=False, mode='w')
Writing blocks out to disk with pandas I/O:
For the first chunk, keep the header and write with mode 'w': data.to_csv(path_save, index=False, mode='w')
For the remaining chunks, drop the header and write with mode 'a', i.e. keep the existing file and append to it: data.to_csv(path_save, index=False, header=False, mode='a')
A complete loop combining the two modes is sketched below.
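A minimal sketch of the whole read-process-write pattern; path_load, path_save and the process() function are hypothetical names standing in for the real cleaning step:

import pandas as pd

path_load = 'train.csv'
path_save = 'train_clean.csv'

def process(chunk):
    # placeholder for the real per-chunk cleaning / transformation
    return chunk.dropna()

first = True
for chunk in pd.read_csv(path_load, chunksize=100000):
    piece = process(chunk)
    if first:
        # first chunk: write the header, overwriting any existing file
        piece.to_csv(path_save, index=False, mode='w')
        first = False
    else:
        # later chunks: append without repeating the header
        piece.to_csv(path_save, index=False, header=False, mode='a')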
Writing small amounts of data: a small dataset can be written and read with pickle (cPickle is faster), which is very convenient. The write routine is below; a matching read sketch follows it.
Write:
import cPickle as pickle

def save_trainingSet(fileLoc, X, y):
    pack = [X, y]
    with open(fileLoc, 'w') as f:
        pickle.dump(pack, f)
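Read: the original post refers to a matching read routine; the following sketch (the name load_trainingSet is an assumption) loads the pair back:

import cPickle as pickle

def load_trainingSet(fileLoc):
    # read back the [X, y] pair written by save_trainingSet
    with open(fileLoc, 'r') as f:
        X, y = pickle.load(f)
    return X, y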