How Python handles large data (knowledge collation)

Source: Internet
Author: User

Post 1: pandas.read_csv -- reading a large file in chunks

http://blog.csdn.net/zm714981790/article/details/51375475

Today, while reading a large CSV file, I ran into a problem: opening it first with Office and then with a plain pandas.read_csv in Python both ended in a MemoryError. Finally, searching the read_csv documentation, I found that the file can be read in chunks. read_csv has a chunksize parameter: by specifying a chunk size it reads the file piece by piece and returns an iterable TextFileReader object. The pandas IO Tools documentation gives the following example:

In [138]: reader = pd.read_table('tmp.sv', sep='|', chunksize=4)

In [139]: reader
Out[139]: <pandas.io.parsers.TextFileReader at 0x120d2f290>

In [140]: for chunk in reader:
   .....:     print(chunk)
   .....:
   Unnamed: 0         0         1         2         3
0           0  0.469112 -0.282863 -1.509059 -1.135632
1           1  1.212112 -0.173215  0.119209 -1.044236
2           2 -0.861849 -2.104569 -0.494929  1.071804
3           3  0.721555 -0.706771 -1.039575  0.271860
   Unnamed: 0         0         1         2         3
0           4 -0.424972  0.567020  0.276232 -1.087401
1           5 -0.673690  0.113648 -1.478427  0.524988
2           6  0.404705  0.577046 -1.715002 -1.039268
3           7 -0.370647 -1.157892 -1.344312  0.844885
   Unnamed: 0         0         1         2         3
0           8  1.075770 -0.109050  1.643563 -1.469388
1           9  0.357021 -0.674600 -1.776904 -0.968914
Specifying iterator=True also returns an iterable TextFileReader object:
In [141]: reader = pd.read_table('tmp.sv', sep='|', iterator=True)

In [142]: reader.get_chunk(5)
Out[142]:
   Unnamed: 0         0         1         2         3
0           0  0.469112 -0.282863 -1.509059 -1.135632
1           1  1.212112 -0.173215  0.119209 -1.044236
2           2 -0.861849 -2.104569 -0.494929  1.071804
3           3  0.721555 -0.706771 -1.039575  0.271860
4           4 -0.424972  0.567020  0.276232 -1.087401
I need to open a dataset that is a 3.7 GB CSV file, and I know nothing about the data, so I start by reading the first 5 rows to look at the data types, column labels, and so on:
import pandas as pd

chunks = pd.read_csv('train.csv', iterator=True)
chunk = chunks.get_chunk(5)
print(chunk)
"'
             date_ Time  site_name  posa_continent  user_location_country  
0  2014-08-11 07:46:59          2               3   
1  2014-08-11 08:22:12          2               3   
2  2014-08-11 08:24:33          2               3   
3  2014-08-09 18:05:16          2               3   
4  2014-08-09 18:08:18          2               3 ""   
Post 2: Python techniques for handling big data: common data operations

http://blog.csdn.net/asdfg4381/article/details/51689344

When reading gigabytes of data, Python code cannot be written as casually as for a quick validation script; you have to consider how each implementation choice affects efficiency. For example, the row count of a pandas object can be computed in several different ways, and their efficiency varies greatly. The time of a single call may seem trivial, but once the operation runs millions of times the total runtime is anything but negligible:
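The timing comparison referred to above is missing from this excerpt. As a rough substitute, here is a hypothetical sketch (Python 3, assuming numpy and pandas are installed; not from the original post) that times a few common ways of counting rows. df[0].count() scans the column values and is typically much slower than len(df.index) or df.shape[0]:

import timeit

import numpy as np
import pandas as pd

# Build a test DataFrame with one million rows.
df = pd.DataFrame(np.random.randn(1000000, 4))

# Time several equivalent row-count expressions; absolute numbers depend on your machine.
for stmt in ("len(df)", "len(df.index)", "df.shape[0]", "df[0].count()"):
    elapsed = timeit.timeit(stmt, globals=globals(), number=1000)
    print("%-15s %.4f s" % (stmt, elapsed))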

So the next few posts will collect the problems I ran into while working with large-scale data in practice. The techniques summarized here are based on pandas, and there may be mistakes.

1. Reading and writing external CSV files

Reading a large CSV into memory

Approach: when the amount of data is very large, for example a bank's transaction log for one month, there may be tens of millions of records. On a computer with ordinary specifications, reading all of it into memory, even into a specialized data structure, can be very difficult. Given how the data is actually used, we usually do not need to pull all of it into memory at once. Loading it into a database is of course the wiser approach; if you do not want to use a database, you can split the large file into small chunks and read it chunk by chunk, which reduces the demand on memory and computing resources.

Considerations: whether you use open(file.csv) or the pandas call pd.read_csv(file.csv), a 32-bit Python build limits the available memory, and data that is too large raises a MemoryError. The fix is to install 64-bit Python; if installing the various packages one by one is a hassle, you can install the 64-bit Anaconda2 distribution directly.

Usage:

    chunker = pd.read_csv(path_load, chunksize=chunk_size)

Read only the required columns:

    columns = ["date_time", "user_id"]
    chunks_train = pd.read_csv(filename, usecols=columns, chunksize=100000)
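If the selected columns are small enough to fit in memory, the chunks can then be concatenated back into a single DataFrame. A minimal sketch, assuming the train.csv file and column names from the earlier example:

import pandas as pd

# Illustrative only: read just two columns in 100,000-row chunks,
# then stitch the pieces back together with pd.concat.
columns = ["date_time", "user_id"]
chunks_train = pd.read_csv("train.csv", usecols=columns, chunksize=100000)
train = pd.concat(chunks_train, ignore_index=True)
print(train.shape)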

The chunker object refers to a sequence of chunks, but it does not read any actual data up front; a chunk is only loaded when it is requested. Data processing and cleaning are therefore usually done chunk by chunk, which greatly reduces memory usage, but iterating over the rows inside each chunk can be far more time-consuming:

    for raw_piece in chunker_rawdata:
        current_chunk_size = len(raw_piece.index)      # raw_piece is a DataFrame
        for i in range(current_chunk_size):
            time_flag = timeshape(raw_piece.iloc[i])   # get the data of the i-th row
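Where the per-row work can be expressed as a column operation, it is usually much faster to process each chunk in a single vectorized call. A hypothetical sketch, not from the original code; the file and column names are assumptions carried over from the earlier example:

import pandas as pd

# Instead of calling a helper function on every row, parse the whole
# date_time column of each chunk in one vectorized call.
chunker_rawdata = pd.read_csv("train.csv", usecols=["date_time"], chunksize=100000)
for raw_piece in chunker_rawdata:
    time_flags = pd.to_datetime(raw_piece["date_time"])  # one call per chunk, no Python-level loop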
Saving data to disk. To write directly to disk:
    data.to_csv(path_save, index=False, mode='w')
To write out in chunks (chunked storage I/O with pandas):

For the first chunk, keep the header information and write with mode 'w':

    data.to_csv(path_save, index=False, mode='w')

For subsequent chunks, drop the header and write with mode 'a' (append), so the file already on disk is extended rather than overwritten:

    data.to_csv(path_save, index=False, header=False, mode='a')
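Putting the chunked read and the two write modes together, a typical read-process-write loop might look like the following sketch (the file names and the processing step are placeholders, not from the original post):

import pandas as pd

# Stream a large CSV through memory in 100,000-row chunks and append each
# processed chunk to an output file, writing the header only once.
reader = pd.read_csv("train.csv", chunksize=100000)
for i, chunk in enumerate(reader):
    processed = chunk          # placeholder for real cleaning / filtering
    if i == 0:
        processed.to_csv("train_clean.csv", index=False, mode="w")
    else:
        processed.to_csv("train_clean.csv", index=False, header=False, mode="a")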

Small amounts of data can be written out and read back with pickle (cPickle is faster), which is very convenient. Writing and reading are shown below.

Write:

import cPickle as pickle   # Python 2; on Python 3, use the built-in pickle module

def save_trainingset(fileloc, X, y):
    pack = [X, y]
    with open(fileloc, 'wb') as f:   # binary mode is the safe choice for pickled data
        pickle.dump(pack, f)
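The read side is not included in the excerpt above; a minimal counterpart, reusing the pickle import and the [X, y] file layout from the write routine (the function name is only illustrative):

def load_trainingset(fileloc):
    # Read back the [X, y] pair written by save_trainingset above.
    with open(fileloc, 'rb') as f:
        X, y = pickle.load(f)
    return X, y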
