Blog post 1: pandas.read_csv -- reading a large file in chunks
http://blog.csdn.net/zm714981790/article/details/51375475
Today I ran into a problem reading a large CSV file: I tried Office first, then a plain pandas.read_csv in Python, and got a MemoryError. Searching the read_csv documentation, I found that the file can be read in chunks: read_csv has a chunksize parameter, and when a chunk size is specified the file is read chunk by chunk and an iterable TextFileReader object is returned. The IO Tools docs give an example:
In [138]: reader = pd.read_table('tmp.sv', sep='|', chunksize=4)

In [139]: reader
Out[139]: <pandas.io.parsers.TextFileReader at 0x120d2f290>

In [140]: for chunk in reader:
   .....:     print(chunk)
   .....:
   Unnamed: 0         0         1         2         3
0           0  0.469112 -0.282863 -1.509059 -1.135632
1           1  1.212112 -0.173215  0.119209 -1.044236
2           2 -0.861849 -2.104569 -0.494929  1.071804
3           3  0.721555 -0.706771 -1.039575  0.271860
   Unnamed: 0         0         1         2         3
0           4 -0.424972  0.567020  0.276232 -1.087401
1           5 -0.673690  0.113648 -1.478427  0.524988
2           6  0.404705  0.577046 -1.715002 -1.039268
3           7 -0.370647 -1.157892 -1.344312  0.844885
   Unnamed: 0         0         1         2         3
0           8  1.075770 -0.10905   1.643563 -1.469388
1           9  0.357021 -0.67460  -1.776904 -0.968914
Specifying iterator=True also returns an iterable TextFileReader object:
In [141]: reader = pd.read_table('tmp.sv', sep='|', iterator=True)

In [142]: reader.get_chunk(5)
Out[142]:
   Unnamed: 0         0         1         2         3
0           0  0.469112 -0.282863 -1.509059 -1.135632
1           1  1.212112 -0.173215  0.119209 -1.044236
2           2 -0.861849 -2.104569 -0.494929  1.071804
3           3  0.721555 -0.706771 -1.039575  0.271860
4           4 -0.424972  0.567020  0.276232 -1.087401
I need to open a dataset, a 3.7 GB CSV file that I know nothing about, so I start by reading the first 5 rows to look at the data types, the column labels, and so on:
chunks = pd.read_csv('train.csv', iterator=True)
chunk = chunks.get_chunk(5)
print(chunk)
"'
date_ Time site_name posa_continent user_location_country
0 2014-08-11 07:46:59 2 3
1 2014-08-11 08:22:12 2 3
2 2014-08-11 08:24:33 2 3
3 2014-08-09 18:05:16 2 3
4 2014-08-09 18:08:18 2 3 ""
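After this first look, the whole file can be processed chunk by chunk. The following is a minimal sketch (not from the original post) that reads the 3.7 GB train.csv with chunksize and accumulates simple statistics; the chunk size and the assumption that a site_name column exists (as in the preview above) are mine:

import pandas as pd

# Minimal sketch: count total rows and the distribution of site_name
# without ever loading the whole file into memory.
total_rows = 0
site_counts = None
for chunk in pd.read_csv('train.csv', chunksize=100000):
    total_rows += len(chunk.index)
    counts = chunk['site_name'].value_counts()
    site_counts = counts if site_counts is None else site_counts.add(counts, fill_value=0)

print(total_rows)
print(site_counts.sort_index())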
Blog post 2: Python techniques for handling large data: common data operations
http://blog.csdn.net/asdfg4381/article/details/51689344
When the data to be read runs into gigabytes, Python code can no longer be written as casually as for a quick validation script; the effect of each implementation choice on efficiency has to be considered. For example, counting the rows of a pandas object can be implemented in several ways whose efficiency differs greatly. A single call may look trivial, but once it runs millions of times the runtime is no longer negligible:
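The original post illustrated this point with a benchmark screenshot that is not reproduced here. The sketch below, with test data of an assumed size, shows one way to time a few common row-count idioms yourself:

import time
import numpy as np
import pandas as pd

# Assumed test data: a 1,000,000-row DataFrame of random numbers.
df = pd.DataFrame(np.random.randn(1000000, 5))

def bench(label, fn, repeat=10000):
    # time `repeat` calls of fn and print the total in seconds
    start = time.time()
    for _ in range(repeat):
        fn()
    print(label, time.time() - start)

bench("len(df.index)", lambda: len(df.index))
bench("df.shape[0]", lambda: df.shape[0])
bench("len(df)", lambda: len(df))
bench("df[0].count()", lambda: df[0].count())  # counts non-NA values in column 0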
The next few posts collect problems the author ran into while practicing on large-scale data. The techniques summarized here are based on pandas; where there are mistakes, corrections are welcome.

1. Reading and writing external CSV files

Reading a large CSV into memory. The idea: when the amount of data is very large, for example a bank's transaction log for one month, there may be tens of millions of records. On a computer of ordinary performance, reading all of it into memory, or into a special data structure, may be impossible. Given how the data is actually used, there is no need to pull all of it into memory at once; loading it into a database is of course the wiser approach. If a database is not used, the large file can be split into small chunks and read chunk by chunk, which reduces the pressure on memory and compute resources.

Things to note with open(file.csv) and pandas pd.read_csv(file.csv): 32-bit Python limits the available memory, so data that is too large raises a MemoryError; the fix is to install 64-bit Python. If installing the various Python packages is too much trouble, the 64-bit Anaconda2 distribution can be installed directly.

Usage:
chunker = pd.read_csv(path_load, chunksize=chunk_size)
Read only the required columns:
Columns = ("Date_time", "user_id")
Chunks_train = pd.read_csv (filename, usecols = columns, chunksize = 100000)
The chunker object points to a sequence of chunks but does not read any actual data up front; the data is only loaded as each chunk is extracted. Processing and cleaning the data chunk by chunk greatly reduces memory usage, but iterating over the rows inside each chunk is relatively time-consuming (a vectorized alternative is sketched after the loop below):
for rawPiece in chunker_rawData:
    current_chunk_size = len(rawPiece.index)  # rawPiece is a DataFrame
    for i in range(current_chunk_size):
        timeFlag = timeShape(rawPiece.ix[i])  # get the data of row i
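Row-by-row access as above is slow. A vectorized per-chunk alternative is sketched below; the source of chunker_rawData, the date_time column name, and the idea that counting rows per hour stands in for whatever timeShape extracted are all assumptions made for illustration:

import pandas as pd

# Sketch: operate on whole columns per chunk instead of looping over rows.
chunker_rawData = pd.read_csv('train.csv', usecols=['date_time'], chunksize=100000)

hour_counts = None
for rawPiece in chunker_rawData:
    hours = pd.to_datetime(rawPiece['date_time']).dt.hour  # whole column at once
    counts = hours.value_counts()
    hour_counts = counts if hour_counts is None else hour_counts.add(counts, fill_value=0)

print(hour_counts.sort_index())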
Saving data to disk. Writing directly to disk:
data.to_csv(path_save, index=False, mode='w')
Writing blocks out to disk with pandas I/O:
For the first chunk, keep the header and write with mode 'w': data.to_csv(path_save, index=False, mode='w')
For the remaining chunks, drop the header and write with mode 'a', i.e. keep the existing file and append to it: data.to_csv(path_save, index=False, header=False, mode='a')
A complete loop combining the two modes is sketched below.
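A minimal sketch of the whole read-process-write pattern; path_load, path_save and the process() function are hypothetical names standing in for the real cleaning step:

import pandas as pd

path_load = 'train.csv'
path_save = 'train_clean.csv'

def process(chunk):
    # placeholder for the real per-chunk cleaning / transformation
    return chunk.dropna()

first = True
for chunk in pd.read_csv(path_load, chunksize=100000):
    piece = process(chunk)
    if first:
        # first chunk: write the header, overwriting any existing file
        piece.to_csv(path_save, index=False, mode='w')
        first = False
    else:
        # later chunks: append without repeating the header
        piece.to_csv(path_save, index=False, header=False, mode='a')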
Writing small amounts of data: a small dataset can be written and read with pickle (cPickle is faster), which is very convenient. The write routine is below; a matching read sketch follows it.
Write:
import cPickle as pickle

def save_trainingSet(fileLoc, X, y):
    pack = [X, y]
    with open(fileLoc, 'w') as f:
        pickle.dump(pack, f)
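Read: the original post refers to a matching read routine; the following sketch (the name load_trainingSet is an assumption) loads the pair back:

import cPickle as pickle

def load_trainingSet(fileLoc):
    # read back the [X, y] pair written by save_trainingSet
    with open(fileLoc, 'r') as f:
        X, y = pickle.load(f)
    return X, y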