How to deal with big data in pandas?


Recently I have been working mostly with Hive SQL, and occasionally run into problems that are not easy to solve in SQL, so I download the data to a file and process it with pandas. Because the data volumes are large, I have picked up some relevant experience that I would like to share, hoping it helps you learn pandas.

Read and write large text data

Sometimes we get very large text files. Reading the whole file into memory is slow, may not fit into memory at all, or fits but leaves no room for further computation. In that case, if the operation we want to perform is not very complex, we can use the chunksize or iterator parameter of read_csv to read the file in parts, process each part, and then append each partial result to the output file with to_csv(mode='a').

import pandas as pd

reader = pd.read_csv('input.csv', chunksize=1000000)

for chunk in reader:
    chunk = dosomething(chunk)  # take some action on this chunk
    chunk.to_csv('output.csv', mode='a', header=False)  # remember header=False, or the column names will be written repeatedly

reader = pd.read_csv('input.csv', iterator=True)

while True:
    try:
        chunk = reader.get_chunk(1000000)
        chunk.to_csv('output.csv', mode='a', header=False)  # same as the code above, done via the iterator
    except StopIteration:
        break

Choosing between to_csv and to_excel

When writing out results we all face the choice of output format. Usually we use .csv, .xls, or .xlsx; the latter two are the Excel 2003 and Excel 2007 formats. My experience is csv > xls > xlsx: for large files, writing CSV is much faster than writing Excel. xls only supports 60,000+ rows, and although xlsx supports more rows, Chinese content is sometimes lost in strange ways. Therefore, for small amounts of data you can choose xls; for large amounts of data I recommend outputting CSV. xlsx is both limited in row count and, with a large amount of data, so slow it will make you think Python has hung.
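As a quick illustration, here is a minimal sketch of the three options, assuming df is the result DataFrame and the file names are just placeholders; to_excel also needs an Excel writer engine installed (for example xlwt for .xls, or openpyxl for .xlsx).

df.to_csv('result.csv', index=False, encoding='utf-8')  # fastest option for large data
df.to_excel('result.xls', index=False)                  # Excel 2003 format, limited to ~60,000+ rows
df.to_excel('result.xlsx', index=False)                 # Excel 2007 format, slow for large data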

Parsing date columns at read time

I used to read the data first and then convert the date column with the to_datetime function. If the amount of data is large, this wastes time. In fact, when reading the data you can specify the date columns to parse directly via the parse_dates parameter. It accepts several kinds of values: True parses the index as dates, and a list of column names parses each of those columns into date format.
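For example, a minimal sketch, assuming the file has a date column named time (the column name is just an assumption matching the later examples):

import pandas as pd

# parse the 'time' column as dates while reading, instead of calling to_datetime afterwards
reg_data = pd.read_csv('input.csv', parse_dates=['time'])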

A few more words about the to_datetime function. The time fields we get often contain messy, malformed values, and by default to_datetime raises an error when it meets them. In fact these values can simply be ignored: just set the errors parameter to 'ignore'.
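A minimal sketch, with datetime_col standing for whatever raw column you have (the name is assumed):

datetime_col = pd.to_datetime(datetime_col, errors='ignore')  # malformed values no longer raise an error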

In addition, as the name suggests, to_datetime returns timestamps. Sometimes we only need the date part, which we can get by transforming the date column with datetime_col = datetime_col.apply(lambda x: x.date()), or equivalently with map: datetime_col = datetime_col.map(lambda x: x.date()).

Converting numeric codes to text

Since map was just mentioned, here is a little trick. The data we get is often numerically encoded, for example a gender column where 0 represents male and 1 represents female. Of course, we can convert it with indexing.

data['gender'].ix[data['gender'] == 0] = u'male'
data['gender'].ix[data['gender'] == 1] = u'female'
# Note: the two lines above change the values selected by the index;
# the two lines below do NOT modify the original values (chained indexing assigns to a copy)
data.ix[data['gender'] == 0]['gender'] = u'male'
data.ix[data['gender'] == 1]['gender'] = u'female'

In fact there is an easier way: pass a dict to map, which achieves the same effect.

data['gender'] = data['gender'].map({0: u'male', 1: u'female'})

Using the shift function to find the time difference between a user's consecutive login records

A previous project needed to calculate the time difference between each user's two consecutive login records. The requirement looks simple, but with a large amount of data it is not a trivial task. Broken down, it takes two steps: first group the login data by user, then compute the interval between every two consecutive logins for each user. The data format is simple, as shown below.

uid  time
111  2016-05-01
112  2016-05-02
111  2016-05-03
113  2016-05-04
113  2016-05-05
112  2016-05-06
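(Not from the original article: a minimal sketch that builds this sample data as a DataFrame, in case you want to run the snippets below.)

import pandas as pd

reg_data = pd.DataFrame({
    'uid': [111, 112, 111, 113, 113, 112],
    'time': pd.to_datetime(['2016-05-01', '2016-05-02', '2016-05-03',
                            '2016-05-04', '2016-05-05', '2016-05-06'])
})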

If the amount of data is small, you can first get the unique uids and then compute each user's login intervals one by one, like this:

reg_data = reg_data.sort_values(['uid', 'time'])  # first sort by uid and time
uids = reg_data['uid'].unique()  # get all the uids
for u in uids:
    data = []
    uid_reg_data = reg_data.ix[reg_data['uid'] == u]
    pre = None
    for i, row in uid_reg_data.iterrows():
        if pre is None:
            pre = row['time']
            continue
        row['days'] = (row['time'] - pre).days
        data.append(row)
        pre = row['time']
    reg_data_f = pd.DataFrame(data)
    reg_data_f.to_csv('output.csv', mode='a', header=False)

Although the computational logic of this method is clear and easy to understand, its drawback is obvious: the amount of computation is huge, with one pass through the loop for every record.

So why is pandas's shift function suitable for this calculation? Take a look at what shift does.

col1
aaaa
bbbb
cccc
dddd

Assuming we have the data above, if we use col1.shift(1), we get the following result:

col1
NaN
aaaa
bbbb
cccc
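In code, a minimal sketch of the same illustration (the column name col1 is just the example above):

import pandas as pd

df = pd.DataFrame({'col1': ['aaaa', 'bbbb', 'cccc', 'dddd']})
print(df['col1'].shift(1))  # NaN, aaaa, bbbb, cccc -- each value moved down one row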

It simply shifts the values down by one position, which is exactly what we need. Let's use the shift function to rewrite the code above.

reg_data = reg_data.sort_values(['uid', 'time'])  # first sort by uid and time
uids = reg_data['uid'].unique()  # get all the uids
for u in uids:
    uid_reg_data = reg_data.ix[reg_data['uid'] == u]
    uid_reg_data['pre'] = uid_reg_data['time'].shift(1)
    uid_reg_data['days'] = (uid_reg_data['time'] - uid_reg_data['pre']).map(lambda x: x.days)
    uid_reg_data.ix[~uid_reg_data['pre'].isnull()].to_csv('output.csv', mode='a', header=False)

This reduces the amount of computation by several orders of magnitude. However, for my actual application scenario it was still far from enough: the login log was on the order of a billion rows, with the number of users in the tens of millions. Is there a simpler way? Yes, there is a little trick. First, the code.

reg_data = reg_data.sort_values(['uid', 'time'])  # first sort by uid and time
reg_data['pre'] = reg_data['time'].shift(1)
reg_data['uid0'] = reg_data['uid'].shift(1)
reg_data['days'] = (reg_data['time'] - reg_data['pre']).map(lambda x: x.days)
reg_data_f = reg_data.ix[reg_data['uid'] == reg_data['uid0']]

The code above takes advantage of pandas's vectorized computation and bypasses the per-uid loop, which is the most time-consuming part of the calculation. If there were only one uid, sorting and calling shift(1) would directly give the previous login time for every record; but the real login data contains many different uids, so we also shift the uid column into a new column named uid0 and keep only the records where uid matches uid0.
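As a sanity check (not part of the original article), running this on the small sample data above keeps one row per consecutive login pair; the relevant columns would look like this:

uid  time        pre         days
111  2016-05-03  2016-05-01  2
112  2016-05-06  2016-05-02  4
113  2016-05-05  2016-05-04  1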

Source: Cloga's Internet Notes

