I've recently been working a lot with Hive SQL, and occasionally run into problems that are hard to solve in SQL alone, so I download the data to a file and process it with pandas. Because the data volumes are large, I've picked up some relevant experience that I'd like to share, and I hope it helps you learn pandas.
Read and write large text data
Sometimes we get very large text files. Reading the whole file into memory at once is slow, may not fit in memory at all, or fits but leaves no room for further computation. If the processing we need is not very complex, we can use the chunksize or iterator parameter of read_csv to read the file in parts, process each part, and then append the results to the output file with to_csv(mode='a').
import pandas as pd

reader = pd.read_csv('input.csv', chunksize=1000000)
for chunk in reader:
    result = do_something(chunk)  # take some action on each chunk (do_something is a placeholder for your processing)
    result.to_csv('output.csv', mode='a', header=False)  # remember header=False, or the column names get written repeatedly
reader = pd.read_csv('input.csv', iterator=True)
while True:
    try:
        chunk = reader.get_chunk(1000000)
        chunk.to_csv('output.csv', mode='a', header=False)  # same as the code above, but driven by get_chunk
    except StopIteration:
        break
Choosing between to_csv and to_excel
When writing out results we all face the question of output format. The ones we use most are .csv, .xls and .xlsx; the latter two are Excel formats, one for Excel 2003 and one for Excel 2007. My experience is csv > xls > xlsx: writing a large file to CSV is much faster than writing to Excel; .xls only supports 60,000-odd rows; and although .xlsx supports many more rows, Chinese content in it sometimes gets strangely lost. So if the record count is small you can choose xls, while for a large amount of data I recommend writing CSV: xlsx both limits the number of rows and, with a lot of data, will make you feel that Python has hung.
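For reference, a minimal sketch of the two output paths (the file names and data are placeholders, and to_excel needs an Excel writer engine such as openpyxl installed):

import pandas as pd

df = pd.DataFrame({'a': range(5)})       # placeholder data
df.to_csv('output.csv', index=False)     # fast, no practical row limit
df.to_excel('output.xlsx', index=False)  # slower; requires an Excel writer engine such as openpyxl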
Processing date columns at read time
I used to read the data in first and then process the date column with the to_datetime function. With a large amount of data this wastes time; in fact you can have the date columns parsed directly while reading, using the parse_dates parameter of read_csv. It accepts several kinds of values: True parses the index as dates, and a list of column names parses each listed column into date format.
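A minimal sketch of the parse_dates parameter, assuming the file has a date-like column named 'time' and a first column usable as a date index:

import pandas as pd

# a list of column names: parse each listed column into datetimes while reading
data = pd.read_csv('input.csv', parse_dates=['time'])

# True: parse the index column into datetimes instead
data = pd.read_csv('input.csv', index_col=0, parse_dates=True)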
A few more words about to_datetime: the time fields we get often contain messy, strange values, and by default to_datetime raises an error when it meets them. These values can in fact be ignored; you only need to set the errors parameter to 'ignore'.
Also, as the name suggests, to_datetime returns timestamps. Sometimes we only need the date part, which we can get by transforming the date column with datetime_col = datetime_col.apply(lambda x: x.date()); using map works the same way: datetime_col = datetime_col.map(lambda x: x.date()).
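Putting the two points above together, a small sketch assuming the raw column is named 'time' (errors='ignore' is what the text above describes; note that with it an unparseable column comes back unchanged, and newer pandas versions deprecate it in favour of errors='coerce'):

import pandas as pd

# keep unparseable values instead of raising an error
data['time'] = pd.to_datetime(data['time'], errors='ignore')

# keep only the date part of each timestamp (assumes the parse succeeded)
data['date'] = data['time'].apply(lambda x: x.date())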
Convert some numerical codes to text
Since the map method has come up, here is a little trick. The data we get is often numerically encoded; for example, a gender column where 0 stands for male and 1 for female. Of course, we can convert it by indexing, like this:
data['gender'][data['gender'] == 0] = u'male'
data['gender'][data['gender'] == 1] = u'female'
# Note: the form above selects the column first and then the rows, so the assignment changes the original data;
# the form below (rows first, then the column) does NOT modify the original values
data[data['gender'] == 0]['gender'] = u'male'
data[data['gender'] == 1]['gender'] = u'female'
In fact there is an easier way: pass the mapping in as a dict to map, and it achieves the same effect.
data['gender'] = data['gender'].map({0: u'male', 1: u'female'})
Using the shift function to find the time difference between a user's two consecutive login records
A previous project needed the time difference between each user's two consecutive login records. The requirement looks simple, but with a large amount of data it is not a trivial task. Broken down, it takes two steps: first group the login data by user, then compute the interval between every two consecutive logins for each user. The data format is simple, as shown below.
uid    time
111    2016-05-01
112    2016-05-02
111    2016-05-03
113    2016-05-04
113    2016-05-05
112    2016-05-06
If the amount of data is small, you can take the unique uids first and then compute each user's login intervals one at a time, like this:
reg_data = reg_data.sort_values(['uid', 'time'])  # sort by uid and time first
uids = reg_data['uid'].unique()                   # get all the uids
for u in uids:
    data = []
    uid_reg_data = reg_data[reg_data['uid'] == u]
    pre = None
    for i, row in uid_reg_data.iterrows():
        if pre is None:
            pre = row['time']
            continue
        row['days'] = (row['time'] - pre).days
        data.append(row)
        pre = row['time']
    reg_data_f = pd.DataFrame(data)
    reg_data_f.to_csv('output.csv', mode='a', header=False)
Although the logic of this method is clear and easy to understand, its drawback is just as obvious: the amount of computation is huge, since the calculation runs as many times as there are records.
So why is pandas's shift function suitable for this computation? First take a look at what shift does.
col1
aaaa
bbbb
cccc
dddd
Assuming we have the data above, calling col1.shift(1) gives the following result:
col1
NaN
aaaa
bbbb
cccc
It simply shifts the values down by one position, which is exactly what we need. Let's use shift to rewrite the code above.
reg_data = reg_data.sort_values(['uid', 'time'])  # sort by uid and time first
uids = reg_data['uid'].unique()                   # get all the uids
for u in uids:
    uid_reg_data = reg_data[reg_data['uid'] == u].copy()
    uid_reg_data['pre'] = uid_reg_data['time'].shift(1)
    uid_reg_data['days'] = (uid_reg_data['time'] - uid_reg_data['pre']).map(lambda x: x.days)
    uid_reg_data[~uid_reg_data['pre'].isnull()].to_csv('output.csv', mode='a', header=False)
This cuts the amount of computation by several orders of magnitude. But it was still far from enough for my actual scenario: the login log I was dealing with was on the order of a billion records, with users numbering in the tens of millions. Is there an even simpler way? The answer is yes, there is a little trick. First, the code.
reg_data = reg_data.sort_values(['uid', 'time'])  # sort by uid and time first
reg_data['pre'] = reg_data['time'].shift(1)
reg_data['uid0'] = reg_data['uid'].shift(1)
reg_data['days'] = (reg_data['time'] - reg_data['pre']).map(lambda x: x.days)
reg_data_f = reg_data[reg_data['uid'] == reg_data['uid0']]
The code above takes advantage of pandas's vectorized computation and bypasses the per-uid loop, the most time-consuming part of the calculation. If the data contained only a single uid, sorting and calling shift(1) would already give every record its previous login time; but the real login data mixes many different uids, so we also shift the uid column into uid0 and keep only the records where uid matches uid0.
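As a quick sanity check, here is the same whole-table shift trick applied to the small sample table from earlier; the expected intervals in the last comment were worked out by hand:

import pandas as pd

# the sample login records shown earlier
reg_data = pd.DataFrame({
    'uid':  [111, 112, 111, 113, 113, 112],
    'time': pd.to_datetime(['2016-05-01', '2016-05-02', '2016-05-03',
                            '2016-05-04', '2016-05-05', '2016-05-06']),
})

reg_data = reg_data.sort_values(['uid', 'time'])
reg_data['pre'] = reg_data['time'].shift(1)    # previous row's login time
reg_data['uid0'] = reg_data['uid'].shift(1)    # previous row's uid
reg_data['days'] = (reg_data['time'] - reg_data['pre']).dt.days  # .dt.days, equivalent to the map above
reg_data_f = reg_data[reg_data['uid'] == reg_data['uid0']]       # keep rows whose previous row is the same user

print(reg_data_f[['uid', 'time', 'days']])
# expected: uid 111 -> 2 days, uid 112 -> 4 days, uid 113 -> 1 day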
Source: Cloga's Internet Notes