Using Python pandas to process hundreds of millions of rows of data


In the field of data analysis, Python and R are the most popular languages. An earlier article, "Don't talk about Hadoop, your data is not big enough", pointed out that Hadoop is a reasonable technology choice only when the data size exceeds 5 TB. This time I received nearly 100 million rows of log data; tens of millions of rows is already the bottleneck for query and analysis in a relational database. Having previously used Hadoop to classify large volumes of text, I decided this time to process the data with Python:

    • Hardware environment
      • CPU: 3.5 GHz Intel Core i7
      • Memory: 32 GB 1600 MHz DDR3
      • HDD: 3 TB Fusion Drive
    • Data analysis tools
      • Python: 2.7.6
      • pandas: 0.15.0
      • IPython notebook: 2.0.0

The source data is shown in the following table:

Table          Size                                     Description
ServiceLogs    98,706,832 rows × 14 columns, 8.77 GB    Transaction log data; each trade session can contain multiple transactions
ServiceCodes   286 rows × 8 columns (KB-scale)          Dictionary table of transaction classifications
Data reading

Start IPython notebook and load the pylab environment:

ipython notebook --pylab=inline

Pandas provides IO tools that can read large files in chunks. A quick performance test showed that loading all 98 million rows takes only about 263 seconds, which is quite good.

import pandas as pd

reader = pd.read_csv('data/servicelogs', iterator=True)
try:
    df = reader.get_chunk(100000000)
except StopIteration:
    print "Iteration is stopped."
Rows           1 million    10 million    100 million
ServiceLogs    1 s          –             263 s

Reading with different chunk sizes and then calling pandas.concat to join the DataFrames shows that the speed improvement is most noticeable when chunksize is set to around 10 million rows.

loop = True
chunkSize = 100000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        chunks.append(chunk)
    except StopIteration:
        loop = False
        print "Iteration is stopped."
df = pd.concat(chunks, ignore_index=True)

The statistics are shown below. Read Time is the time spent reading the data, and Total Time is the time for reading plus the pandas concat operation. Depending on the amount of data, merging on the order of 5 to 50 DataFrame objects gives the best performance.

Chunk Size    Read Time (s)    Total Time (s)    Performance
100,000 224.418173 261.358521
200,000 232.076794 256.674154
1,000,000 213.128481 234.934142 √√
2,000,000 208.410618 230.006299 √√√
5,000,000 209.460829 230.939319 √√√
10,000,000 207.082081 228.135672 √√√√
20,000,000 209.628596 230.775713 √√√
50,000,000 222.910643 242.405967
100,000,000 263.574246 263.574246
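
The Read Time / Total Time figures above can be reproduced with a simple timing wrapper. The following is only a minimal sketch (not the author's original benchmark code), reusing the data path and get_chunk loop from the snippets above:

import time
import pandas as pd

chunkSize = 1000000                        # one of the chunk sizes from the table above
reader = pd.read_csv('data/servicelogs', iterator=True)

start = time.time()
chunks = []
while True:
    try:
        chunks.append(reader.get_chunk(chunkSize))
    except StopIteration:
        break
read_time = time.time() - start            # "Read Time": chunked reading only

df = pd.concat(chunks, ignore_index=True)
total_time = time.time() - start           # "Total Time": reading plus the concat step

print "Read Time: %.1f s, Total Time: %.1f s" % (read_time, total_time)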

If you use the Python shell provided by Spark and write the same pandas code to load the data, it takes only about 25 seconds, so Spark appears to have optimized Python's memory usage.

Data cleansing

Pandas provides the DataFrame.describe method for viewing data summaries, along with a data preview (60 rows of output by default) and row/column statistics. Because source data usually contains some empty values or even empty columns, which affect the time and efficiency of data analysis, these invalid data need to be dealt with after previewing the data summary.
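
A minimal sketch of previewing the data in the notebook, using the df loaded above:

df.head()         # first rows of the table, for a quick look at the raw data
df.describe()     # count / mean / std / min / quartiles / max per numeric column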

First call the DataFrame.isnull() method to see which values in the data table are null (the opposite is DataFrame.notnull()). Pandas evaluates every cell in the table for null and fills the result with True/False.
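
A minimal sketch of this null check, again using the df loaded above:

# Flag every cell as True (null) or False (not null)
null_flags = df.isnull()

# Null counts per column make empty or nearly-empty columns easy to spot
null_flags.sum()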

Pandas's non-null computation is fast: 98 million rows take only 28.7 seconds. After getting this preliminary information, the empty columns can be removed from the table. I tried two approaches, computing the names of the non-empty columns and DataFrame.dropna(), which took 367 seconds and 345.3 seconds respectively, but found that after dropna() all the rows were gone. Checking the pandas manual showed that, without arguments, dropna() removes all rows that contain any null value. To remove only the columns whose values are all null, the axis and how parameters need to be added:

df.dropna(axis=1, how='all')

Six of the 14 columns were removed in total, and it took only 85.9 seconds.

The next step is to handle the null values in the remaining rows. Testing showed that replacing them with an empty string via DataFrame.replace() saves some space compared with the default null value NaN. For the CSV file as a whole, however, an empty cell is just an extra ",", so removing those 98 million rows × 6 columns only saves about 200 MB of space. Further data cleansing is still about removing useless data and merging.
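
A minimal sketch of this replacement step, assuming NaN is simply swapped for the empty string as described:

import numpy as np

# Swap the remaining NaN values for empty strings, which tested slightly
# more space-efficient than keeping NaN in this dataset
df = df.replace(np.nan, '')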

As for discarding data columns: besides invalid values and values not needed by the requirements, some columns that are redundant in the table itself also need to be cleaned up in this step, such as a serial-number column that is just the concatenation of two other fields, and a type-description column. Dropping these gives a new data file of 4.73 GB, a full 4.04 GB smaller!
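
A minimal sketch of dropping such columns; the column names used here are hypothetical, since the article does not list them:

# Hypothetical column names: a serial number that merely concatenates two
# other fields, and a free-text type description
df = df.drop(['SERIAL_NO', 'TYPE_DESC'], axis=1)

# Write the slimmed-down data back to disk
df.to_csv('data/servicelogs_clean', index=False)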

Data processing

With DataFrame.dtypes you can see the data type of each column. Pandas reads int and float64 natively; everything else is treated as object, and the typical conversion needed is to datetime. The DataFrame.astype() method converts the data format of an entire DataFrame or of a single column, and supports Python and NumPy data types.

df['name'] = df['name'].astype(np.datetime64)

For data aggregation I tested DataFrame.groupby and DataFrame.pivot_table as well as pandas.merge: a groupby over 98 million rows × 3 columns took 99 seconds, joining the tables took 26 seconds, and generating the pivot table was even faster, taking only 5 seconds.

df.groupby(['NO', 'TIME', 'SVID']).count()  # group

fullData = pd.merge(df, trancodeData)[['NO', 'SVID', 'TIME', 'CLASS', 'TYPE']]  # join

actions = fullData.pivot_table('SVID', columns='TYPE', aggfunc='count')  # pivot table

Pie chart of the TRADE/Query proportions generated from the pivot table:
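
The chart itself is not reproduced here; a minimal sketch that would draw a similar pie chart (using value_counts as an assumed equivalent of the pivot table above) is:

# Share of each transaction TYPE (e.g. TRADE vs. Query) as a pie chart
fullData['TYPE'].value_counts().plot(kind='pie', autopct='%1.1f%%', figsize=(6, 6))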

Add the log time to the pivot table and output a graph of the daily TRADE/Query volumes:

total_actions = fullData.pivot_table('SVID', index='TIME', columns='TYPE', aggfunc='count')
total_actions.plot(subplots=False, figsize=(18, 6), kind='area')

In addition, pandas's DataFrame query/filter performance is also very good: within 7 seconds it can generate a sub-table of all rows whose type is Transaction:

tranData = fullData[fullData['TYPE'] == 'Transaction']  # sub-table of transaction rows

The size of this sub-table is [10250666 rows x 5 columns]. The basic data-processing scenarios are now complete, and the results are enough to show that, for data below the ">5TB" threshold, Python's performance already lets data analysts who are used to statistical analysis languages work comfortably.

http://www.justinablog.com/archives/1357

