A simple introduction to using the Pandas library to process large data in Python


The hottest languages in the field of data analysis are Python and R, and an earlier article, "Don't be ridiculous, your data isn't that big", pointed out that Hadoop is a reasonable technology choice only at scales beyond 5 TB of data. This time I received nearly 100 million log records; tens of millions of rows is already the bottleneck for query analysis in a relational database, and having previously used Hadoop to classify large volumes of text, I decided this time to use Python to process the data:

Hardware environment
CPU: 3.5 GHz Intel Core i7
Memory: 1600 MHz DDR3
Hard disk: 3 TB Fusion Drive
Data analysis tools
Python: 2.7.6
Pandas: 0.15.0
IPython Notebook: 2.0.0

The source data is shown in the following table:

Data Reading

Start IPython Notebook and load the pylab environment:

ipython notebook --pylab=inline

Pandas provides IO tools that can read large files in chunks. Testing the performance, loading the full 98 million rows of data took only about 263 seconds, which is pretty good.

import pandas as pd

reader = pd.read_csv('data/servicelogs', iterator=True)
try:
    df = reader.get_chunk(100000000)
except StopIteration:
    print "Iteration is stopped."


Reading with different chunk sizes and then calling pandas.concat to join the chunks into a single DataFrame: the speed improvement is most noticeable with chunksize set to around 10 million rows.

loop = True
chunksize = 100000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunksize)
        chunks.append(chunk)
    except StopIteration:
        loop = False
        print "Iteration is stopped."
df = pd.concat(chunks, ignore_index=True)

The timing statistics are as follows: read time is the time spent reading the data, and total time is the read plus the pandas concat operation. Depending on the data volume, merging 5 to 50 DataFrame objects gives the best performance.
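As a rough sketch of how such timings could be collected; the file path and the candidate chunk sizes here are illustrative, not the ones from the original test:

import time
import pandas as pd

# Hypothetical benchmark: read the same file with several chunk sizes
# and time the read phase and the read + concat phase separately.
for chunksize in [100000, 1000000, 10000000]:
    start = time.time()
    reader = pd.read_csv('data/servicelogs', iterator=True)
    chunks = []
    while True:
        try:
            chunks.append(reader.get_chunk(chunksize))
        except StopIteration:
            break
    read_time = time.time() - start
    df = pd.concat(chunks, ignore_index=True)
    total_time = time.time() - start
    print "chunksize=%d read time=%.1fs total time=%.1fs" % (chunksize, read_time, total_time)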

If you use the Python shell provided by Spark and write the same Pandas code to load the data, it takes as little as about 25 seconds; it seems Spark has optimizations for Python memory usage.
Data Cleaning

Pandas provides the DataFrame.describe method for viewing a data summary, including a preview of the data (by default a total of 60 rows is output) and row and column statistics. Since the source data usually contains some null values or even entirely empty columns, which affect the time and efficiency of the analysis, the invalid data needs to be processed after previewing the summary.
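A minimal sketch of that preview, assuming df is the DataFrame loaded above:

print df            # preview; pandas truncates the output to 60 rows by default
print df.describe() # per-column summary statistics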

The DataFrame.isnull() method is called first to see which values in the table are null; its opposite is DataFrame.notnull(). Pandas evaluates every value in the table for null and returns True/False as the result.
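A minimal sketch of this null check, counting the missing values per column (df as above):

null_flags = df.isnull()   # True where a value is missing, False otherwise
print null_flags.sum()     # number of missing values in each column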

Pandas computes this very quickly; the 98 million rows took only 28.7 seconds. With this preliminary information, the empty columns can be removed from the table. I tried two approaches, selecting the non-empty columns by column name and calling DataFrame.dropna(), which took 367 seconds and 345.3 seconds respectively. However, a check showed that after dropna() all of the rows were gone. The Pandas manual explains that, called without parameters, dropna() removes every row that contains a null value. To remove only the columns that are entirely null, the axis and how parameters must be added:

df.dropna(axis=1, how='all')

This removed 6 of the 14 columns, and took only 85.9 seconds.

The next step is to process the null values in the remaining rows. Testing showed that replacing them with an empty string via DataFrame.replace() saves some space compared with the default NaN value; but across the whole CSV file an empty column value only saves one extra ",", so for the 98 million rows with the 6 columns already removed this saved only about 200 MB of space. Further data cleaning still comes down to removing unneeded data and merging.
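A minimal sketch of that replacement, assuming the remaining null values are to become empty strings:

import numpy as np

df = df.replace(np.nan, '')   # equivalently: df = df.fillna('')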

As for dropping data columns, besides the invalid values and the requirements of the task, some columns that are redundant in the table itself also need to be cleaned up in this step, for example a serial-number column that is merely the concatenation of two other fields, or a type-description column. After discarding these, the new data file is 4.73 GB, a full 4.04 GB smaller!
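A minimal sketch of dropping such redundant columns; the column names here are hypothetical, since the article does not list them:

# 'SERIAL_NO' stands in for the column spliced from two other fields,
# 'TYPE_DESC' for the type-description column mentioned above.
df = df.drop(['SERIAL_NO', 'TYPE_DESC'], axis=1)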

Data Processing

Using DataFrame.dtypes you can view the data type of each column. Pandas reads int and float64 natively; everything else is treated as object, and date/time columns need an explicit format conversion. The DataFrame.astype() method converts the data format of an entire DataFrame or of a single column, and supports Python and NumPy data types.

import numpy as np

df['name'] = df['name'].astype(np.datetime64)

For data aggregation, I tested DataFrame.groupby, DataFrame.pivot_table and pandas.merge. Grouping 98 million rows x 3 columns took 99 seconds, joining the tables took 26 seconds, and generating the pivot table was even faster, taking only 5 seconds.

df.groupby(['NO', 'TIME', 'SVID']).count()  # group
fullData = pd.merge(df, trancodeData)[['NO', 'SVID', 'TIME', 'CLASS', 'TYPE']]  # join
actions = fullData.pivot_table('SVID', columns='TYPE', aggfunc='count')  # pivot table

A pie chart of the transaction/query proportions generated from the pivot table:
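A minimal sketch of how such a pie chart could be drawn from the pivot table above (the exact plotting call is not given in the article):

# 'actions' is the pivot table computed above (count of SVID per TYPE)
actions.plot(kind='pie', autopct='%1.1f%%', figsize=(6, 6))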

Adding the log time to the pivot table and plotting the daily transaction/query proportions:

total_actions = fullData.pivot_table('SVID', index='TIME', columns='TYPE', aggfunc='count')
total_actions.plot(subplots=False, figsize=(18, 6), kind='area')

In addition, the query functionality Pandas provides on DataFrames also performs very well: within 7 seconds it can query out and generate the sub-table of all transactions of a given type:

tranData = fullData[fullData['TYPE'] == 'Transaction']

The size of this sub-table is [10250666 rows x 5 columns]. This completes some of the basic scenarios of data processing. The results show that, for data that is not "big data" (> 5 TB), Python's performance already lets data analysts who are skilled in statistical analysis languages work with ease.
