A simple introduction to working with big data in Python using the Pandas Library

In the field of data analysis, Python and R are the most popular languages. An earlier article, "Don't talk about Hadoop, your data is not big enough", pointed out that Hadoop only becomes a reasonable technology choice once the data size exceeds 5 TB. This time I got hold of nearly a hundred million rows of log data; tens of millions of rows is already the bottleneck for query analysis in a relational database, and having previously used Hadoop to classify large amounts of text, I decided this time to use Python to process the data:

Hardware environment
CPU: 3.5 GHz Intel Core i7
Memory: 32 GB 1600 MHz DDR3
HDD: 3 TB Fusion Drive
Data analysis tools
Python: 2.7.6
pandas: 0.15.0
IPython Notebook: 2.0.0

The source data is shown in the following table:

Data read

Start IPython Notebook and load the pylab environment:

ipython notebook --pylab=inline

pandas provides IO tools that can read large files in chunks. Testing the performance, loading 98 million rows of data took only about 263 seconds, which is pretty good.

import pandas as pd

reader = pd.read_csv('data/servicelogs', iterator=True)
try:
    df = reader.get_chunk(100000000)
except StopIteration:
    print "Iteration is stopped."


Reading with different chunk sizes and then calling pandas.concat to join the chunks into a single DataFrame, the speed improvement is most noticeable with chunkSize set at around 10 million rows.

loop = True
chunkSize = 100000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        chunks.append(chunk)
    except StopIteration:
        loop = False
        print "Iteration is stopped."
df = pd.concat(chunks, ignore_index=True)

The statistics are as follows: "read time" is the time to read the data alone, and "total time" is reading plus the pandas concat operation. Depending on the total data volume, merging 5 to 50 DataFrame objects gives the best performance.

If you use the Python shell provided by Spark and likewise write pandas code to load the data, it takes only about 25 seconds; it appears that Spark has optimized Python's memory usage.
Data cleansing

pandas provides the DataFrame.describe method for viewing a data summary, which covers both a data preview (60 rows are output by default) and row and column statistics. Because the source data usually contains some empty values, and even entirely empty columns, which hurt the time and efficiency of the analysis, this invalid data needs to be dealt with after previewing the data summary.
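
As a rough sketch (assuming the DataFrame loaded above is named df; this is not the article's own code), the preview could be done like this:

print df.describe()   # count, mean, std, min, quartiles and max for each numeric column
print df.head()       # first few rows of the table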

First call the DataFrame.isnull() method to see which values in the data table are null; its opposite is DataFrame.notnull(). pandas evaluates every cell in the table to True/False as the result, as sketched below:
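
A minimal sketch of that check (df is assumed to be the DataFrame loaded earlier):

null_mask = df.isnull()      # True wherever a cell is empty (NaN)
print null_mask.head()       # preview the boolean table
print df.notnull().sum()     # non-null count for each column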

pandas' non-null computation is fast: 98 million rows took only 28.7 seconds. Once you have this preliminary information, you can remove the empty columns from the table. I tried two ways: computing, column by column, the list of column names that actually contain data, and DataFrame.dropna(); they took 367 seconds and 345.3 seconds respectively. But on inspection, all the rows were gone after dropna(). A look at the pandas manual showed that, without parameters, dropna() removes every row that contains a null value. To remove only the columns whose values are all null, the axis and how parameters have to be added:

df.dropna(axis=1, how='all')

Six of the 14 columns were removed this way, and the time spent was only 85.9 seconds.
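
For comparison, the first approach mentioned above, keeping only the columns that contain at least one non-null value, might be sketched like this (not the author's exact code):

# Keep the columns that have at least one non-null value;
# the effect is the same as df.dropna(axis=1, how='all').
non_empty_cols = [col for col in df.columns if df[col].notnull().any()]
df = df[non_empty_cols]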

The next step is to handle the empty values in the remaining rows. Testing showed that replacing the default null value NaN with an empty string via DataFrame.replace() saves some space; but for the CSV file as a whole an empty column is just one extra ",", so removing the 6 columns from 98 million rows only saved about 200 MB of space. Further data cleansing still comes down to removing useless data and merging.
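
The replacement itself is a one-liner. A sketch, assuming NumPy is imported as np (the exact call is not shown in the article); DataFrame.fillna('') would have the same effect:

import numpy as np

df = df.replace(np.nan, '')   # store an empty string instead of NaN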

When discarding data columns, besides those full of invalid values, some columns that are redundant for the task also need to be cleaned up in this step, such as a serial-number column that is just the concatenation of two other fields, and a type-description column. After dropping these, the new data file is 4.73 GB, a full 4.04 GB smaller!
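
A sketch of that step; the column names and the output path below are hypothetical, since the article does not give them:

# 'SERIAL_NO' and 'TYPE_DESC' stand in for the redundant columns described above.
df = df.drop(['SERIAL_NO', 'TYPE_DESC'], axis=1)
df.to_csv('data/servicelogs_clean', index=False)   # write out the slimmed-down file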

Data processing

With DataFrame.dtypes you can view the data type of each column. pandas reads int and float64 columns natively; everything else is handled as object, and the typical conversion target is datetime. The DataFrame.astype() method converts the data format of an entire DataFrame or of a single column, and it supports Python and NumPy data types.

df['name'] = df['name'].astype(np.datetime64)

For data aggregation, I tested DataFrame.groupby, DataFrame.pivot_table and pandas.merge: grouping 98 million rows x 3 columns took 99 seconds, joining the tables took 26 seconds, and generating the pivot table was even faster, only 5 seconds.

df.groupby(['No', 'time', 'SVID']).count()  # group
fullData = pd.merge(df, tranCodeData)[['No', 'SVID', 'time', 'CLASS', 'TYPE']]  # join
actions = fullData.pivot_table('SVID', columns='TYPE', aggfunc='count')  # pivot table

Pie chart of the TRADE/query ratio generated from the pivot table:
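
The article only shows the resulting chart; the plotting call probably looked something like this sketch (not the original code):

actions.plot(kind='pie', figsize=(6, 6))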

Add the log time to the pivot table and output the daily TRADE/query ratio as an area chart:

total_actions = fullData.pivot_table('SVID', index='time', columns='TYPE', aggfunc='count')
total_actions.plot(subplots=False, figsize=(18, 6), kind='area')

In addition, the query performance of pandas DataFrames is also very good: within 7 seconds a sub-table with all transaction data of a given type can be generated:

tranData = fullData[fullData['Type'] == 'Transaction']

The size of this sub-table is [10250666 rows x 5 columns]. Some basic data-processing scenarios have been covered up to this point. The results are enough to show that, as long as the data is not "bigger than 5 TB", Python's performance makes it a feasible choice for data analysts who are already skilled with statistical analysis languages.
