Pandas Data Processing Example display: Global listing of listed companies

Source: Internet
Author: User

There is now a list of the top 2000 global listed companies in Forbes 2016, but the original data is not standardized and needs to be processed before it can be used further.

In this paper, we introduce the data pandas by using the example operation.

As usual, let me start by saying my operating environment, as follows:

    • Windows 7, 64-bit

    • Python 3.5

    • Pandas 0.19. Version 2

After getting the raw data, let's take a look at the data and think about what kind of data results we need.

Here's the raw data:

In this article, we need the following preliminary results for continued use later.

You can see that in the raw data, the data that is associated with the Enterprise ("Sales", "profits", "Assets", "Market_value"), is currently not a numeric type that can be used to calculate.

The original content contains the currency symbol "$", "-", a string of pure letters, and other information that we think is abnormal. More importantly, the units of the data are not consistent. They are expressed in "B" (billion, 1 billion) and "M" (Million, million) respectively. Unit unification is required prior to subsequent calculations.

1 Treatment Method Method-1

The first thought of the process is to divide the data information by 1 billion (' B ') and million (' M ') respectively, processing, and finally merging together. The procedure is shown below.

    • Load the data and add the name of the column

Import Pandas as pddf_2016 = Pd.read_csv (' data_2016.csv ', encoding= ' GBK ', header=none) # update column name df_2016.columns = [' Year ', ' Rank ', ' company_cn ', ' company_en ',                   ' country_en ', ' Sales ', ' profits ', ' Assets ', ' Market_value ']print (' The Shape of DataFrame: ', df_2016.shape) print (df_2016.dtypes) df_2016.head (3)
    • Get data in units of 1 billion (' B ')

# Data in Unit B (billion, 1 billion) Df_2016_b = df_2016[df_2016[' Sales '].str.endswith (' B ')]print (df_2016_b.shape) df_2016_b
    • Get data in million (' M ') units

# Data Unit m data (Million, million) Df_2016_m = df_2016[df_2016[' Sales '].str.endswith (' m ')]print (df_2016_m.shape) df_2016_m

This method is relatively simple to understand, but it can be cumbersome to operate, especially if there are many columns of data that need to be processed, which can take a lot of time.

Further processing, I will not describe here. Of course, you can try this method.

Here's a little simpler way to do this.

2 Treatment Method Method-2

2.1 Loading Data

The first step is to load the data, which is the same as Method-1.

below to process the ' Sales ' column

2.2 Replacing the associated exception character

The first is to replace the associated exception character, including the dollar's currency symbol ' $ ', the pure letter string ' undefined ', and ' B '. Here, we want to unify the data units into 1 billion, so ' B ' can be replaced directly. and ' M ' requires more processing steps.

2.3 Handling of ' M ' related data

Processing of data containing millions of "M" units, that is, the end of "M" data, the idea is as follows:

(1) Set the search condition mask;

(2) The replacement string "M" is a null value

(3) Convert to digital using pd.to_numeric ()

(4) divided by 1000, converted to $1 billion, consistent with data from other rows

The above two steps are related to the following code:

# Replace dollar sign df_2016[' sales ' = df_2016[' sales '].str.replace (' $ ', ') # # View outliers, all letters ("undefined") # df_2016[df_2016[' Sales '] . Str.isalpha ()]# Replace exception value "undefined" to blank # df_2016[' sales ' = df_2016[' sales '].str.replace (' undefined ', ') df_2016[' Sales '] = df_2016[' Sales '].str.replace (' ^[a-za-z]+$ ', ') # replacement symbol $1 billion "B" is blank, the number itself represents $1 billion for the unit df_2016[' Sales ' = df_2016 [' Sales '].str.replace (' B ', ') # Processing of data containing millions of "M", i.e. data ending with "M" # Idea: # (1) Set Find condition mask;# (2) Replace string "M" for Null Value # (3) with Pd.to_numeric () Convert to Number # (4) divided by 1000, converted to $1 billion, consistent with data from other rows mask = df_2016[' Sales '].str.endswith (' M ') df_2016.loc[mask, ' sales ' = pd.to_ Numeric (df_2016.loc[mask, ' Sales '].str.replace (' M ', ') '/1000df_2016[' sales '] = pd.to_numeric (df_2016[' sales ') Print (' The Shape of DataFrame: ', df_2016.shape) print (df_2016.dtypes) df_2016.head (3)

Work with other columns in similar ways

As you can see, this method is much more convenient than the first method. Of course, this method for each column of data dataframe to do related operations, if the number of columns, it is still more cumbersome.

Is there a more convenient way to do it? The answer is yes.

Interrupted by a hard wide: technical articles forwarded too much. The article comes from the public "Python Data Path" (id:pydataroad).

3 Treatment Method Method-3

On the basis of Method-2, the processing method is written into a more general data processing function, according to the structure of the data, expand more applicability, it can be more convenient to process the relevant data.

3.1 Loading Data

The first step is to load the data, which is the same as Method-1.

3.2 Writing a custom function for data processing

Refer to the processing process of Method-2, write the custom function ' pro_col ' of data processing, and expand the other replacement functions on the basis of Method-2 to make it applicable to these four columns of data ("Sales", "profits", "Assets", "market_ Value ").

function to write the following code:

def pro_col (DF, col):   # Replace the related string, if there are more substitution cases, you can add df[col] = Df[col].str.replace (' $ ', ')    df[col] = Df[col]. Str.replace (' ^[a-za-z]+$ ', ')    df[col] = df[col].str.replace (' B ', ') # Note here is '-$ ', that is, '-' End With '-', because there is a negative number df[col] = DF [Col].str.replace ('-$ ', ')    df[col] = df[col].str.replace (', ', ') # Processing of data containing millions of "m" units, i.e. data ending with "M" # Idea: # (1) Set Find condition mask;# (2) Replace string "M" is null # (3) converted to number # (4) by 1000, converted to 1 billion USD, with data consistent with other rows, mask = Df[col].str.endswith (' M ')    df.loc[mask, col] = Pd.to_numeric (Df.loc[mask, Col].str.replace (' m ', '))/1000# converts the number of the character type to the number type Df[col] = pd.to_ Numeric (Df[col]) return DF

3.3 Apply a custom function

For each column of dataframe, the custom function is applied to data processing to obtain the desired result.

Pro_col (df_2016, ' Sales ') pro_col (df_2016, ' profits ') pro_col (df_2016, ' Assets ') pro_col (df_2016, ' Market_value ') Print (' The Shape of DataFrame: ', df_2016.shape) print (df_2016.dtypes) df_2016.head ()

Of course, if the number of columns in the Dataframe is particularly numerous, you can use a for loop, which makes the code more concise. The code is as follows:

cols = [' Sales ', ' profits ', ' Assets ', ' Market_value ']for col in cols:    pro_col (df_2016, col) print (' The shape of Datafra Me: ', df_2016.shape) print (df_2016.dtypes) df_2016.head ()

After the final processing, the data obtained are as follows:

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.