There is now a list of the top 2000 global listed companies in Forbes 2016, but the original data is not standardized and needs to be processed before it can be used further.
In this paper, we introduce the data pandas by using the example operation.
As usual, let me start by saying my operating environment, as follows:
Windows 7, 64-bit
Python 3.5
Pandas 0.19. Version 2
After getting the raw data, let's take a look at the data and think about what kind of data results we need.
Here's the raw data:
In this article, we need the following preliminary results for continued use later.
You can see that in the raw data, the data that is associated with the Enterprise ("Sales", "profits", "Assets", "Market_value"), is currently not a numeric type that can be used to calculate.
The original content contains the currency symbol "$", "-", a string of pure letters, and other information that we think is abnormal. More importantly, the units of the data are not consistent. They are expressed in "B" (billion, 1 billion) and "M" (Million, million) respectively. Unit unification is required prior to subsequent calculations.
1 Treatment Method Method-1
The first thought of the process is to divide the data information by 1 billion (' B ') and million (' M ') respectively, processing, and finally merging together. The procedure is shown below.
Import Pandas as pddf_2016 = Pd.read_csv (' data_2016.csv ', encoding= ' GBK ', header=none) # update column name df_2016.columns = [' Year ', ' Rank ', ' company_cn ', ' company_en ', ' country_en ', ' Sales ', ' profits ', ' Assets ', ' Market_value ']print (' The Shape of DataFrame: ', df_2016.shape) print (df_2016.dtypes) df_2016.head (3)
# Data in Unit B (billion, 1 billion) Df_2016_b = df_2016[df_2016[' Sales '].str.endswith (' B ')]print (df_2016_b.shape) df_2016_b
# Data Unit m data (Million, million) Df_2016_m = df_2016[df_2016[' Sales '].str.endswith (' m ')]print (df_2016_m.shape) df_2016_m
This method is relatively simple to understand, but it can be cumbersome to operate, especially if there are many columns of data that need to be processed, which can take a lot of time.
Further processing, I will not describe here. Of course, you can try this method.
Here's a little simpler way to do this.
2 Treatment Method Method-2
2.1 Loading Data
The first step is to load the data, which is the same as Method-1.
below to process the ' Sales ' column
2.2 Replacing the associated exception character
The first is to replace the associated exception character, including the dollar's currency symbol ' $ ', the pure letter string ' undefined ', and ' B '. Here, we want to unify the data units into 1 billion, so ' B ' can be replaced directly. and ' M ' requires more processing steps.
2.3 Handling of ' M ' related data
Processing of data containing millions of "M" units, that is, the end of "M" data, the idea is as follows:
(1) Set the search condition mask;
(2) The replacement string "M" is a null value
(3) Convert to digital using pd.to_numeric ()
(4) divided by 1000, converted to $1 billion, consistent with data from other rows
The above two steps are related to the following code:
# Replace dollar sign df_2016[' sales ' = df_2016[' sales '].str.replace (' $ ', ') # # View outliers, all letters ("undefined") # df_2016[df_2016[' Sales '] . Str.isalpha ()]# Replace exception value "undefined" to blank # df_2016[' sales ' = df_2016[' sales '].str.replace (' undefined ', ') df_2016[' Sales '] = df_2016[' Sales '].str.replace (' ^[a-za-z]+$ ', ') # replacement symbol $1 billion "B" is blank, the number itself represents $1 billion for the unit df_2016[' Sales ' = df_2016 [' Sales '].str.replace (' B ', ') # Processing of data containing millions of "M", i.e. data ending with "M" # Idea: # (1) Set Find condition mask;# (2) Replace string "M" for Null Value # (3) with Pd.to_numeric () Convert to Number # (4) divided by 1000, converted to $1 billion, consistent with data from other rows mask = df_2016[' Sales '].str.endswith (' M ') df_2016.loc[mask, ' sales ' = pd.to_ Numeric (df_2016.loc[mask, ' Sales '].str.replace (' M ', ') '/1000df_2016[' sales '] = pd.to_numeric (df_2016[' sales ') Print (' The Shape of DataFrame: ', df_2016.shape) print (df_2016.dtypes) df_2016.head (3)
Work with other columns in similar ways
As you can see, this method is much more convenient than the first method. Of course, this method for each column of data dataframe to do related operations, if the number of columns, it is still more cumbersome.
Is there a more convenient way to do it? The answer is yes.
Interrupted by a hard wide: technical articles forwarded too much. The article comes from the public "Python Data Path" (id:pydataroad).
3 Treatment Method Method-3
On the basis of Method-2, the processing method is written into a more general data processing function, according to the structure of the data, expand more applicability, it can be more convenient to process the relevant data.
3.1 Loading Data
The first step is to load the data, which is the same as Method-1.
3.2 Writing a custom function for data processing
Refer to the processing process of Method-2, write the custom function ' pro_col ' of data processing, and expand the other replacement functions on the basis of Method-2 to make it applicable to these four columns of data ("Sales", "profits", "Assets", "market_ Value ").
function to write the following code:
def pro_col (DF, col): # Replace the related string, if there are more substitution cases, you can add df[col] = Df[col].str.replace (' $ ', ') df[col] = Df[col]. Str.replace (' ^[a-za-z]+$ ', ') df[col] = df[col].str.replace (' B ', ') # Note here is '-$ ', that is, '-' End With '-', because there is a negative number df[col] = DF [Col].str.replace ('-$ ', ') df[col] = df[col].str.replace (', ', ') # Processing of data containing millions of "m" units, i.e. data ending with "M" # Idea: # (1) Set Find condition mask;# (2) Replace string "M" is null # (3) converted to number # (4) by 1000, converted to 1 billion USD, with data consistent with other rows, mask = Df[col].str.endswith (' M ') df.loc[mask, col] = Pd.to_numeric (Df.loc[mask, Col].str.replace (' m ', '))/1000# converts the number of the character type to the number type Df[col] = pd.to_ Numeric (Df[col]) return DF
3.3 Apply a custom function
For each column of dataframe, the custom function is applied to data processing to obtain the desired result.
Pro_col (df_2016, ' Sales ') pro_col (df_2016, ' profits ') pro_col (df_2016, ' Assets ') pro_col (df_2016, ' Market_value ') Print (' The Shape of DataFrame: ', df_2016.shape) print (df_2016.dtypes) df_2016.head ()
Of course, if the number of columns in the Dataframe is particularly numerous, you can use a for loop, which makes the code more concise. The code is as follows:
cols = [' Sales ', ' profits ', ' Assets ', ' Market_value ']for col in cols: pro_col (df_2016, col) print (' The shape of Datafra Me: ', df_2016.shape) print (df_2016.dtypes) df_2016.head ()
After the final processing, the data obtained are as follows: