Links: http://www.jb51.net/article/90946.htm
Data extraction is a frequent requirement in the daily work of analysts. such as a user's loan amount, the total interest income for a particular month or quarter, the amount of loans and the number of pens for a specific period of time, the number of loans greater than 5000 yuan, and so on. This article describes how to extract data by using Python in a specific dimension or condition to complete the data extraction requirements.
preparatory work
The first is to prepare the work, import the library to use, read and create the data table named Loandata.
123 |
import numpy as np import pandas as pd loandata = pd.DataFrame(pd.read_excel( ‘loan_data.xlsx‘ )) |
Set index fields
Set the member_id column as an indexed field before starting to extract the data. It then starts extracting the data.
1 |
Loandata = loandata.set_index( ‘member_id‘ ) |
Extracting information by row
The first step is to extract data by row, such as extracting information from a user. The following uses the IX function to extract the user information for member_id 1303503.
Extracting Information by column
The second step is to extract data by column, such as extracting all the information from the user's working Life column, following the specific code and extract results, showing the working age information for all users.
1 |
loandata.ix[:, ‘emp_length‘ ] |
Extracting information by rows and columns
The third step is to extract information by row and column, put the query conditions of the previous two together, query the specific information of specific users, the following is emp_length information of the user who query member_id to 1303503.
1 |
loandata.ix[ 1303503 , ‘emp_length‘ ] |
On the basis of the above, continue to increase the condition, add a row to query two specific users of the loan amount information. The specific code and query results are as follows. The results list the code amounts for the two users separately.
1 |
loandata.ix[[ 1303503 , 1298717 ], ‘loan_amnt‘ ] |
Add the SUM function to the previous code, sum the results, and query the loan of two specific users, and give the summary value of the loan amount directly in the results below.
1 |
loandata.ix[[ 1303503 , 1298717 ], ‘loan_amnt‘ ]. sum () |
In addition to adding query criteria for rows, you can also increase the query criteria for a column, which in the following code queries the loan amount and annual income of a specific user, showing the results of the two fields respectively.
1 |
loandata.ix[ 1303503 ,[ ‘loan_amnt‘ , ‘annual_inc‘ ]] |
Queries for multiple columns can also be summed, adding the SUM function after the preceding code, summing the loan amount of the user to two fields of annual income, and showing the results.
1 |
loandata.ix[ 1303503 ,[ ‘loan_amnt‘ , ‘annual_inc‘ ]]. sum () |
extract information for a specific date
Another common requirement in data extraction is to summarize and extract data by date dimension, such as monthly, quarterly summary data extraction, and data extraction by a specific time period, and so on.
Set index fields
First change the index field to a date field in the datasheet, where Issue_d is set to the index field of the datasheet. Query and extract data by date.
1 |
loandata = loandata.set_index( ‘issue_d‘ ) |
Extract information by date
The following code queries all data for the year 2016.
Add the month based on the previous code and query all March 2016 data.
Continue to increase the date on the basis of the previous code, querying all data for June 16, 2016.
In addition to querying on a single date, you can query data by date, and the following code queries all data from January 2016 to May. The following shows the specific query results, you can find that the date of the data is in January-May, but by the date dimension, which requires us to summarize the data by month.
1 |
loandata[ ‘2016-01‘ : ‘2016-05‘ ] |
summarize information by date
The Resample function in pandas can accomplish the aggregation of dates, including by hour dimension, Date dimension, monthly dimension, quarter and year dimension, etc. Let's explain the following separately. The first is to sum the data in the preceding data table by the dimensions of the week. The following code, W, means that the aggregation is by week, how to represent how the data is computed, by default the average is computed, and this is set to sum, and the sum is calculated.
1 |
loandata.resample( ‘W‘ ,how = sum ).head( 10 ) |
Change W to M, and the data becomes aggregated by month. The calculation is still summed. It should be noted here that the Resample function shows all successive periods of time, such as the previous week-by-month aggregation operation showing a continuous period of Sunday, where monthly operations show consecutive months in the results, and if there is no data for a time period, it is displayed as a Nan value.
1 |
loandata.resample( ‘M‘ ,how = sum ) |
To change m in the previous code to Q, the data is aggregated on a quarterly basis and the calculation is still summed. From the data table below, the dates are displayed on the last day of each quarter, and if you want to display the first day of each quarter, you can change to QS.
1 |
loandata.resample( ‘Q‘ ,how = sum ) |
Change the Q in the previous code to a, which aggregates the data by year, and the calculation is still summed.
1 |
loandata.resample( ‘A‘ ,how = sum ) |
The previous method is to aggregate and sum the entire data table, and if you only need to aggregate and sum the values of one field, you can increase the name of the column after the data table. Here is the sum of the loan amount fields aggregated by month and filled with 0 empty values.
1 |
loandata[ ‘loan_amnt‘ ].resample( ‘M‘ ,how = sum ).fillna( 0 ) |
Add a numeric field to the previous code and add Len to count in the subsequent calculation. In the following results, the loan amount and the interest income are aggregated monthly, and the sum and count are calculated.
1 |
loandata[[ ‘loan_amnt‘ , ‘total_rec_int‘ ]].resample( ‘M‘ ,how = [ len , sum ]) |
Sometimes we need to aggregate and calculate only the data for a certain time period, and the following code aggregates data from January 2016 to May by month and calculates the sum. Fills the null value with 0.
1 |
loandata[ ‘2016-01‘ : ‘2016-05‘ ].resample( ‘M‘ ,how = sum ).fillna( 0 ) |
Or only some of the qualifying data is aggregated and calculated. The following code aggregates monthly for a loan amount greater than 5000, and calculates the sum. Null values are populated with 0.
1 |
loandata[loandata[ ‘loan_amnt‘ ]> 5000 ].resample( ‘M‘ ,how = sum ).fillna( 0 ) |
In addition to weeks, months, quarters, and years, the Resample function can also aggregate dates in the following ways. The specific table and description are given below.
Read MORE:HTTP://BLUEWHALE.CC/2016-08-22/DATA-EXTRACTION-USING-PYTHON.HTML#IXZZ4V7D7T3MH
Original link: http://bluewhale.cc/2016-08-22/data-extraction-using-python.html
Summary of "Learning" python Methods for data extraction "reproduced"