Data extraction is a frequent requirement in the daily work of analysts. such as a user's loan amount, the total interest income for a particular month or quarter, the amount of loans and the number of pens for a specific period of time, the number of loans greater than 5000 yuan, and so on. This article describes how to extract data by using Python in a specific dimension or condition to complete the data extraction requirements.
Preparatory work
The first is to prepare the work, import the library to use, read and create the data table named Loandata.
Import NumPy as Npimport pandas as PDLOANDATA=PD. DataFrame (Pd.read_excel (' loan_data.xlsx '))
Set index fields
member_id
set the column as an indexed field before starting to extract the data. It then starts extracting the data.
Loandata = Loandata.set_index (' member_id ')
Extracting information by row
The first step is to extract data by row, such as extracting information from a user. The following uses the IX function to member_id
extract the user information for 1303503.
LOANDATA.IX[1303503]
Extracting Information by column
The second step is to extract data by column, such as extracting all the information from the user's working Life column, following the specific code and extract results, showing the working age information for all users.
loandata.ix[:, ' emp_length ']
Extracting information by rows and columns
The third step is to extract information by rows and columns, put together the query criteria for the previous two, query specific information for a specific user, and the following is the information for the user who is querying member_id
for 1303503 emp_length
.
loandata.ix[1303503, ' emp_length ']
On the basis of the above, continue to increase the condition, add a row to query two specific users of the loan amount information. The specific code and query results are as follows. The results list the code amounts for the two users separately.
LOANDATA.IX[[1303503,1298717], ' loan_amnt ']
After the previous code to add the sum
function, the results are summed, the same query for two specific users of the loan, the following results directly gives the sum of the loan amount.
LOANDATA.IX[[1303503,1298717], ' loan_amnt '].sum ()
In addition to adding query criteria for rows, you can also increase the query criteria for a column, which in the following code queries the loan amount and annual income of a specific user, showing the results of the two fields respectively.
loandata.ix[1303503,[' loan_amnt ', ' annual_inc ']
Queries for multiple columns can also be summed, adding a function after the preceding code sum
, summing the loan amount of the user to two fields of annual income, and showing the results.
loandata.ix[1303503,[' loan_amnt ', ' Annual_inc ']].sum ()
Extract information for a specific date
Another common requirement in data extraction is to summarize and extract data by date dimension, such as monthly, quarterly summary data extraction, and data extraction by a specific time period, and so on.
Set index fields
First change the index field to a date field in the datasheet, where Issue_d is set to the index field of the datasheet. Query and extract data by date.
Loandata = Loandata.set_index (' issue_d ')
Extract information by date
The following code queries all data for the year 2016.
loandata[' 2016 ']
Add the month based on the previous code and query all March 2016 data.
loandata[' 2016-03 ']
Continue to increase the date on the basis of the previous code, querying all data for June 16, 2016.
loandata[' 2016-06-16 ']
In addition to querying on a single date, you can query data by date, and the following code queries all data from January 2016 to May. The following shows the specific query results, you can find that the date of the data is in January-May, but by the date dimension, which requires us to summarize the data by month.
loandata[' 2016-01 ': ' 2016-05 ']
Summarize information by date
The functions in Pandas resample
can complete the aggregation of dates, including by hour dimension, Date dimension, month dimension, quarter and year dimension, and so on. Let's explain the following separately. The first is to sum the data in the preceding data table by the dimensions of the week. In the following code, w means that the aggregation is by week, how is the calculation of the data, and the default is to calculate the average, which is set to, for the sum
sum calculation.
Loandata.resample (' W ', how=sum). Head (10)
Change W to M, and the data becomes aggregated by month. The calculation is still summed. It is necessary to note that the resample
function shows all successive periods of time, for example, the previous week-by-month aggregation operation shows a continuous period of Sunday, where monthly operations show consecutive months in the results, and if there is no data for a time period, it is displayed as a Nan value.
Loandata.resample (' M ', how=sum)
To change m in the previous code to Q, the data is aggregated on a quarterly basis and the calculation is still summed. From the data table below, the dates are displayed on the last day of each quarter, and if you want to display the first day of each quarter, you can change to QS.
Loandata.resample (' Q ', how=sum)
Change the Q in the previous code to a, which aggregates the data by year, and the calculation is still summed.
Loandata.resample (' A ', how=sum)
The previous method is to aggregate and sum the entire data table, and if you only need to aggregate and sum the values of one field, you can increase the name of the column after the data table. Here is the sum of the loan amount fields aggregated by month and filled with 0 empty values.
loandata[' loan_amnt '].resample (' M ', how=sum). Fillna (0)
Add a numeric field on top of the previous code and add it to count in the subsequent calculation len
. In the following results, the loan amount and the interest income are aggregated monthly, and the sum and count are calculated.
loandata[[' loan_amnt ', ' Total_rec_int ']].resample (' M ', How=[len,sum])
Sometimes we need to aggregate and calculate only the data for a certain time period, and the following code aggregates data from January 2016 to May by month and calculates the sum. Fills the null value with 0.
loandata[' 2016-01 ': ' 2016-05 '].resample (' M ', how=sum). Fillna (0)
Or only some of the qualifying data is aggregated and calculated. The following code aggregates monthly for a loan amount greater than 5000, and calculates the sum. Null values are populated with 0.
loandata[loandata[' loan_amnt ']>5000].resample (' M ', how=sum). Fillna (0)
In addition to weeks, months, quarters, and years, resample
functions can also aggregate dates in the following ways.
The specific table and description are given below.