Big Data Cleaning Notes

Source: Internet
Author: User
Keywords: big data, big data cleaning, data cleaning

Preface
Big data has been one of the hottest topics since 2017. Rather than a standalone discipline, I think of it as an ingredient: add oil and heat, and it flavors everything. As a Java web developer, I used to "look down on" big data, believing it was not something just anyone could do, and certainly not something a Java beginner like me could handle. But fate loves a joke: the less you want to touch something, the harder it pushes it at you. Our company entered a period of strategic transformation, with big data as one of the development directions. "Luckily", I was assigned to the big data R&D team, and my campaign began. Without further ado, the text begins:

Introduction
Big data: data sets at the TB to EB scale.

Python: a high-level development language (with a famously free-form style).

So the question is: why use Python for big data cleaning? Answer: because of import! Python provides countless libraries, such as pandas, NumPy, and Matplotlib. The accumulated wisdom of these predecessors makes Python a natural choice for big data cleaning, and its rich collection of analysis algorithms further helps us analyze and clean data.
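As a minimal taste of what that "import" buys you (the column name below is made up for illustration, not from this article's project):

```python
# NumPy builds the raw array; pandas wraps it and summarizes it.
import numpy as np
import pandas as pd

values = np.array([1.0, 2.0, 3.0, 4.0])
df = pd.DataFrame({"usage": values})
print(df["usage"].mean())  # arithmetic mean of the column
```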

Main text
Pandas can read from different data source files:

# Import required libraries
import os
import re

import cx_Oracle
import numpy as np
import pandas as pd

How to read an Excel file:

Note: this targets Microsoft Office Excel files; WPS files may raise an error.

xls = pd.read_excel('D:/RData/demo/1234.xlsx')

How to read a CSV file:

data = pd.read_csv("D:/RData/demo/123.csv")
Method of reading Oracle database:

# Prevent garbled characters from the database
os.environ['NLS_LANG'] = 'SIMPLIFIED CHINESE_CHINA.ZHS16GBK'

# 2. Connect to the Oracle database
host = "127.0.0.1"  # database IP
port = "1521"       # port
sid = "DAXON"       # database SID
dsn = cx_Oracle.makedsn(host, port, sid)

# Username "WMIS", password "root" (replace with your own credentials)
conn = cx_Oracle.connect("WMIS", "root", dsn)

# SQL statement; customize it for flexible queries
sql = 'SELECT RACID, RAADDR, RAREADER, RABUSSID, RABUSSDATE, RAUSENUM, RAPRICE, RAREADMONTH FROM WMIS_REC_ACC'

# pandas read_sql stores the query result directly in a DataFrame
result = pd.read_sql(sql, conn)
conn.close()

The data is read into a DataFrame, which makes cleaning operations convenient. The methods below take many parameters; please refer to the pandas API for details.
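For example, read_csv accepts parameters for the delimiter, date parsing, and column types. A small hedged sketch (the CSV content and column names here are invented for illustration):

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice you would pass a file path instead.
csv_text = "ID;NAME;DATE\n1;Zhang;2018-01-02\n2;Li;2018-03-04\n"
df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",               # field delimiter
    parse_dates=["DATE"],  # parse this column as datetime
    dtype={"ID": int},     # force a column type
)
print(df.dtypes)
```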

Data cleaning based on Python
Data deduplication (and deduplication by specified fields):

# Deduplicate all rows
result_noDup = result.drop_duplicates()
# Deduplicate by a specified field
data_drop = result.drop_duplicates(['ID'])

Dropping missing values (use with caution):
result_noNull = result_noDup.dropna()
Print the statistical description of the data to spot abnormal values (only numeric columns are summarized):

# Print the data description
print(result_noNull.describe())
# min: minimum, max: maximum, count: row count, mean: average

Depending on the business, you can choose to remove extreme values:

# Remove abnormal data
result = result_noNull.drop(result_noNull[result_noNull['RAUSENUM'] < 0].index.tolist())
result = result.drop(result[result['RAUSENUM'] == 826].index.tolist())
# Remove abnormal unit-price data
# result = result.drop(result[result['RAPRICE'] == 0].index.tolist())
# Test
print(result.describe())

Data replacement:

When the sample size is small, deleting data is undesirable because it hurts analysis and modeling. Instead, we replace abnormal values with the mean, mode, or median. The mean, median, and mode can be obtained by iterating over the fields, which will not be shown here.


# Replace values
result['RAUSENUM'] = result['RAUSENUM'].replace([1, 2, 3], np.nan)
result['RAPRICE'] = result['RAPRICE'].replace(0, 1)
result_rep = result
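Although the text skips the mean/median/mode fill, it can be sketched with fillna as follows (the column and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical column containing missing values
df = pd.DataFrame({"RAPRICE": [2.0, np.nan, 4.0, np.nan]})

mean_fill = df["RAPRICE"].fillna(df["RAPRICE"].mean())      # fill with the mean
median_fill = df["RAPRICE"].fillna(df["RAPRICE"].median())  # fill with the median
mode_fill = df["RAPRICE"].fillna(df["RAPRICE"].mode()[0])   # fill with the mode (first if tied)
```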

Binning data into categories:

When the business needs to count value ranges and treat each range as one category, we bin the data that falls within each range.

# Define the cut points
cutPoint = [0, 1000, 3000, 6000, 10000]
# Note: cut the original numeric column; if you overwrite it with the
# binned result first, a second cut would fail on categorical data
print(pd.cut(result_rep['RAUSENUM'], cutPoint))
# Display with group labels
groupLable = ['1', '2', '3', '4']
result_rep['CutUseNum'] = pd.cut(result_rep['RAUSENUM'], cutPoint, labels=groupLable)
print(result_rep)

Cleaning with regular expressions:

Regular expressions are a highly standardized way to clean data. We can define a regular expression to match each row's value for a field, and discard rows that do not match.

# Delete rows whose BFNAME does not match
for row in result_noNull.index:
    pattern = result_noNull.loc[row].values[1]
    if not re.search(r'^[\u4e00-\u9fa5]{1,8}$', str(pattern)):
        result_noNull = result_noNull.drop(row)
# print("Deleted")
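A vectorized alternative to the row loop is a boolean mask built with str.match; this is usually faster on large DataFrames. The column and sample values below are hypothetical:

```python
import pandas as pd

# Hypothetical name column; keep only values of 1-8 Chinese characters.
df = pd.DataFrame({"BFNAME": ["张三", "abc", "李四"]})
mask = df["BFNAME"].str.match(r"^[\u4e00-\u9fa5]{1,8}$", na=False)
clean = df[mask]  # rows that match the pattern
```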



Date-based data statistics
Statistics by year, quarter, and month:

# Convert the column to datetime format
result['RABUSSDATE'] = pd.to_datetime(result['RABUSSDATE'])
# Use the date as the index
result1 = result.set_index('RABUSSDATE')

# Annual water consumption
result_year = result1.resample('AS').sum().to_period('A')
# Quarterly water consumption
result_quarter = result1.resample('Q').sum().to_period('Q')
# Monthly water consumption
result_month = result1.resample('M').sum().to_period('M')
print(result_year)


Statistics on two grouping keys:

# Two-key data statistics
result_gb = result['RAUSENUM'].groupby([result['RAREADMONTH'], result['RAADDR']]).sum()
print(result_gb)
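The two-key result is a Series with a MultiIndex; unstack can pivot one key into columns for a more readable table. A small sketch with invented sample data in the same column names:

```python
import pandas as pd

# Hypothetical readings: month, address, and usage
df = pd.DataFrame({
    "RAREADMONTH": ["2018-01", "2018-01", "2018-02"],
    "RAADDR": ["East", "West", "East"],
    "RAUSENUM": [10, 20, 30],
})
summed = df["RAUSENUM"].groupby([df["RAREADMONTH"], df["RAADDR"]]).sum()
table = summed.unstack("RAADDR")  # months as rows, addresses as columns
print(table)
```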


